Evals are a big part of the value that standardized environments provide: they allow others to fully reproduce results for the same model under the same benchmark. This matters for everyone developing models, because so many tiny variables make it hard to reproduce the score you read in a paper for the same model. Everything contributes:
- The version of libraries. For example, the popular `antlr` package at some point changed how it matched ASTs, which in turn affected the scores of benchmarks that relied on it, like MATH.
- The actual implementation of a benchmark when it doesn't have reference "source of truth" code. The handling of corner cases matters.
- Inference parameters. Even batch size makes inference non-deterministic in the presence of low-precision arithmetic (see this blog). We do not solve this problem since we don't own your inference, but we can make everything else more reproducible.
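One cheap mitigation for the library-version variable above is to record the exact versions of key packages alongside each benchmark run, so a score can always be traced back to the dependency set that produced it. A minimal sketch using only the standard library (`snapshot_versions` is an illustrative helper, not existing OpenEnv API):

```python
from importlib import metadata

def snapshot_versions(packages):
    """Record the installed versions of the given packages, so an
    eval result can be reproduced with the same dependency set."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed
    return versions

# Example: attach this snapshot to the benchmark's result metadata,
# e.g. snapshot_versions(["antlr4-python3-runtime", "transformers"])
```

Storing the snapshot next to the score makes version-induced drift (like the `antlr` AST-matching change) diagnosable after the fact.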
Note that we should NOT set out to build our own eval harness. There are plenty of capable harnesses already; we should simply integrate them into OpenEnv.
Finally, these harnesses will bring dependencies of their own. We should only pull them in in simulation mode, leaving prod mode unaffected.
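Keeping harness dependencies out of prod mode can be done with a lazy import gated on the mode flag, so prod deployments never even attempt to load the extra packages. A sketch under assumed names (`load_harness`, the `"simulation"` mode string, and the `openenv[eval]` extras name are all hypothetical; `lm_eval` stands in for whatever harness gets integrated):

```python
def load_harness(mode: str):
    """Import an eval harness only in simulation mode, so prod
    deployments never depend on (or pay import cost for) it."""
    if mode != "simulation":
        return None  # prod mode: harness deps are never imported
    try:
        import lm_eval  # heavy optional dependency, loaded lazily
    except ImportError as exc:
        raise RuntimeError(
            "simulation mode requires the eval extras, e.g. "
            "pip install 'openenv[eval]'  # hypothetical extras name"
        ) from exc
    return lm_eval
```

The deliberate choice here is that a missing harness fails loudly in simulation mode with an actionable message, while prod mode cannot fail this way at all because the import is never reached.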