Evals are a big part of the value that standardized environments provide: they allow others to fully reproduce results for the same model under the same benchmark. This matters for everyone developing models, because so many tiny variables make it hard to reproduce the score you read in a paper for the same model. Everything contributes:
- The version of libraries. For example, the popular `antlr` package at some point changed how it matched ASTs, which in turn affected the scores of benchmarks that relied on it, like MATH.
- The actual implementation of a benchmark when it doesn't have reference "source of truth" code. The handling of corner cases matters.
- Inference parameters. Even batch size makes inference non-deterministic in the presence of low-precision arithmetic (see this blog). We do not solve this problem since we don't own your inference, but we can make everything else more reproducible.
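One cheap mitigation for the library-version variable above is to record the exact versions of key packages alongside each benchmark run, so a score can always be traced back to the dependency set that produced it. A minimal sketch using only the standard library (`snapshot_versions` is an illustrative helper, not existing OpenEnv API):

```python
from importlib import metadata

def snapshot_versions(packages):
    """Record the installed versions of the given packages, so an
    eval result can be reproduced with the same dependency set."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed
    return versions

# Example: attach this snapshot to the benchmark's result metadata,
# e.g. snapshot_versions(["antlr4-python3-runtime", "transformers"])
```

Storing the snapshot next to the score makes version-induced drift (like the `antlr` AST-matching change) diagnosable after the fact.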
Note that we should NOT set out to build our own eval harness. There are plenty of capable harnesses already; we should simply integrate them into OpenEnv.
Finally, these harnesses will bring dependencies of their own. We should only pull them in in simulation mode, leaving prod mode unaffected.
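Keeping harness dependencies out of prod mode can be done with a lazy import gated on the mode flag, so prod deployments never even attempt to load the extra packages. A sketch under assumed names (`load_harness`, the `"simulation"` mode string, and the `openenv[eval]` extras name are all hypothetical; `lm_eval` stands in for whatever harness gets integrated):

```python
def load_harness(mode: str):
    """Import an eval harness only in simulation mode, so prod
    deployments never depend on (or pay import cost for) it."""
    if mode != "simulation":
        return None  # prod mode: harness deps are never imported
    try:
        import lm_eval  # heavy optional dependency, loaded lazily
    except ImportError as exc:
        raise RuntimeError(
            "simulation mode requires the eval extras, e.g. "
            "pip install 'openenv[eval]'  # hypothetical extras name"
        ) from exc
    return lm_eval
```

The deliberate choice here is that a missing harness fails loudly in simulation mode with an actionable message, while prod mode cannot fail this way at all because the import is never reached.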