Add support for Evals #384

@Darktex

Description

Evals are a big part of the value that standardized environments provide. They allow others to fully reproduce results for the same model under the same benchmark. This is very important for everyone developing models, because so many tiny variables make it hard to reproduce the score reported for a model in a paper. Everything contributes:

  1. The versions of libraries. For example, the popular antlr package at some point changed how it matched ASTs, which in turn affected the scores of benchmarks that relied on it, like MATH.
  2. The actual implementation of a benchmark when it doesn't have reference "source of truth" code. The handling of corner cases matters.
  3. Inference parameters. Even batch size makes inference non-deterministic in the presence of low-precision arithmetic (see this blog). We do not solve this problem since we don't own your inference, but we can make everything else more reproducible.

Note that we should NOT set out to build our own eval harness. There are plenty of capable harnesses already. We should simply integrate them into OpenEnv.

Finally, these harnesses will bring dependencies. We should only bring them in simulation mode, leaving prod mode unaffected.
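One way to keep prod mode unaffected is to gate the harness import behind the environment's mode. A minimal sketch, assuming `lm_eval` (lm-evaluation-harness) as the integrated harness and a hypothetical `openenv[evals]` extra for installing it; none of these names are settled in this issue:

```python
import importlib.util


def load_eval_harness(mode: str):
    """Return the eval-harness module in simulation mode, None otherwise.

    Hypothetical sketch: in prod mode the harness is never imported, so its
    dependencies never need to be installed there.
    """
    if mode != "simulation":
        return None  # prod mode: harness dependencies stay out of the picture
    if importlib.util.find_spec("lm_eval") is None:
        # Assumed extras name; the actual packaging is up for discussion.
        raise ImportError(
            "Simulation mode requires an eval harness; "
            "install it with: pip install openenv[evals]"
        )
    import lm_eval
    return lm_eval
```

With this pattern, the harness dependencies can live in an optional extras group, so `pip install openenv` stays lean while `pip install openenv[evals]` pulls in everything simulation mode needs.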
