Test implementation affects pass@1 in polyglot-benchmark #11

@ppx123-web

Issue

In the ‘grep’ exercise of the Aider Python track, I’ve found an issue with how the test suite handles the built-in open() function. The test suite replaces open() with a mock:

def open_mock(fname, *args, **kwargs):
    try:
        return io.StringIO(FILE_TEXT[fname])
    except KeyError:
        raise RuntimeError(
            "Expected one of {0!r}: got {1!r}".format(list(FILE_TEXT.keys()), fname)
        )

Because this mock replaces only open(), the test files exist solely in the in-memory FILE_TEXT mapping and not on disk. A solution therefore has to call open() directly (and handle the resulting RuntimeError from the mock, or FileNotFoundError in a real scenario) rather than first checking for the file with os.path.exists(), which consults the real file system and reports the file as missing.
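The effect can be reproduced in a few lines. The following is a minimal sketch, not the actual test harness: the file name, the FILE_TEXT contents, the two count_lines_* helpers, and the choice to patch builtins.open are all assumptions made for illustration. Only open_mock is taken from the test suite.

```python
import io
import os.path
from unittest import mock

# Hypothetical stand-in for the test suite's in-memory file table.
FILE_TEXT = {"in-memory-only.txt": "first line\nsecond line\n"}


def open_mock(fname, *args, **kwargs):
    # Same mock as in the test suite: serve file contents from FILE_TEXT.
    try:
        return io.StringIO(FILE_TEXT[fname])
    except KeyError:
        raise RuntimeError(
            "Expected one of {0!r}: got {1!r}".format(list(FILE_TEXT.keys()), fname)
        )


def count_lines_direct(fname):
    """Hypothetical solution that opens the file directly (reaches the mock)."""
    with open(fname) as f:
        return sum(1 for _ in f)


def count_lines_checked(fname):
    """Hypothetical solution that checks the real file system first."""
    if not os.path.exists(fname):
        # The file lives only in FILE_TEXT, so this branch is taken.
        return 0
    with open(fname) as f:
        return sum(1 for _ in f)


# Assumption: the harness is approximated by patching builtins.open.
with mock.patch("builtins.open", side_effect=open_mock):
    direct = count_lines_direct("in-memory-only.txt")
    checked = count_lines_checked("in-memory-only.txt")

print(direct, checked)
```

Both helpers are reasonable readings of the exercise, yet only the first one ever reaches the mocked open(); the second returns early because os.path.exists() is not mocked and looks at the real disk.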

However, the problem specification for the exercise does not state this constraint. Under standard Python practice and a literal reading of the instructions, using os.path.exists() to verify a file’s presence before opening it is a perfectly reasonable and common approach.

When LLMs or agents are evaluated with the Aider benchmark, this mismatch between the test’s implicit requirement and the problem’s lack of explicit guidance causes one-shot (pass@1) failures on this exercise: some models default to the more intuitive os.path.exists() strategy, which then fails the mock-based test.

Proposed Solutions
To address this ambiguity and improve the fairness of the benchmark, I propose the following solutions:

1. Direct File Provision: Similar to the JavaScript track, consider providing actual .txt files in the testing environment rather than fully mocking the open() function with io.StringIO.
2. Comprehensive File Operation Mocking: Implement a more extensive mock for file system operations that also covers functions like os.path.exists(), ensuring it responds predictably to simulate file existence or non-existence.
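Both proposals can be sketched briefly. This is an illustration under stated assumptions, not a patch for the real test file: exists_mock, read_if_present, the fixture contents, and the patch targets builtins.open and os.path.exists are hypothetical names and choices introduced here.

```python
import io
import os
import os.path
import tempfile
from unittest import mock

# Hypothetical fixture standing in for the test suite's FILE_TEXT.
FILE_TEXT = {"in-memory-only.txt": "first line\nsecond line\n"}


def open_mock(fname, *args, **kwargs):
    # Same behaviour as the existing mock: serve contents from FILE_TEXT.
    try:
        return io.StringIO(FILE_TEXT[fname])
    except KeyError:
        raise RuntimeError(
            "Expected one of {0!r}: got {1!r}".format(list(FILE_TEXT.keys()), fname)
        )


def exists_mock(path):
    """Hypothetical companion mock: existence follows the same table."""
    return path in FILE_TEXT


def read_if_present(fname):
    """Hypothetical solution written in the check-then-open style."""
    if not os.path.exists(fname):
        return None
    with open(fname) as f:
        return f.read()


# Comprehensive mocking: patch os.path.exists alongside open().
with mock.patch("builtins.open", side_effect=open_mock):
    with mock.patch("os.path.exists", side_effect=exists_mock):
        mocked_result = read_if_present("in-memory-only.txt")
        mocked_missing = read_if_present("no-such-file.txt")

# Direct file provision: write real files to a temporary directory, no mocks.
with tempfile.TemporaryDirectory() as tmp:
    real_path = os.path.join(tmp, "in-memory-only.txt")
    with open(real_path, "w") as f:
        f.write(FILE_TEXT["in-memory-only.txt"])
    real_result = read_if_present(real_path)

print(mocked_result == real_result, mocked_missing)
```

Note that the mocking variant covers only os.path.exists(); a solution that uses os.path.isfile() or pathlib would need further patches, which is one argument for providing real files instead.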
