Test implementation affects pass@1 in polyglot-benchmark #11

### Issue

In the Aider Python track’s ‘grep’ exercise, I’ve found an issue with how the test suite handles the built-in open() function. The tests replace open() with the following mock:

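```Python
import io


# Excerpt from the test suite. FILE_TEXT (defined elsewhere in the tests)
# maps each expected file name to its text content.
def open_mock(fname, *args, **kwargs):
    try:
        return io.StringIO(FILE_TEXT[fname])
    except KeyError:
        # Any file name the tests do not know about is rejected.
        raise RuntimeError(
            "Expected one of {0!r}: got {1!r}".format(list(FILE_TEXT.keys()), fname)
        )
```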
Because this mock replaces only open(), the solver’s code has to call open() directly and handle the resulting error (RuntimeError from the mock, or FileNotFoundError in a real scenario). It cannot first check for the file with os.path.exists(): the test files live only in FILE_TEXT rather than on disk, so the existence check reports them as missing.

However, the problem specification for the exercise does not explicitly describe this constraint. From the perspective of standard Python programming practices and a literal interpretation of the problem instructions, using os.path.exists() to verify a file’s presence before attempting to open it is a perfectly reasonable and common approach.

When LLMs or agents are evaluated with the Aider benchmark, this mismatch between the test’s implicit requirement and the problem’s lack of explicit guidance causes one-shot (pass@1) failures on this exercise: some models default to the more intuitive os.path.exists() strategy, which then fails the mock-based test.
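The failure mode can be reproduced in isolation. The sketch below is illustrative only: the file name, the sample text, and the two `read_lines_*` helpers are hypothetical stand-ins for solver code, and patching `builtins.open` is an assumption about how the mock is installed (the real test suite may patch a different target).

```python
import builtins
import io
import os
from unittest import mock

# Hypothetical stand-in for the test suite's FILE_TEXT mapping.
# The name is chosen so that no such file exists on disk.
DEMO_NAME = "demo_iliad_not_on_disk.txt"
FILE_TEXT = {DEMO_NAME: "Achilles sing, O Goddess!\n"}


def open_mock(fname, *args, **kwargs):
    # Same behavior as the mock quoted in this issue.
    try:
        return io.StringIO(FILE_TEXT[fname])
    except KeyError:
        raise RuntimeError(
            "Expected one of {0!r}: got {1!r}".format(list(FILE_TEXT.keys()), fname)
        )


def read_lines_check_first(fname):
    # Hypothetical solver style: check existence, then open.
    if not os.path.exists(fname):
        return []  # the file is treated as missing
    with open(fname) as handle:
        return handle.readlines()


def read_lines_open_directly(fname):
    # Hypothetical solver style: call open() without a prior check.
    with open(fname) as handle:
        return handle.readlines()


# Only open() is replaced; os.path.exists() still consults the real disk.
with mock.patch.object(builtins, "open", open_mock):
    checked = read_lines_check_first(DEMO_NAME)
    direct = read_lines_open_directly(DEMO_NAME)
```

Here `checked` ends up empty because the existence check never lets the code reach the mocked open(), while `direct` contains the mocked file content. Both helpers would behave identically against real files.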

### Proposed Solutions

To address this ambiguity and improve the fairness of the benchmark, I propose the following solutions:

- **Direct File Provision:** Similar to the JavaScript track, consider providing actual .txt files in the testing environment rather than fully mocking the open() function with io.StringIO.
- **Comprehensive File Operation Mocking:** Implement a more extensive mock for file system operations that also covers functions like os.path.exists(), ensuring it responds predictably to simulate file existence or non-existence.
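Both options can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a patch for the actual test file: `exists_mock`, `DEMO_NAME`, the sample text, and the `read_lines_check_first` helper are hypothetical names, and the patch targets (`builtins.open`, `os.path.exists`) are assumptions about how the tests would install the mocks.

```python
import builtins
import io
import os
import tempfile
from unittest import mock

# Hypothetical stand-in for the test suite's FILE_TEXT mapping.
DEMO_NAME = "demo_iliad_not_on_disk.txt"
FILE_TEXT = {DEMO_NAME: "Achilles sing, O Goddess!\n"}


def open_mock(fname, *args, **kwargs):
    # Same behavior as the mock quoted in this issue.
    try:
        return io.StringIO(FILE_TEXT[fname])
    except KeyError:
        raise RuntimeError(
            "Expected one of {0!r}: got {1!r}".format(list(FILE_TEXT.keys()), fname)
        )


def exists_mock(path):
    # Hypothetical companion mock: a file "exists" exactly when
    # open_mock can serve it.
    return path in FILE_TEXT


def read_lines_check_first(fname):
    # Hypothetical solver code that checks existence before opening.
    if not os.path.exists(fname):
        return []
    with open(fname) as handle:
        return handle.readlines()


# Option: comprehensive mocking. open() and os.path.exists() agree.
with mock.patch.object(builtins, "open", open_mock):
    with mock.patch.object(os.path, "exists", exists_mock):
        mocked_present = read_lines_check_first(DEMO_NAME)
        mocked_missing = read_lines_check_first("missing.txt")

# Option: direct file provision. Write real files, no mocks at all.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in FILE_TEXT.items():
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write(text)
    real_present = read_lines_check_first(os.path.join(tmp_dir, DEMO_NAME))
```

With either option, the check-then-open style returns the file content for known files. Real files have the advantage of also covering other calls a solver might reasonably use (for example os.path.isfile() or pathlib), whereas each additional function would need its own mock under the second option.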

