Skip to content

fix: snapshot workspace before parallel grading to prevent race condition#376

Closed
ScuttleBot wants to merge 2 commits intopinchbench:mainfrom
ScuttleBot:fix/365-parallel-grading-race
Closed

fix: snapshot workspace before parallel grading to prevent race condition#376
ScuttleBot wants to merge 2 commits intopinchbench:mainfrom
ScuttleBot:fix/365-parallel-grading-race

Conversation

@ScuttleBot
Copy link
Copy Markdown
Contributor

Problem

When parallel grading is enabled (ThreadPoolExecutor, introduced in #351), grade_task() runs in a background thread while the main thread immediately moves on to the next task and calls prepare_task_workspace(), which rebuilds the shared agent workspace directory. The background grader then reads the next task's workspace files instead of the current one's, causing:

  • Transcript mismatches (task N's transcript archived as task N+1's)
  • Incorrect automated scores (task_sanity scoring 0 because its workspace state is gone)

Fix

Before submitting grade_task to the ThreadPoolExecutor, snapshot the workspace directory into an isolated temp directory via shutil.copytree. A shallow copy of execution_result with the snapshot path replaces the live workspace reference, so the background grader reads stable files regardless of what the main thread does next.

Cleanup happens in _wait_for_pending_grade() after grading completes (shutil.rmtree).

The synchronous grading path is unchanged.

Changes

  • Added shutil and tempfile imports
  • Added _snapshot_workspace_for_grading logic inline in the parallel submission block
  • Added pending_grade_snapshot_dir tracking variable
  • Added cleanup in _wait_for_pending_grade()

Closes #365

github-actions Bot and others added 2 commits May 4, 2026 16:08
…tion

When parallel grading is enabled, grade_task() runs in a background thread
while the main thread prepares the next task's workspace. The background
grader then reads the rebuilt workspace instead of the completed task's
files, causing transcript mismatches and incorrect automated scores.

Fix: before submitting grade_task to the ThreadPoolExecutor, deep-copy the
workspace directory into an isolated temp directory via shutil.copytree.
A shallow copy of execution_result with the snapshot path is passed to the
background grader. The snapshot is cleaned up in _wait_for_pending_grade()
after grading completes.

The synchronous grading path is unchanged.

Closes pinchbench#365
Comment thread scripts/benchmark.py
workspace_path = result.get("workspace", "")
if workspace_path and os.path.isdir(workspace_path):
snapshot_dir = tempfile.mkdtemp(prefix="pinchbench_grade_")
shutil.copytree(workspace_path, os.path.join(snapshot_dir, "ws"), dirs_exist_ok=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: shutil.copytree is not wrapped in a try/except here. If it raises (e.g. disk full, broken symlink, permission error), the exception propagates uncaught through the task loop and crashes the entire benchmark run — losing all results collected so far. The temp dir created on the line above would also be leaked.

Consider wrapping the snapshot block in a try/except that falls back to using the original result (accepting the small race-condition risk) rather than aborting the run:

try:
    snapshot_dir = tempfile.mkdtemp(prefix="pinchbench_grade_")
    shutil.copytree(workspace_path, os.path.join(snapshot_dir, "ws"), dirs_exist_ok=True)
    snapshot_workspace = os.path.join(snapshot_dir, "ws")
    grade_result_copy = dict(result)
    grade_result_copy["workspace"] = snapshot_workspace
    grade_kwargs["execution_result"] = grade_result_copy
    pending_grade_snapshot_dir = snapshot_dir
except Exception as exc:
    logger.warning("Workspace snapshot failed, grading with live path: %s", exc)
    pending_grade_snapshot_dir = None

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 4, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity Count
WARNING 1
Issue Details (click to expand)

WARNING

File Line Issue
scripts/benchmark.py 976 shutil.copytree unhandled exception can crash the entire benchmark run and leak the temp dir
Files Reviewed (2 files)
  • README.md — no issues (task count update)
  • scripts/benchmark.py — 1 issue

The race-condition fix is well-reasoned and correctly scoped. The snapshot + cleanup pattern is clean. The only concern is the missing error guard around shutil.copytree — a disk-full or permission error would propagate uncaught and abort the whole run. See the inline comment for a suggested fallback.

Fix these issues in Kilo Cloud


Reviewed by claude-4.6-sonnet-20260217 · 159,069 tokens

@ScuttleBot
Copy link
Copy Markdown
Contributor Author

Closing in favor of #372 which has a cleaner implementation (dedicated helper function, better documentation, no unrelated README changes). Both PRs solved the same race condition the same way.

@ScuttleBot ScuttleBot closed this May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] issue with transcript mismatch and grading

1 participant