fix: clarify judge prompt to not penalize agent tool use by ScuttleBot · Pull Request #375 · pinchbench/skill

ScuttleBot · 2026-05-04T14:01:54Z

Closes #374

Problem

The judge prompt includes 'Do NOT use any tools' as an instruction to the grader LLM, but some models interpret this as a constraint on the evaluated agent and auto-fail tasks when they see tool calls in transcripts.

Fix

Two changes to _build_judge_prompt() in scripts/lib_grading.py:

1. Renamed the header from CRITICAL RULES → CRITICAL RULES FOR YOU, THE GRADER (not the agent being graded) to scope the no-tools instruction explicitly.
2. Added an IMPORTANT paragraph clarifying that the agent being graded may have used tools during execution and that this is normal; the no-tools rules apply only to the grader itself.
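
For reviewers who want to picture the resulting prompt layout, here is a minimal sketch of the two changes. It is not the actual code in scripts/lib_grading.py: the function name `build_judge_prompt_sketch`, its parameters, the section labels, and the exact wording of the IMPORTANT paragraph are all assumptions made for illustration. Only the header text and the 'Do NOT use any tools' rule are taken from this PR description.

```python
def build_judge_prompt_sketch(task: str, transcript: str, rubric: str) -> str:
    """Hypothetical stand-in for _build_judge_prompt(); the real signature may differ."""
    # Change 1: the header names the grader explicitly, so the rule below
    # cannot be read as a constraint on the agent under evaluation.
    grader_rules = (
        "CRITICAL RULES FOR YOU, THE GRADER (not the agent being graded):\n"
        "- Do NOT use any tools."
    )
    # Change 2: an explicit note that tool calls in the transcript are expected.
    # The wording here is assumed, not copied from the repository.
    tool_use_note = (
        "IMPORTANT: The agent being graded may have used tools during execution. "
        "This is normal. The no-tools rules above apply only to you, the grader."
    )
    # Section labels below are placeholders for whatever the real prompt uses.
    sections = [
        grader_rules,
        tool_use_note,
        "TASK:\n" + task,
        "TRANSCRIPT:\n" + transcript,
        "RUBRIC:\n" + rubric,
    ]
    return "\n\n".join(sections)
```

In this sketch the clarification sits directly after the grader rules and before the transcript, so a judge model reads the scoping before it encounters any tool calls. Whether the real prompt orders its sections this way is not stated in the PR.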

🤖 This PR was opened by @olearycrew's OpenClaw bot. Please review carefully!

The judge prompt's 'Do NOT use any tools' instruction was ambiguous — some judge models interpreted it as a constraint on the evaluated agent and auto-failed tasks when they saw tool calls in transcripts. Scoped the 'no tools' rule explicitly to the grader and added a clarification that agent tool usage is normal and expected. Closes #374

kilo-code-bot · 2026-05-04T14:02:44Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The change is well-targeted and the fix is clear. Renaming the header and adding the IMPORTANT clarification paragraph directly addresses the misinterpretation problem without any risk of regressions.

Files Reviewed (1 file)

scripts/lib_grading.py

Reviewed by claude-4.6-sonnet-20260217 · 70,797 tokens

olearycrew merged commit 99e2900 into main May 4, 2026
2 checks passed

olearycrew merged 1 commit into main from fix/374-judge-tool-use-penalty