fix: clarify judge prompt to not penalize agent tool use #375

Merged
olearycrew merged 1 commit into main from fix/374-judge-tool-use-penalty
May 4, 2026

@ScuttleBot
Contributor

Closes #374

Problem

The judge prompt includes 'Do NOT use any tools' as an instruction to the grader LLM, but some models interpret this as a constraint on the evaluated agent and auto-fail tasks when they see tool calls in transcripts.

Fix

Two changes to _build_judge_prompt() in scripts/lib_grading.py:

  1. Renamed the header from CRITICAL RULES to CRITICAL RULES FOR YOU, THE GRADER (not the agent being graded) to scope the no-tools instruction explicitly to the grader.

  2. Added an IMPORTANT paragraph clarifying that the agent being graded may have used tools during execution and that this is normal — the no-tools rules apply only to the grader itself.
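
For readers without the diff open, the two changes above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code: the real `_build_judge_prompt()` in `scripts/lib_grading.py` is not shown in this conversation, so the function name `build_judge_prompt`, its parameters, the section layout, and the exact wording of the IMPORTANT paragraph are all assumptions. Only the header text and the 'Do NOT use any tools' rule come from the PR description.

```python
# Hypothetical sketch of a grader prompt with a scoped no-tools rule.
# The real _build_judge_prompt() is not shown in this PR conversation;
# names, parameters, and most wording below are assumed for illustration.

# Header text as given in the PR description.
GRADER_RULES_HEADER = (
    "CRITICAL RULES FOR YOU, THE GRADER (not the agent being graded)"
)

# Assumed wording for the added IMPORTANT paragraph.
AGENT_TOOL_USE_NOTE = (
    "IMPORTANT: The agent being graded may have used tools during execution. "
    "This is normal and must not be penalized. The no-tools rules above apply "
    "only to you, the grader."
)


def build_judge_prompt(task_description: str, transcript: str) -> str:
    """Assemble a grader prompt whose no-tools rule is scoped to the grader."""
    sections = [
        "You are grading an agent's performance on the task below.",
        f"## Task\n{task_description}",
        f"## Transcript\n{transcript}",
        # The scoped header makes clear who the rule applies to.
        f"## {GRADER_RULES_HEADER}\n- Do NOT use any tools.",
        # The clarification follows the rules so it is read in that context.
        AGENT_TOOL_USE_NOTE,
    ]
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_judge_prompt("Schedule a meeting", "[tool_call] calendar.create(...)"))
```

Placing the clarification directly after the rules keeps the scoping statement next to the instruction it qualifies, which is likely where a judge model would otherwise misread it.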


🤖 This PR was opened by @olearycrew's OpenClaw bot. Please review carefully!

The judge prompt's 'Do NOT use any tools' instruction was ambiguous —
some judge models interpreted it as a constraint on the evaluated agent
and auto-failed tasks when they saw tool calls in transcripts.

Scoped the 'no tools' rule explicitly to the grader and added a
clarification that agent tool usage is normal and expected.

Closes #374
@kilo-code-bot
Contributor

kilo-code-bot Bot commented May 4, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The change is well-targeted and the fix is clear. Renaming the header and adding the IMPORTANT clarification paragraph directly addresses the misinterpretation problem without any risk of regressions.

Files Reviewed (1 file)
  • scripts/lib_grading.py

Reviewed by claude-4.6-sonnet-20260217 · 70,797 tokens

@olearycrew olearycrew merged commit 99e2900 into main May 4, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

Judge may be incorrectly penalizing tool use in meeting tasks

2 participants