Add eval.yaml for verify-tests-fail-without-fix skill#34815
- 6 scenarios covering both verification modes, negative trigger, edge cases
- Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting)
- Tests the critical 'pass without fix = FAILED verification' semantic inversion
- Production-aware prompt design with varied structure
- Follows eval best practices from try-fix evaluation cycle (PR #34807)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34815
Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34815"
Pull request overview
Adds an evaluation specification (tests/eval.yaml) for the internal verify-tests-fail-without-fix skill to enable scenario-based validation via skill-validator.
Changes:
- Introduces a new eval.yaml with 6 scenarios covering full verification, failure-only verification, negative trigger behavior, and regressions/edge cases.
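The two verification modes the scenarios cover can be sketched as a small decision function. This is only an illustration of the 'pass without fix = FAILED verification' inversion, not the skill's or the script's actual code: the `interpret` helper and `Verdict` enum are hypothetical names, and the sketch assumes that full verification also expects the tests to pass once the fix is applied.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    VERIFIED = "verified"
    FAILED = "failed"


def interpret(passed_without_fix: bool, passed_with_fix: Optional[bool] = None) -> Verdict:
    """Hypothetical helper mapping raw test outcomes to a verification verdict.

    passed_with_fix=None models failure-only mode (no with-fix run available).
    """
    # Inverted semantics: tests that pass without the fix do not catch the bug.
    if passed_without_fix:
        return Verdict.FAILED
    # Assumption: full verification also expects the tests to pass with the fix.
    if passed_with_fix is False:
        return Verdict.FAILED
    return Verdict.VERIFIED
```

A green test run without the fix maps to `Verdict.FAILED` in both modes, which is the case a reader is most likely to misreport as a success.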
- type: output_not_contains
  value: "no fix files"
In the “verify failure only” scenario, the output_not_contains: "no fix files" assertion conflicts with the script/docs: verify-tests-fail.ps1 explicitly describes this mode as “no fix files detected”. This assertion is likely to fail even when the skill behaves correctly; consider removing it or narrowing it to only prohibit requiring fix files (e.g., erroring because fix files are missing).
Suggested change (remove):
- type: output_not_contains
  value: "no fix files"
  value: "git checkout HEAD"
- type: output_not_contains
  value: "git restore"
- type: output_not_contains
  value: "git stash"
These output_not_contains checks ban git checkout HEAD/git restore/git stash, but the skill’s own verify-tests-fail.ps1 uses git checkout ... internally (and even prints “git checkout HEAD” in some error messages). This can create false failures by rejecting correct script-driven behavior; consider rewriting the anti-pattern to target the agent’s manual instructions (e.g., “run git checkout/restore yourself”) rather than substrings that may legitimately appear in script output.
Suggested change, replacing:
  value: "git checkout HEAD"
- type: output_not_contains
  value: "git restore"
- type: output_not_contains
  value: "git stash"
with:
  value: "run git checkout HEAD"
- type: output_not_contains
  value: "run git restore"
- type: output_not_contains
  value: "run git stash"
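The false-failure risk described in this review comment can be illustrated with a plain substring check. This is a minimal sketch that assumes output_not_contains behaves like a case-sensitive substring test; skill-validator's real matching rules are not confirmed here, and both sample transcripts are hypothetical.

```python
def output_not_contains(output: str, banned: str) -> bool:
    """Return True when the assertion passes, i.e. the banned text is absent."""
    return banned not in output


# Hypothetical transcript: the agent relays an error message printed by the script.
script_relay = "Revert failed; restore the files with git checkout HEAD -- <path>"
# Hypothetical transcript: the agent tells the user to revert files manually.
manual_advice = "First run git checkout HEAD to drop the fix, then rerun the tests."

# The broad phrase rejects the legitimate script-driven output.
broad_on_script = output_not_contains(script_relay, "git checkout HEAD")
# The narrowed phrase accepts it while still catching the manual instruction.
narrow_on_script = output_not_contains(script_relay, "run git checkout HEAD")
narrow_on_manual = output_not_contains(manual_advice, "run git checkout HEAD")
```

The narrowed phrase is still vocabulary-sensitive: an agent that writes "execute git checkout HEAD" would slip past it, so it reduces false failures rather than eliminating misses.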
eval.yaml:
- Add positive assertions to scenarios 1, 2, 5
- Fix scenario 2: remove broad RequireFullVerification/no-fix-files bans
- Narrow scenario 6 assertions to action phrases
- Add scenario 7: RequireFullVerification flag usage
- Add scenario 8: inverted semantics during execution

SKILL.md:
- Add Activation Guard section
- Add inverted pass/fail semantics warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SKILL.md:
- Add Workflow section with Steps 1-4 (Determine Mode, Construct Command, Interpret Results, Report)

eval.yaml:
- Add scenario 9: PR label automation
- Add scenario 10: script auto-detection of test files
- Now 10 scenarios total

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Note
Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!
Summary
Adds eval.yaml for the verify-tests-fail-without-fix skill, enabling empirical A/B validation via skill-validator.
Context
pr-review to verify tests catch bugs
Eval Design
- output_contains -- rubric-based behavioral assertions only (no vocabulary overfitting)
- output_not_contains -- anti-pattern guards for common mistakes
- expect_activation: false -- native spec field for negative trigger
Scenarios