
Add eval.yaml for verify-tests-fail-without-fix skill #34815

Open
PureWeen wants to merge 3 commits into main from skill-eval/verify-tests-fail-without-fix

Conversation


PureWeen (Member) commented Apr 4, 2026

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Summary

Adds eval.yaml for the verify-tests-fail-without-fix skill, enabling empirical A/B validation via skill-validator.

Context

Eval Design

  • 6 scenarios covering both verification modes, negative trigger, edge cases, regressions
  • 0 output_contains -- rubric-based behavioral assertions only (no vocabulary overfitting)
  • 14 output_not_contains -- anti-pattern guards for common mistakes
  • 1 expect_activation: false -- native spec field for negative trigger
  • Realistic timeouts (60s-900s depending on scenario complexity)
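Taken together, a scenario in this style might be sketched as follows. This is an illustrative sketch only: `expect_activation` and `output_not_contains` are the spec fields named above, but the scenario names, prompts, rubric wording, and exact YAML shape are assumptions, not the actual eval.yaml contents.

```
# Illustrative sketch -- not the real eval.yaml.
scenarios:
  - name: negative-trigger-docs-question      # hypothetical name
    prompt: "What does the verify-tests-fail-without-fix skill do?"
    expect_activation: false                  # native spec field for the negative trigger
    timeout: 60
  - name: happy-path-full-verification        # hypothetical name
    prompt: "Verify the new tests in this PR fail without the fix and pass with it."
    timeout: 900
    assertions:
      - type: rubric                          # behavioral check, no vocabulary overfitting
        value: "Runs both phases and reports fail-without-fix, then pass-with-fix."
      - type: output_not_contains             # anti-pattern guard
        value: "run git stash"
```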

Scenarios

  1. Happy path: full verification -- Tests two-phase workflow (fail without fix, pass with fix)
  2. Happy path: verify failure only -- Tests test-creation mode (no fix needed)
  3. Negative trigger -- Documentation question should not invoke verification
  4. Regression: semantic inversion -- Tests passing without fix = FAILED verification (not success!)
  5. Edge case: no test files -- PR without tests can't be verified
  6. Regression: no manual git commands -- Script handles file revert/restore, not raw git
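Scenario 4 is the subtle one: for this skill the pass/fail semantics are inverted, so tests passing without the fix means verification FAILED. A rubric-based check for it might look roughly like this (field names follow the design above; the prompt and rubric text are illustrative assumptions, not the actual file contents):

```
# Illustrative sketch -- not the real eval.yaml.
- name: regression-semantic-inversion         # hypothetical name
  prompt: "The tests still pass after reverting the fix. Report the verification result."
  assertions:
    - type: rubric
      value: "Reports verification FAILED because tests passed without the fix."
    - type: output_not_contains               # guard: pass-without-fix is not success
      value: "verification succeeded"
```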

- 6 scenarios covering both verification modes, negative trigger, edge cases
- Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting)
- Tests the critical 'pass without fix = FAILED verification' semantic inversion
- Production-aware prompt design with varied structure
- Follows eval best practices from try-fix evaluation cycle (PR #34807)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 4, 2026 23:20

github-actions bot commented Apr 4, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34815

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34815"


Copilot AI left a comment


Pull request overview

Adds an evaluation specification (tests/eval.yaml) for the internal verify-tests-fail-without-fix skill to enable scenario-based validation via skill-validator.

Changes:

  • Introduces a new eval.yaml with 6 scenarios covering full verification, failure-only verification, negative trigger behavior, and regressions/edge cases.

Comment on lines +36 to +37:

      - type: output_not_contains
        value: "no fix files"

Copilot AI Apr 4, 2026


In the “verify failure only” scenario, the output_not_contains: "no fix files" assertion conflicts with the script/docs: verify-tests-fail.ps1 explicitly describes this mode as “no fix files detected”. This assertion is likely to fail even when the skill behaves correctly; consider removing it or narrowing it to only prohibit requiring fix files (e.g., erroring because fix files are missing).

Suggested change (delete the assertion):

      - type: output_not_contains
        value: "no fix files"

Comment on lines +105 to +109:

        value: "git checkout HEAD"
      - type: output_not_contains
        value: "git restore"
      - type: output_not_contains
        value: "git stash"

Copilot AI Apr 4, 2026


These output_not_contains checks ban git checkout HEAD/git restore/git stash, but the skill’s own verify-tests-fail.ps1 uses git checkout ... internally (and even prints “git checkout HEAD” in some error messages). This can create false failures by rejecting correct script-driven behavior; consider rewriting the anti-pattern to target the agent’s manual instructions (e.g., “run git checkout/restore yourself”) rather than substrings that may legitimately appear in script output.

Suggested change (replace the bare substrings with action phrases):

Before:
        value: "git checkout HEAD"
      - type: output_not_contains
        value: "git restore"
      - type: output_not_contains
        value: "git stash"

After:
        value: "run git checkout HEAD"
      - type: output_not_contains
        value: "run git restore"
      - type: output_not_contains
        value: "run git stash"

eval.yaml:
- Add positive assertions to scenarios 1, 2, 5
- Fix scenario 2: remove broad RequireFullVerification/no-fix-files bans
- Narrow scenario 6 assertions to action phrases
- Add scenario 7: RequireFullVerification flag usage
- Add scenario 8: inverted semantics during execution

SKILL.md:
- Add Activation Guard section
- Add inverted pass/fail semantics warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SKILL.md:
- Add Workflow section with Steps 1-4 (Determine Mode, Construct Command, Interpret Results, Report)

eval.yaml:
- Add scenario 9: PR label automation
- Add scenario 10: script auto-detection of test files
- Now 10 scenarios total

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
