
Add eval.yaml for verify-tests-fail-without-fix skill #34815

Open
PureWeen wants to merge 3 commits into main from skill-eval/verify-tests-fail-without-fix

Conversation


PureWeen (Member) commented Apr 4, 2026

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Summary

Adds eval.yaml for the verify-tests-fail-without-fix skill, enabling empirical A/B validation via skill-validator.

Context

Eval Design

  • 6 scenarios covering both verification modes, negative trigger, edge cases, regressions
  • 0 output_contains -- rubric-based behavioral assertions only (no vocabulary overfitting)
  • 14 output_not_contains -- anti-pattern guards for common mistakes
  • 1 expect_activation: false -- native spec field for negative trigger
  • Realistic timeouts (60s-900s depending on scenario complexity)
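Taken together, a scenario in this style might be sketched as follows. This is an illustrative sketch only: `expect_activation` and `output_not_contains` are the spec fields named above, but the scenario names, prompts, rubric wording, and exact YAML shape are assumptions, not the actual eval.yaml contents.

```
# Illustrative sketch -- not the real eval.yaml.
scenarios:
  - name: negative-trigger-docs-question      # hypothetical name
    prompt: "What does the verify-tests-fail-without-fix skill do?"
    expect_activation: false                  # native spec field for the negative trigger
    timeout: 60
  - name: happy-path-full-verification        # hypothetical name
    prompt: "Verify the new tests in this PR fail without the fix and pass with it."
    timeout: 900
    assertions:
      - type: rubric                          # behavioral check, no vocabulary overfitting
        value: "Runs both phases and reports fail-without-fix, then pass-with-fix."
      - type: output_not_contains             # anti-pattern guard
        value: "run git stash"
```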

Scenarios

  1. Happy path: full verification -- Tests two-phase workflow (fail without fix, pass with fix)
  2. Happy path: verify failure only -- Tests test-creation mode (no fix needed)
  3. Negative trigger -- Documentation question should not invoke verification
  4. Regression: semantic inversion -- Tests passing without fix = FAILED verification (not success!)
  5. Edge case: no test files -- PR without tests can't be verified
  6. Regression: no manual git commands -- Script handles file revert/restore, not raw git
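Scenario 4 is the subtle one: for this skill the pass/fail semantics are inverted, so tests passing without the fix means verification FAILED. A rubric-based check for it might look roughly like this (field names follow the design above; the prompt and rubric text are illustrative assumptions, not the actual file contents):

```
# Illustrative sketch -- not the real eval.yaml.
- name: regression-semantic-inversion         # hypothetical name
  prompt: "The tests still pass after reverting the fix. Report the verification result."
  assertions:
    - type: rubric
      value: "Reports verification FAILED because tests passed without the fix."
    - type: output_not_contains               # guard: pass-without-fix is not success
      value: "verification succeeded"
```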

- 6 scenarios covering both verification modes, negative trigger, edge cases
- Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting)
- Tests the critical 'pass without fix = FAILED verification' semantic inversion
- Production-aware prompt design with varied structure
- Follows eval best practices from try-fix evaluation cycle (PR #34807)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 4, 2026 23:20

github-actions bot commented Apr 4, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34815

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34815"


Copilot AI left a comment


Pull request overview

Adds an evaluation specification (tests/eval.yaml) for the internal verify-tests-fail-without-fix skill to enable scenario-based validation via skill-validator.

Changes:

  • Introduces a new eval.yaml with 6 scenarios covering full verification, failure-only verification, negative trigger behavior, and regressions/edge cases.

Comment on lines +36 to +37:

      - type: output_not_contains
        value: "no fix files"

Copilot AI Apr 4, 2026


In the “verify failure only” scenario, the output_not_contains: "no fix files" assertion conflicts with the script/docs: verify-tests-fail.ps1 explicitly describes this mode as “no fix files detected”. This assertion is likely to fail even when the skill behaves correctly; consider removing it or narrowing it to only prohibit requiring fix files (e.g., erroring because fix files are missing).

Suggested change (delete the assertion):

      - type: output_not_contains
        value: "no fix files"

Comment on lines +105 to +109:

        value: "git checkout HEAD"
      - type: output_not_contains
        value: "git restore"
      - type: output_not_contains
        value: "git stash"

Copilot AI Apr 4, 2026


These output_not_contains checks ban git checkout HEAD/git restore/git stash, but the skill’s own verify-tests-fail.ps1 uses git checkout ... internally (and even prints “git checkout HEAD” in some error messages). This can create false failures by rejecting correct script-driven behavior; consider rewriting the anti-pattern to target the agent’s manual instructions (e.g., “run git checkout/restore yourself”) rather than substrings that may legitimately appear in script output.

Suggested change (replace the bare substrings with action phrases):

Before:
        value: "git checkout HEAD"
      - type: output_not_contains
        value: "git restore"
      - type: output_not_contains
        value: "git stash"

After:
        value: "run git checkout HEAD"
      - type: output_not_contains
        value: "run git restore"
      - type: output_not_contains
        value: "run git stash"

eval.yaml:
- Add positive assertions to scenarios 1, 2, 5
- Fix scenario 2: remove broad RequireFullVerification/no-fix-files bans
- Narrow scenario 6 assertions to action phrases
- Add scenario 7: RequireFullVerification flag usage
- Add scenario 8: inverted semantics during execution

SKILL.md:
- Add Activation Guard section
- Add inverted pass/fail semantics warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SKILL.md:
- Add Workflow section with Steps 1-4 (Determine Mode, Construct Command, Interpret Results, Report)

eval.yaml:
- Add scenario 9: PR label automation
- Add scenario 10: script auto-detection of test files
- Now 10 scenarios total

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
