-
Notifications
You must be signed in to change notification settings - Fork 480
Fix flaky RecoveryTimeout e2e test #8109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flaky RecoveryTimeout e2e test #8109
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses a flaky end-to-end test that validates workload eviction and requeue behavior with a tiny RecoveryTimeout (10ms). The flakiness occurred because multiple evictions could happen rapidly before pods became Ready, causing premature deactivation with the original BackoffLimitCount: 1.
Key changes:
- Increased
BackoffLimitCountfrom 1 to 2 to allow multiple eviction cycles before deactivation - Updated test assertions to use
>=1comparisons instead of exact counts to accommodate multiple evictions - Replaced struct comparison with individual field assertions for better test clarity
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Allow at least one requeue cycle before deactivation so the test | ||
| // can verify the requeue behavior. |
Copilot
AI
Dec 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment states "Allow at least one requeue cycle before deactivation" but with BackoffLimitCount: 2, the workload would be deactivated when count+1 > BackoffLimitCount, i.e., when count+1 > 2, which means deactivation occurs when count reaches 2 (since 2+1 > 2). This allows for 2 evictions before deactivation, not just "at least one requeue cycle."
Consider clarifying the comment to be more precise:
// Allow up to 2 evictions before deactivation so the test
// can verify the requeue behavior even if multiple evictions occur.| // Allow at least one requeue cycle before deactivation so the test | |
| // can verify the requeue behavior. | |
| // Allow up to 2 evictions before deactivation so the test | |
| // can verify the requeue behavior even if multiple evictions occur. |
|
/retest |
1 similar comment
|
/retest |
c679f69 to
63147a9
Compare
Signed-off-by: Sohan Kunkerkar <[email protected]>
63147a9 to
ddb2f1f
Compare
|
/retest |
mimowo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this makes sense 👍
/lgtm
/approve
/cherrypick release-0.15
/cherrypick release-0.14
|
LGTM label has been added. Git tree hash: f7fe0a7671efd40cc2b3066d4abdfb5f0854d434
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, sohankunkerkar The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I think this test uses
RecoveryTimeout: 10msto validate eviction behavior. With such a small timeout, multiple evictions can happen quickly. After a pod failure, the workload gets re-admitted in ~50ms, but since RecoveryTimeout is only 10ms, if the pods aren’t Ready yet, a second eviction is triggered before they become Ready and before RequeueState is cleared.With
BackoffLimitCount: 1, the first eviction increments the count to 1. When the second eviction occurs, the condition count+1 >BackoffLimitCount(i.e., 1+1 > 1) evaluates to true, which causes a deactivation and clearsRequeueStateentirely.Increasing
RecoveryTimeoutwould defeat the purpose of the test, which is specifically testing behavior with a tiny RecoveryTimeout.What type of PR is this?
/kind flake
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #5106
Special notes for your reviewer:
Does this PR introduce a user-facing change?