
@sohankunkerkar
Member

I think this test uses RecoveryTimeout: 10ms to validate eviction behavior. With such a small timeout, evictions can happen in quick succession. After a pod failure, the workload is re-admitted in ~50ms, but if its pods are not Ready within the 10ms RecoveryTimeout, a second eviction is triggered before RequeueState is cleared.

With BackoffLimitCount: 1, the first eviction increments the count to 1. When the second eviction occurs, the condition count+1 > BackoffLimitCount (i.e., 1+1 > 1) evaluates to true, which causes a deactivation and clears RequeueState entirely.
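The arithmetic above can be traced with a small sketch. It assumes only the condition quoted in this description (count+1 > BackoffLimitCount); `shouldDeactivate` and `deactivatingEviction` are hypothetical helper names for illustration, not Kueue functions.

```go
package main

import "fmt"

// shouldDeactivate restates the condition quoted above: deactivate when
// one more requeue would exceed the limit. Hypothetical helper, not Kueue code.
func shouldDeactivate(requeueCount, backoffLimitCount int32) bool {
	return requeueCount+1 > backoffLimitCount
}

// deactivatingEviction returns the 1-based eviction at which the workload
// would be deactivated, assuming the count is never reset between evictions
// (the situation described for the tiny RecoveryTimeout).
func deactivatingEviction(backoffLimitCount int32) int {
	var count int32
	for eviction := 1; ; eviction++ {
		if shouldDeactivate(count, backoffLimitCount) {
			return eviction
		}
		count++ // the eviction is requeued and the count is incremented
	}
}

func main() {
	for _, limit := range []int32{1, 2} {
		fmt.Printf("BackoffLimitCount=%d: deactivated on eviction %d\n",
			limit, deactivatingEviction(limit))
	}
}
```

Under these assumptions, a limit of 1 deactivates the workload on the second back-to-back eviction, while a limit of 2 tolerates a second eviction and deactivates only on a third.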

Increasing RecoveryTimeout would defeat the purpose of the test, which is specifically testing behavior with a tiny RecoveryTimeout.

What type of PR is this?

/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #5106

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

Copilot AI review requested due to automatic review settings December 6, 2025 21:17
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/flake Categorizes issue or PR as related to a flaky test. labels Dec 6, 2025
@netlify
netlify bot commented Dec 6, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | ddb2f1f |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6935d5fab1550a0008d3c3ef |

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 6, 2025
Copilot AI left a comment

Pull request overview

This PR addresses a flaky end-to-end test that validates workload eviction and requeue behavior with a tiny RecoveryTimeout (10ms). The flakiness occurred because multiple evictions could happen rapidly before pods became Ready, causing premature deactivation with the original BackoffLimitCount: 1.

Key changes:

  • Increased BackoffLimitCount from 1 to 2 to allow multiple eviction cycles before deactivation
  • Updated test assertions to use >=1 comparisons instead of exact counts to accommodate multiple evictions
  • Replaced struct comparison with individual field assertions for better test clarity
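The shift from an exact struct comparison to per-field checks with a lower bound might look roughly like the following. This is a plain-Go sketch, not the test code from this PR: `requeueState` is a simplified stand-in type and `checkRequeued` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// requeueState is a simplified stand-in for the workload's requeue status.
// It is an assumption for illustration, not the Kueue API type.
type requeueState struct {
	Count *int32
}

// checkRequeued accepts any requeue count of at least 1, so an extra
// eviction caused by the tiny RecoveryTimeout does not fail the check.
func checkRequeued(state *requeueState) error {
	if state == nil {
		return errors.New("requeue state is nil")
	}
	if state.Count == nil {
		return errors.New("requeue count is nil")
	}
	if *state.Count < 1 {
		return fmt.Errorf("requeue count = %d, want >= 1", *state.Count)
	}
	return nil
}

func main() {
	for _, n := range []int32{0, 1, 2} {
		count := n
		fmt.Printf("count=%d: err=%v\n", n, checkRequeued(&requeueState{Count: &count}))
	}
}
```

Checking each field separately also means a failure message names the field that was wrong, rather than reporting a mismatch of the whole struct.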


Comment on lines +245 to +246
// Allow at least one requeue cycle before deactivation so the test
// can verify the requeue behavior.

Copilot AI Dec 6, 2025

[nitpick] The comment states "Allow at least one requeue cycle before deactivation" but with BackoffLimitCount: 2, the workload would be deactivated when count+1 > BackoffLimitCount, i.e., when count+1 > 2, which means deactivation occurs when count reaches 2 (since 2+1 > 2). This allows for 2 evictions before deactivation, not just "at least one requeue cycle."

Consider clarifying the comment to be more precise:

// Allow up to 2 evictions before deactivation so the test
// can verify the requeue behavior even if multiple evictions occur.
Suggested change
- // Allow at least one requeue cycle before deactivation so the test
- // can verify the requeue behavior.
+ // Allow up to 2 evictions before deactivation so the test
+ // can verify the requeue behavior even if multiple evictions occur.

@sohankunkerkar
Member Author

/retest


Signed-off-by: Sohan Kunkerkar <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Dec 7, 2025
@mimowo
Contributor

mimowo commented Dec 8, 2025

/retest

@sohankunkerkar sohankunkerkar changed the title from "[WIP] Fix flaky RecoveryTimeout e2e test" to "Fix flaky RecoveryTimeout e2e test" Dec 9, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 9, 2025
Contributor

@mimowo mimowo left a comment

Thank you, this makes sense 👍
/lgtm
/approve
/cherrypick release-0.15
/cherrypick release-0.14

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: f7fe0a7671efd40cc2b3066d4abdfb5f0854d434

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, sohankunkerkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit 217d8e4 into kubernetes-sigs:main Dec 10, 2025
28 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.16 milestone Dec 10, 2025

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Flaky test] Timeout and a tiny RecoveryTimeout should evict and requeue workload when pod failure causes recovery timeou

3 participants