Use WithLooseOnApply() in scheduler eviction. #7933
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: mbobrovskyi. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
✅ Deploy Preview for kubernetes-sigs-kueue ready!
/hold
EDIT: IIUC this is a performance optimization to avoid retrying the scheduling cycle on conflicts - or is there another reason? In any case, if we want to change how eviction works, I think we should also follow the same pattern for preemptions.
pkg/scheduler/scheduler.go
Outdated
if err := workload.Evict(ctx, s.client, s.recorder, wl, kueue.WorkloadEvictedDueToNodeFailures, msg, "", s.clock,
	workload.EvictWithLooseOnApply(), workload.EvictWithRetryOnConflictForPatch()); err != nil {
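For readers unfamiliar with this call style: the `workload.EvictWith*()` helpers read like Go functional options appended to the eviction call. The sketch below is a hypothetical illustration of that pattern only - the types, fields, and defaults are assumptions for illustration, not kueue's actual implementation.

```go
package main

import "fmt"

// Hypothetical settings struct; each option mutates it before the
// eviction is issued. Names are illustrative, not kueue's real API.
type evictSettings struct {
	looseOnApply    bool
	retryOnConflict bool
}

type EvictOption func(*evictSettings)

// EvictWithLooseOnApply tolerates a mismatch when applying the eviction.
func EvictWithLooseOnApply() EvictOption {
	return func(s *evictSettings) { s.looseOnApply = true }
}

// EvictWithRetryOnConflictForPatch retries the patch on a 409 conflict.
func EvictWithRetryOnConflictForPatch() EvictOption {
	return func(s *evictSettings) { s.retryOnConflict = true }
}

func newEvictSettings(opts ...EvictOption) evictSettings {
	var s evictSettings // zero value: strict apply, no retry
	for _, opt := range opts {
		opt(&s)
	}
	return s
}

func main() {
	s := newEvictSettings(EvictWithLooseOnApply(), EvictWithRetryOnConflictForPatch())
	fmt.Println(s.looseOnApply, s.retryOnConflict)
}
```

A design consequence of this pattern is that the default (no options passed) stays strict, so adding options at one call site cannot silently change behavior elsewhere.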
Scheduler issues evictions (like preemptions) using strict mode: https://github.com/kubernetes-sigs/kueue/blob/4301e1c8f7f78debb3be609b7467cc310b2c10f7/pkg/scheduler/preemption/preemption.go#L172C37-L173
I'm not saying we cannot change it, but if we do, then we should change all evictions (including preemptions).
And since it has been like that for a long time, if we want to change it we should guard it behind a feature gate. I think such a change can be considered for 0.16, because 0.15 is too soon to evaluate it.
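The feature-gate guard suggested above could look roughly like the sketch below. The gate name, the `enabled` helper, and the map-backed registry are all hypothetical stand-ins; kueue's real gates live in its features package and follow the standard Kubernetes gate machinery.

```go
package main

import "fmt"

type featureGate string

// Hypothetical gate name for the new eviction behavior.
const LooseEvictionApply featureGate = "LooseEvictionApply"

// Illustrative registry; off by default until the change is evaluated.
var enabledGates = map[featureGate]bool{
	LooseEvictionApply: false,
}

func enabled(g featureGate) bool { return enabledGates[g] }

// evictOptionNames returns the extra eviction options to pass, keyed
// off the gate, so the existing strict behavior is preserved when the
// gate is off.
func evictOptionNames() []string {
	if enabled(LooseEvictionApply) {
		return []string{"EvictWithLooseOnApply", "EvictWithRetryOnConflictForPatch"}
	}
	return nil
}

func main() {
	fmt.Println(len(evictOptionNames())) // gate off: no extra options
	enabledGates[LooseEvictionApply] = true
	fmt.Println(evictOptionNames())
}
```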
Scheduler issues evictions (like preemptions) using strict mode: https://github.com/kubernetes-sigs/kueue/blob/4301e1c8f7f78debb3be609b7467cc310b2c10f7/pkg/scheduler/preemption/preemption.go#L172C37-L173
Probably yes, I haven’t checked that yet.
Ok, in that case let's park the change until after 0.15 and re-evaluate then. I haven't seen any reports from users complaining about that scenario so far. I need more time to think about how this should work.
I haven't seen from users any reports complaining about that scenario so far.
It's a very rare case to hit, but I think we should handle it properly.
Ok, let's revisit post-0.15. I still think the strategy for handling this case of eviction should be consistent with handling preemption, which is just another form of eviction. So I want to extend the scope of the PR to cover all evictions triggered by the scheduler - assuming we go this way, but I need to think more after 0.15.0.
Also added workload.EvictWithLooseOnApply() and workload.EvictWithRetryOnConflictForPatch() for preemption.
Moreover, if we hit an error here we just continue without evicting, which is incorrect because it results in a different event reason and message. You can try reverting my changes and running the test.
Ok, but if this is a bugfix then it requires different tagging and a release note. Also, I'm not clear on why preemptions are any different. Also, I think we can just improve the message for SecondPassFailed to concatenate the underlying reason, as a quick fix without changing the mechanics.
btw, https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/7933/pull-kueue-test-integration-baseline-main/1993948535971647488 looks like an unrelated new flake, wdyt?
I think this is just a network issue, no?
I think this is a new test and it has a flake / bug - we call StartManager, but I don't see StopManager. EDIT: opened #7935
/hold
For the v0.15.0 release.
Force-pushed 2b4593a to 07cfbbf.
/unhold
/hold
I will try to return to this PR at the end of the week or next week. It seems like a performance improvement that we could maybe call a bugfix and cherry-pick, but I need more time to think about it. Feel free to ping me EOW.
@mbobrovskyi Can you make the description more exhaustive? What is the context/motivation for it, and what consequences does it have?
Updated. Also explained here #7933 (comment). |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Use WithLooseOnApply() in evictWorkloadAfterFailedTASReplacement.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
On a race condition, when we're trying to evict a workload, we hit an error here and just continue without evicting. This is incorrect because it results in a different event reason and message.
Here’s an example of what we get in the unit tests if we revert the changes:
Does this PR introduce a user-facing change?