fix(scheduler): remove node from cache when node selector no longer matches#5123
fix(scheduler): remove node from cache when node selector no longer matches#5123csh0101 wants to merge 1 commit intovolcano-sh:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical bug in the scheduler's caching mechanism where nodes were not being evicted from the cache when their labels no longer aligned with the configured node selector. The fix ensures that the scheduler's view of available nodes remains consistent with the current cluster state, preventing pods from being scheduled onto inappropriate nodes and improving the overall reliability of the scheduling process. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Pull request overview
This PR fixes issue #5122 where the Volcano scheduler cache was not removing nodes when their labels were modified such that they no longer matched the configured node selector. The fix adds logic to the SyncNode function to detect when a node no longer matches the selector criteria and removes it from the cache accordingly.
Changes:
- Added node removal logic in
SyncNodewhennodeCanAddCachereturns false but the node exists in cache - Includes appropriate error handling and logging for the removal operation
- Addresses the specific scenario where a node's labels are updated to no longer match the node selector
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // If node exists in cache but no longer matches node selector, remove it | ||
| if _, exists := sc.Nodes[nodeName]; exists { | ||
| if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil { | ||
| klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s", | ||
| nodeName, deleteErr.Error()) | ||
| return deleteErr | ||
| } | ||
| klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName) | ||
| } |
There was a problem hiding this comment.
There is a potential race condition here. The check at line 625 reads sc.Nodes[nodeName] without holding the lock, but RemoveNode at line 626 acquires the lock and modifies sc.Nodes. Since there can be multiple concurrent node workers (see cache.go:822), another goroutine could remove the node between the check and the RemoveNode call, causing RemoveNode to return an error (see event_handlers.go:540-541). The exists check should be moved inside a lock or removed entirely since RemoveNode already handles the case where the node doesn't exist (it returns an error that could be ignored if it's already been removed).
| // If node exists in cache but no longer matches node selector, remove it | |
| if _, exists := sc.Nodes[nodeName]; exists { | |
| if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil { | |
| klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s", | |
| nodeName, deleteErr.Error()) | |
| return deleteErr | |
| } | |
| klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName) | |
| } | |
| // Node no longer matches node selector, ensure it is removed from cache | |
| if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil { | |
| klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s", | |
| nodeName, deleteErr.Error()) | |
| return deleteErr | |
| } | |
| klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName) |
| // If node exists in cache but no longer matches node selector, remove it | ||
| if _, exists := sc.Nodes[nodeName]; exists { | ||
| if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil { | ||
| klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s", | ||
| nodeName, deleteErr.Error()) | ||
| return deleteErr | ||
| } | ||
| klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName) | ||
| } |
There was a problem hiding this comment.
The new behavior added in this PR (removing a node from cache when it no longer matches the node selector) is not covered by any test case. The existing TestSchedulerCache_SyncNode test has cases for "Node added to cache" and "Node not added to cache", but no test case for "Node removed from cache when selector no longer matches". This scenario should be tested with a node that is first added to the cache (when it matches the selector) and then synced again with a different node state (without the matching label).
There was a problem hiding this comment.
Code Review
The pull request addresses an issue where nodes were not being removed from the scheduler cache when their node selectors no longer matched, leading to incorrect pod scheduling. The solution involves modifying the SyncNode function to check for and remove such nodes from the cache. A review comment has been added to address a potential issue.
| // If node exists in cache but no longer matches node selector, remove it | ||
| if _, exists := sc.Nodes[nodeName]; exists { | ||
| if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil { | ||
| klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s", | ||
| nodeName, deleteErr.Error()) | ||
| return deleteErr | ||
| } | ||
| klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName) | ||
| } |
There was a problem hiding this comment.
The added code checks if a node exists in the cache but no longer matches the node selector, and if so, removes it. However, there's a potential race condition here. If sc.RemoveNode takes a significant amount of time, another goroutine might add the node back to the cache before the removal is complete. This could lead to the node being briefly removed and then re-added, potentially causing scheduling inconsistencies.
To mitigate this, consider using a mutex or a channel to synchronize access to the sc.Nodes map during the removal process. This will ensure that only one goroutine can modify the cache at a time, preventing race conditions.
0d6bdeb to
87e7f5e
Compare
…atches When a node's label is removed and it no longer matches the node selector, the scheduler cache was not removing the node, causing pods to still be scheduled to that node. This fix checks if the node exists in cache when nodeCanAddCache returns false, and removes it if necessary. Related-to: 5122 Signed-off-by: csh0101 <csh0101@example.com>
87e7f5e to
89c766d
Compare
Problem
When a node's label is removed and it no longer matches the node selector (--node-selector flag), the scheduler cache was not removing the node, causing pods to still be scheduled to that node.
Fixes #5122
Solution
In
SyncNodefunction, whennodeCanAddCachereturns false, check if the node exists in cache and remove it if necessary.Changes
pkg/scheduler/cache/event_handlers.goTesting