Skip to content

fix(scheduler): remove node from cache when node selector no longer matches#5123

Open
csh0101 wants to merge 1 commit intovolcano-sh:masterfrom
csh0101:fix-sync-node-cache-removal
Open

fix(scheduler): remove node from cache when node selector no longer matches#5123
csh0101 wants to merge 1 commit intovolcano-sh:masterfrom
csh0101:fix-sync-node-cache-removal

Conversation

@csh0101
Copy link
Copy Markdown

@csh0101 csh0101 commented Mar 20, 2026

Problem

When a node's label is removed and it no longer matches the node selector (--node-selector flag), the scheduler cache was not removing the node, causing pods to still be scheduled to that node.

Fixes #5122

Solution

In SyncNode function, when nodeCanAddCache returns false, check if the node exists in cache and remove it if necessary.

Changes

  • Modified pkg/scheduler/cache/event_handlers.go
  • When node no longer matches node selector but exists in cache, remove it from cache
  • Added appropriate error handling and logging

Testing

  • Remove label from a node that was previously matching node selector
  • Verify node is removed from scheduler cache
  • Verify pods are no longer scheduled to that node

Copilot AI review requested due to automatic review settings March 20, 2026 17:58
@volcano-sh-bot volcano-sh-bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 20, 2026
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kingeasternsun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the scheduler's caching mechanism where nodes were not being evicted from the cache when their labels no longer aligned with the configured node selector. The fix ensures that the scheduler's view of available nodes remains consistent with the current cluster state, preventing pods from being scheduled onto inappropriate nodes and improving the overall reliability of the scheduling process.

Highlights

  • Scheduler Cache Inconsistency: Addressed an issue where the scheduler cache retained nodes even after their labels changed, causing them to no longer match the --node-selector flag, leading to incorrect pod scheduling.
  • Node Removal Logic: Implemented logic within the SyncNode function to explicitly remove a node from the scheduler cache if it no longer satisfies the nodeCanAddCache condition but is still present in the cache.
  • Error Handling and Logging: Included robust error handling and informative logging for cases where a node fails to be removed from the cache.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 20, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes issue #5122 where the Volcano scheduler cache was not removing nodes when their labels were modified such that they no longer matched the configured node selector. The fix adds logic to the SyncNode function to detect when a node no longer matches the selector criteria and removes it from the cache accordingly.

Changes:

  • Added node removal logic in SyncNode when nodeCanAddCache returns false but the node exists in cache
  • Includes appropriate error handling and logging for the removal operation
  • Addresses the specific scenario where a node's labels are updated to no longer match the node selector

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +624 to +632
// If node exists in cache but no longer matches node selector, remove it
if _, exists := sc.Nodes[nodeName]; exists {
if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",
nodeName, deleteErr.Error())
return deleteErr
}
klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential race condition here. The check at line 625 reads sc.Nodes[nodeName] without holding the lock, but RemoveNode at line 626 acquires the lock and modifies sc.Nodes. Since there can be multiple concurrent node workers (see cache.go:822), another goroutine could remove the node between the check and the RemoveNode call, causing RemoveNode to return an error (see event_handlers.go:540-541). The exists check should be moved inside a lock or removed entirely since RemoveNode already handles the case where the node doesn't exist (it returns an error that could be ignored if it's already been removed).

Suggested change
// If node exists in cache but no longer matches node selector, remove it
if _, exists := sc.Nodes[nodeName]; exists {
if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",
nodeName, deleteErr.Error())
return deleteErr
}
klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
}
// Node no longer matches node selector, ensure it is removed from cache
if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",
nodeName, deleteErr.Error())
return deleteErr
}
klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)

Copilot uses AI. Check for mistakes.
Comment on lines +624 to +632
// If node exists in cache but no longer matches node selector, remove it
if _, exists := sc.Nodes[nodeName]; exists {
if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",
nodeName, deleteErr.Error())
return deleteErr
}
klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new behavior added in this PR (removing a node from cache when it no longer matches the node selector) is not covered by any test case. The existing TestSchedulerCache_SyncNode test has cases for "Node added to cache" and "Node not added to cache", but no test case for "Node removed from cache when selector no longer matches". This scenario should be tested with a node that is first added to the cache (when it matches the selector) and then synced again with a different node state (without the matching label).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request addresses an issue where nodes were not being removed from the scheduler cache when their node selectors no longer matched, leading to incorrect pod scheduling. The solution involves modifying the SyncNode function to check for and remove such nodes from the cache. A review comment has been added to address a potential issue.

Comment on lines +624 to +632
// If node exists in cache but no longer matches node selector, remove it
if _, exists := sc.Nodes[nodeName]; exists {
if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",
nodeName, deleteErr.Error())
return deleteErr
}
klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The added code checks if a node exists in the cache but no longer matches the node selector, and if so, removes it. However, there's a potential race condition here. If sc.RemoveNode takes a significant amount of time, another goroutine might add the node back to the cache before the removal is complete. This could lead to the node being briefly removed and then re-added, potentially causing scheduling inconsistencies.

To mitigate this, consider using a mutex or a channel to synchronize access to the sc.Nodes map during the removal process. This will ensure that only one goroutine can modify the cache at a time, preventing race conditions.

@csh0101 csh0101 force-pushed the fix-sync-node-cache-removal branch from 0d6bdeb to 87e7f5e Compare March 21, 2026 08:13
…atches

When a node's label is removed and it no longer matches the node selector,
the scheduler cache was not removing the node, causing pods to still be
scheduled to that node.

This fix checks if the node exists in cache when nodeCanAddCache returns
false, and removes it if necessary.

Related-to: 5122

Signed-off-by: csh0101 <csh0101@example.com>
@csh0101 csh0101 force-pushed the fix-sync-node-cache-removal branch from 87e7f5e to 89c766d Compare March 21, 2026 13:46
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 21, 2026
Copy link
Copy Markdown
Member

@hajnalmt hajnalmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @csh0101 !
We will need a much more robust implementation for this with e2e verificaton.

Please check my comment here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scheduler cache does not remove node when remove label from node

4 participants