fix(scheduler): remove node from cache when node selector no longer matches by csh0101 · Pull Request #5123 · volcano-sh/volcano

csh0101 · 2026-03-20T17:58:57Z

Problem

When a node's label is removed and it no longer matches the node selector (--node-selector flag), the scheduler cache was not removing the node, causing pods to still be scheduled to that node.

Fixes #5122

Solution

In SyncNode function, when nodeCanAddCache returns false, check if the node exists in cache and remove it if necessary.

Changes

Modified pkg/scheduler/cache/event_handlers.go
When node no longer matches node selector but exists in cache, remove it from cache
Added appropriate error handling and logging

Testing

Remove label from a node that was previously matching node selector
Verify node is removed from scheduler cache
Verify pods are no longer scheduled to that node

volcano-sh-bot · 2026-03-20T17:59:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kingeasternsun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/scheduler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist · 2026-03-20T17:59:08Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the scheduler's caching mechanism where nodes were not being evicted from the cache when their labels no longer aligned with the configured node selector. The fix ensures that the scheduler's view of available nodes remains consistent with the current cluster state, preventing pods from being scheduled onto inappropriate nodes and improving the overall reliability of the scheduling process.

Highlights

Scheduler Cache Inconsistency: Addressed an issue where the scheduler cache retained nodes even after their labels changed, causing them to no longer match the --node-selector flag, leading to incorrect pod scheduling.
Node Removal Logic: Implemented logic within the SyncNode function to explicitly remove a node from the scheduler cache if it no longer satisfies the nodeCanAddCache condition but is still present in the cache.
Error Handling and Logging: Included robust error handling and informative logging for cases where a node fails to be removed from the cache.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

Copilot

Pull request overview

This PR fixes issue #5122 where the Volcano scheduler cache was not removing nodes when their labels were modified such that they no longer matched the configured node selector. The fix adds logic to the SyncNode function to detect when a node no longer matches the selector criteria and removes it from the cache accordingly.

Changes:

Added node removal logic in SyncNode when nodeCanAddCache returns false but the node exists in cache
Includes appropriate error handling and logging for the removal operation
Addresses the specific scenario where a node's labels are updated to no longer match the node selector

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-20T18:01:27Z

pkg/scheduler/cache/event_handlers.go

+		// If node exists in cache but no longer matches node selector, remove it
+		if _, exists := sc.Nodes[nodeName]; exists {
+			if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
+				klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",
+					nodeName, deleteErr.Error())
+				return deleteErr
+			}
+			klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
+		}


There is a potential race condition here. The check at line 625 reads sc.Nodes[nodeName] without holding the lock, but RemoveNode at line 626 acquires the lock and modifies sc.Nodes. Since there can be multiple concurrent node workers (see cache.go:822), another goroutine could remove the node between the check and the RemoveNode call, causing RemoveNode to return an error (see event_handlers.go:540-541). The exists check should be moved inside a lock or removed entirely since RemoveNode already handles the case where the node doesn't exist (it returns an error that could be ignored if it's already been removed).

Suggested change

// If node exists in cache but no longer matches node selector, remove it

if _, exists := sc.Nodes[nodeName]; exists {

if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {

klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",

nodeName, deleteErr.Error())

return deleteErr

}

klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)

}

// Node no longer matches node selector, ensure it is removed from cache

if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {

klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",

nodeName, deleteErr.Error())

return deleteErr

}

klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)

Copilot · 2026-03-20T18:01:27Z

pkg/scheduler/cache/event_handlers.go

+		// If node exists in cache but no longer matches node selector, remove it
+		if _, exists := sc.Nodes[nodeName]; exists {
+			if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
+				klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",
+					nodeName, deleteErr.Error())
+				return deleteErr
+			}
+			klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
+		}


The new behavior added in this PR (removing a node from cache when it no longer matches the node selector) is not covered by any test case. The existing TestSchedulerCache_SyncNode test has cases for "Node added to cache" and "Node not added to cache", but no test case for "Node removed from cache when selector no longer matches". This scenario should be tested with a node that is first added to the cache (when it matches the selector) and then synced again with a different node state (without the matching label).

gemini-code-assist

Code Review

The pull request addresses an issue where nodes were not being removed from the scheduler cache when their node selectors no longer matched, leading to incorrect pod scheduling. The solution involves modifying the SyncNode function to check for and remove such nodes from the cache. A review comment has been added to address a potential issue.

gemini-code-assist · 2026-03-20T18:06:21Z

pkg/scheduler/cache/event_handlers.go

+		// If node exists in cache but no longer matches node selector, remove it
+		if _, exists := sc.Nodes[nodeName]; exists {
+			if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
+				klog.Errorf("Failed to remove node <%s> from cache that no longer matches node selector: %s",
+					nodeName, deleteErr.Error())
+				return deleteErr
+			}
+			klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)
+		}


The added code checks if a node exists in the cache but no longer matches the node selector, and if so, removes it. However, there's a potential race condition here. If sc.RemoveNode takes a significant amount of time, another goroutine might add the node back to the cache before the removal is complete. This could lead to the node being briefly removed and then re-added, potentially causing scheduling inconsistencies.

To mitigate this, consider using a mutex or a channel to synchronize access to the sc.Nodes map during the removal process. This will ensure that only one goroutine can modify the cache at a time, preventing race conditions.

…atches When a node's label is removed and it no longer matches the node selector, the scheduler cache was not removing the node, causing pods to still be scheduled to that node. This fix checks if the node exists in cache when nodeCanAddCache returns false, and removes it if necessary. Related-to: 5122 Signed-off-by: csh0101 <csh0101@example.com>

hajnalmt

Thanks for the PR @csh0101 !
We will need a much more robust implementation for this with e2e verificaton.

Please check my comment here.

Copilot AI review requested due to automatic review settings March 20, 2026 17:58

volcano-sh-bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 20, 2026

volcano-sh-bot requested review from hwdef and william-wang March 20, 2026 17:59

volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 20, 2026

Copilot started reviewing on behalf of csh0101 March 20, 2026 17:59 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

gemini-code-assist bot reviewed Mar 20, 2026

View reviewed changes

csh0101 force-pushed the fix-sync-node-cache-removal branch from 0d6bdeb to 87e7f5e Compare March 21, 2026 08:13

csh0101 force-pushed the fix-sync-node-cache-removal branch from 87e7f5e to 89c766d Compare March 21, 2026 13:46

volcano-sh-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 21, 2026

hajnalmt suggested changes Mar 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scheduler): remove node from cache when node selector no longer matches#5123

fix(scheduler): remove node from cache when node selector no longer matches#5123
csh0101 wants to merge 1 commit intovolcano-sh:masterfrom
csh0101:fix-sync-node-cache-removal

csh0101 commented Mar 20, 2026

Uh oh!

volcano-sh-bot commented Mar 20, 2026

Uh oh!

gemini-code-assist bot commented Mar 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 20, 2026

Uh oh!

Copilot AI Mar 20, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 20, 2026

Uh oh!

hajnalmt left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

csh0101 commented Mar 20, 2026

Problem

Solution

Changes

Testing

Uh oh!

volcano-sh-bot commented Mar 20, 2026

Uh oh!

gemini-code-assist bot commented Mar 20, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

hajnalmt left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants