fix(scheduler): remove node from cache when node-selector label is removed by brick-pixel · Pull Request #5145 · volcano-sh/volcano

brick-pixel · 2026-03-31T08:01:46Z

What type of PR is this?

/kind bug
/area scheduling

What this PR does / why we need it

When using --node-selector in the volcano scheduler, removing the matching label from a node does not evict the node from the scheduler cache. This causes the scheduler to continue scheduling pods to nodes that no longer match the configured selector.

Root Cause

In SyncNode(), when nodeCanAddCache() returns false (node labels no longer match the selector), the function simply returns nil without calling RemoveNode(). If the node was previously in the cache (from when it had the matching label), it remains cached and eligible for scheduling.

Fix

Call RemoveNode() before returning when nodeCanAddCache() returns false. This ensures the node is evicted from cache when its labels no longer match the configured node selector. RemoveNode() is safe to call even if the node is not in the cache (it will simply be a no-op).

Which issue(s) this PR fixes

Fixes #5122

Does this PR introduce a user-facing change?

Fixed a bug where removing the node-selector label from a node did not remove it from the scheduler cache, causing pods to be incorrectly scheduled to non-matching nodes.

How Has This Been Tested?

Added TestSchedulerCache_SyncNode_RemoveOnLabelChange that:

Creates a node with a matching label and syncs it into cache
Removes the label from the node and syncs again
Verifies the node is no longer in the scheduler cache

gemini-code-assist

Code Review

This pull request modifies the SyncNode function in the scheduler cache to ensure that nodes are removed from the cache if they no longer match the configured node selector after a label update. It also includes a new test case to verify this behavior. A critical issue was identified where the current implementation would cause SyncNode to return errors and trigger unnecessary retries for any node in the cluster that does not match the selector, even if it was never in the cache. A suggestion was provided to check for the node's existence in the cache before attempting removal.

gemini-code-assist · 2026-03-31T08:03:04Z

pkg/scheduler/cache/event_handlers.go

+		if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
+			klog.Errorf("Failed to remove node <%s> from cache after label change: %s", nodeName, deleteErr.Error())
+			return deleteErr
+		}
+		klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)


The call to sc.RemoveNode(nodeName) returns an error if the node is not present in the cache (see lines 540-542). Since SyncNode is triggered for all node updates, and nodeCanAddCache returns false for any node that does not match the configured selector, this implementation will cause SyncNode to return an error for every non-matching node in the cluster. This results in continuous error logging and unnecessary retries in the nodeQueue. You should check if the node exists in the cache before attempting to remove it.

Suggested change

if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {

klog.Errorf("Failed to remove node <%s> from cache after label change: %s", nodeName, deleteErr.Error())

return deleteErr

}

klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)

sc.Mutex.Lock()

_, ok := sc.Nodes[nodeName]

sc.Mutex.Unlock()

if ok {

if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {

klog.Errorf("Failed to remove node <%s> from cache after label change: %s", nodeName, deleteErr.Error())

return deleteErr

}

klog.V(3).Infof("Node <%s> no longer matches node selector, removed from cache.", nodeName)

}

Copilot

Pull request overview

This PR fixes a bug where removing a node-selector label from a node does not evict it from the scheduler cache, allowing the scheduler to continue scheduling pods to non-matching nodes. The fix adds a call to RemoveNode() when nodeCanAddCache() returns false in the SyncNode() method, and includes a test to verify this behavior.

Changes:

Modified SyncNode() to remove nodes from cache when they no longer match the configured node selector
Added test case TestSchedulerCache_SyncNode_RemoveOnLabelChange to verify node removal on label change

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
pkg/scheduler/cache/event_handlers.go	Added logic to call RemoveNode() when nodeCanAddCache() returns false
pkg/scheduler/cache/event_handlers_test.go	Added test to verify node removal when label is removed

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-31T08:04:53Z

pkg/scheduler/cache/event_handlers.go

+		if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
+			klog.Errorf("Failed to remove node <%s> from cache after label change: %s", nodeName, deleteErr.Error())
+			return deleteErr
+		}


The RemoveNode() function returns an error if the node doesn't exist in the cache (line 541). This causes SyncNode() to fail when nodeCanAddCache() returns false for a node that was never added to cache. This breaks the existing test case "Node not added to cache" which expects wantErr: false. The RemoveNode() call should only propagate errors that aren't "node does not exist" errors, or RemoveNode() should be modified to not error when the node is not found.

Copilot · 2026-03-31T08:04:54Z

pkg/scheduler/cache/event_handlers.go

+		if deleteErr := sc.RemoveNode(nodeName); deleteErr != nil {
+			klog.Errorf("Failed to remove node <%s> from cache after label change: %s", nodeName, deleteErr.Error())
+			return deleteErr
+		}


The PR description states that "RemoveNode() is safe to call even if the node is not in the cache (it will simply be a no-op)", but the implementation at line 540-541 returns an error if the node doesn't exist in cache. This is a discrepancy between the PR description and the actual code behavior.

Copilot · 2026-03-31T08:04:54Z

pkg/scheduler/cache/event_handlers_test.go

+	err = sc.SyncNode("n1")
+	assert.NoError(t, err)
+	assert.NotContains(t, sc.Nodes, "n1", "node should be removed from cache after label removal")
+}


The new test only covers the case where a node is removed from cache after label removal. It should also test the case where a node never matches the selector in the first place, to ensure no regression in existing behavior. This edge case is currently broken by the fix.

Suggested change

}

}

func TestSchedulerCache_SyncNode_NeverMatchesSelector(t *testing.T) {

// Simulate: node never has the matching label, so it should never be added to cache

n2 := util.BuildNode("n2", nil, map[string]string{})

sc := NewDefaultMockSchedulerCache("volcano")

sc.nodeSelectorLabels = map[string]sets.Empty{

"train:true": {},

}

// Add node without matching label to informer and sync

sc.nodeInformer.Informer().GetIndexer().Add(n2)

err := sc.SyncNode("n2")

assert.NoError(t, err)

assert.NotContains(t, sc.Nodes, "n2", "node without matching label should not be added to cache")

}

…moved When using --node-selector in volcano scheduler, removing the matching label from a node did not evict the node from the scheduler cache. SyncNode() checked nodeCanAddCache() but only returned nil without calling RemoveNode(), causing the scheduler to continue scheduling pods to nodes that no longer match the selector. Add RemoveNode() call when nodeCanAddCache() returns false, so the node is properly evicted from cache when its labels no longer match the configured node selector. Signed-off-by: brick-pixel <johnpixel128@gmail.com>

hajnalmt

Thanks for the PR @brick-pixel !
We will need a much more robust implementation for this with e2e verificaton.

Please check my comment here.

This reverts commit 59e1bb7.

volcano-sh-bot · 2026-03-31T12:14:28Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign k82cn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/scheduler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI review requested due to automatic review settings March 31, 2026 08:01

Copilot started reviewing on behalf of brick-pixel March 31, 2026 08:02 View session

gemini-code-assist bot reviewed Mar 31, 2026

View reviewed changes

Copilot AI reviewed Mar 31, 2026

View reviewed changes

brick-pixel force-pushed the fix/remove-node-on-label-change branch from c0c29ed to 1906734 Compare March 31, 2026 08:06

volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 31, 2026

hajnalmt suggested changes Mar 31, 2026

View reviewed changes

brick-pixel added 2 commits March 31, 2026 12:07

docs: add visual proof video (test)

59e1bb7

Revert "docs: add visual proof video (test)"

585f885

This reverts commit 59e1bb7.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scheduler): remove node from cache when node-selector label is removed#5145

fix(scheduler): remove node from cache when node-selector label is removed#5145
brick-pixel wants to merge 3 commits intovolcano-sh:masterfrom
brick-pixel:fix/remove-node-on-label-change

brick-pixel commented Mar 31, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

hajnalmt left a comment

Uh oh!

volcano-sh-bot commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

-}
+}
+func TestSchedulerCache_SyncNode_NeverMatchesSelector(t *testing.T) {
+	// Simulate: node never has the matching label, so it should never be added to cache
+	n2 := util.BuildNode("n2", nil, map[string]string{})
+	sc := NewDefaultMockSchedulerCache("volcano")
+	sc.nodeSelectorLabels = map[string]sets.Empty{
+		"train:true": {},
+	}
+	// Add node without matching label to informer and sync
+	sc.nodeInformer.Informer().GetIndexer().Add(n2)
+	err := sc.SyncNode("n2")
+	assert.NoError(t, err)
+	assert.NotContains(t, sc.Nodes, "n2", "node without matching label should not be added to cache")
+}

Conversation

brick-pixel commented Mar 31, 2026

What type of PR is this?

What this PR does / why we need it

Root Cause

Fix

Which issue(s) this PR fixes

Does this PR introduce a user-facing change?

How Has This Been Tested?

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

hajnalmt left a comment

Choose a reason for hiding this comment

Uh oh!

volcano-sh-bot commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants