
Fix Agent Scheduler multi worker optimistic parallel scheduling concurrently conflict error #5154

Open
JesseStutler wants to merge 2 commits into volcano-sh:master from JesseStutler:agent_scheduler_fix

Conversation

@JesseStutler
Member

@JesseStutler JesseStutler commented Apr 2, 2026

What type of PR is this?

/kind bug
/priority high

What this PR does / why we need it:

As #5146 mentioned, enabling multiple workers for optimistic parallel scheduling can lead to concurrent map read/write conflicts and cycleState overwriting issues. This is mainly because the Action is a singleton holding a single framework variable, while each worker needs its own independent framework; sharing one framework across workers can cause concurrency conflicts when they run simultaneously: #5147 (comment)

In addition, lastProcessedNodeIndex is a plain int variable, which can also race in the multi-worker scenario; it should be updated to use an atomic type.

Which issue(s) this PR fixes:

Fixes #5146 #5147

Special notes for your reviewer:

Validation:
Test YAMLs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bench
  namespace: default
spec:
  replicas: 1000
  selector:
    matchLabels:
      app: bench
  template:
    metadata:
      labels:
        app: bench
      annotations:
        foo: bar
    spec:
      schedulerName: agent-scheduler
      containers:
        - name: nginx
          image: nginx:alpine
      volumes:
        - name: agent-runtime-volume
          emptyDir: {}

After creating 1000 replicas, the scheduler no longer hit panic errors.

Does this PR introduce a user-facing change?


Copilot AI review requested due to automatic review settings April 2, 2026 13:11
@volcano-sh-bot volcano-sh-bot added kind/bug Categorizes issue or PR as related to a bug. priority/high labels Apr 2, 2026
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jessestutler. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 2, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the allocate action to be stateless by passing the Framework instance as a method argument, enabling concurrent scheduling across multiple workers. It also introduces a new concurrent stress test and converts the global lastProcessedNodeIndex to an atomic type. Feedback was provided regarding a race condition and a logic bug in the non-atomic update of lastProcessedNodeIndex, which could reset the fairness heuristic when checking nominated nodes.


//processedNodes := int(numFoundNodes) + len(filteredNodesStatuses) + len(failedPredicateMap)
- lastProcessedNodeIndex = (lastProcessedNodeIndex + int(processedNodes)) % allNodes
+ lastProcessedNodeIndex.Store(int64((int(lastProcessedNodeIndex.Load()) + int(processedNodes)) % allNodes))

high

The update to lastProcessedNodeIndex is not atomic and contains a logic bug when used with varying node list sizes.

  1. Race Condition: The sequence of Load followed by Store is not an atomic operation. In a multi-worker environment, concurrent updates can lead to lost progress in the node iteration heuristic because one worker's update can overwrite another's.
  2. Reset Bug: When allNodes is 1 (which occurs during the nominated node check in allocate.go), the modulo operation (index + processed) % 1 will always result in 0. This resets the global starting index for the entire cluster to zero, effectively breaking the fairness heuristic for subsequent scheduling cycles.

It is recommended to use a CompareAndSwap loop for atomicity and skip the update if allNodes <= 1.

if allNodes > 1 {
	for {
		old := lastProcessedNodeIndex.Load()
		newVal := (old + int64(processedNodes)) % int64(allNodes)
		if lastProcessedNodeIndex.CompareAndSwap(old, newVal) {
			break
		}
	}
}

Member Author

@JesseStutler JesseStutler Apr 2, 2026


I think it's OK to overwrite for now: different workers share the same numsToFind value, so they scan nodes at the same pace. Even if their processing speeds differ and the fastest worker overwrites the index, at least all nodes will still be scanned once. A CAS loop, however, might cause some nodes to be skipped during scanning.

@JesseStutler
Member Author

/cc @wangyang0616 @qi-min @JBinin

@volcano-sh-bot
Contributor

@JesseStutler: GitHub didn't allow me to request PR reviews from the following users: JBinin, qi-min.

Note that only volcano-sh members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @wangyang0616 @qi-min @JBinin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses agent-scheduler panics seen with multiple scheduling workers by removing shared mutable scheduling state across workers and hardening a shared scheduler util against concurrent access.

Changes:

  • Make lastProcessedNodeIndex atomic and update predicate helper to use atomic loads/stores to avoid cross-worker data races.
  • Refactor agent-scheduler allocate action to stop storing a framework pointer on the shared Action instance; pass fwk through the call stack instead.
  • Improve agent-scheduler test harness to support multi-worker frameworks and add a concurrent scheduling stress test.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • pkg/scheduler/util/scheduler_helper.go — Switch lastProcessedNodeIndex to atomic.Int64 for safe cross-worker access.
  • pkg/scheduler/util/predicate_helper.go — Use atomic load/store for lastProcessedNodeIndex when iterating nodes.
  • pkg/agentscheduler/uthelper/helper.go — Update test harness to build multiple frameworks and centralize plugin/option init.
  • pkg/agentscheduler/cache/cache_mock.go — Initialize taskCache in mock cache to match real cache behavior.
  • pkg/agentscheduler/actions/allocate/allocate.go — Remove shared fwk field from Action; thread framework through execution to avoid worker interference.
  • pkg/agentscheduler/actions/allocate/allocate_test.go — Add concurrent multi-worker scheduling test.
  • pkg/agentscheduler/actions/allocate/allocate_shard_test.go — Adapt shard test to updated test framework helper API.


@JesseStutler JesseStutler added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 2, 2026
@JesseStutler JesseStutler force-pushed the agent_scheduler_fix branch from 474194d to 0c7bae1 Compare April 3, 2026 06:44
@JesseStutler JesseStutler removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2026
@JesseStutler JesseStutler force-pushed the agent_scheduler_fix branch from 0c7bae1 to 3ba4b06 Compare April 3, 2026 07:03
//processedNodes := int(numFoundNodes) + len(filteredNodesStatuses) + len(failedPredicateMap)
- lastProcessedNodeIndex = (lastProcessedNodeIndex + int(processedNodes)) % allNodes
+ newIndex := int64((startIndex + int(processedNodes)) % allNodes)
+ lastProcessedNodeIndex.Store(newIndex)
Contributor


lastProcessedNodeIndex might be overwritten by a different worker, because workers run predicates on nodes concurrently and the processedNodes count differs in each worker's cycle.
In that case, would it be better to keep lastProcessedNodeIndex in worker scope and count the index separately?


Labels

kind/bug Categorizes issue or PR as related to a bug. priority/high size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Agent-Scheduler:concurrent map read and map write in handleSkipPredicatePlugin function

4 participants