Fix Agent Scheduler multi-worker optimistic parallel scheduling concurrency conflict error #5154
JesseStutler wants to merge 2 commits into volcano-sh:master
Conversation
[APPROVAL NOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Code Review
This pull request refactors the allocate action to be stateless by passing the Framework instance as a method argument, enabling concurrent scheduling across multiple workers. It also introduces a new concurrent stress test and converts the global lastProcessedNodeIndex to an atomic type. Feedback was provided regarding a race condition and a logic bug in the non-atomic update of lastProcessedNodeIndex, which could reset the fairness heuristic when checking nominated nodes.
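The stateless refactor described above can be sketched as follows. The types and method shapes here are illustrative stand-ins, not the actual volcano-sh API:

```go
package main

import "fmt"

// Illustrative stand-in only; not the real volcano-sh Framework type.
type Framework struct{ ID int }

// Before: a singleton Action caches the framework on itself, so two
// workers calling Execute with different frameworks race on a.fwk.
type statefulAction struct{ fwk *Framework }

func (a *statefulAction) Execute(fwk *Framework) int {
	a.fwk = fwk // shared write: a data race under concurrent workers
	return a.fwk.ID
}

// After: the framework is passed down the call stack as an argument,
// so each worker only ever sees its own framework.
type statelessAction struct{}

func (a *statelessAction) Execute(fwk *Framework) int {
	return fwk.ID // no shared mutable state on the Action
}

func main() {
	a := &statelessAction{}
	fmt.Println(a.Execute(&Framework{ID: 1}), a.Execute(&Framework{ID: 2}))
}
```

With the stateless shape, concurrent workers cannot observe each other's framework, which is the core of the fix.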
```diff
 //processedNodes := int(numFoundNodes) + len(filteredNodesStatuses) + len(failedPredicateMap)
-lastProcessedNodeIndex = (lastProcessedNodeIndex + int(processedNodes)) % allNodes
+lastProcessedNodeIndex.Store(int64((int(lastProcessedNodeIndex.Load()) + int(processedNodes)) % allNodes))
```
The update to `lastProcessedNodeIndex` is not atomic and contains a logic bug when used with varying node list sizes.

- Race condition: The sequence of `Load` followed by `Store` is not an atomic operation. In a multi-worker environment, concurrent updates can lead to lost progress in the node iteration heuristic, because one worker's update can overwrite another's.
- Reset bug: When `allNodes` is 1 (which occurs during the nominated-node check in `allocate.go`), the modulo operation `(index + processed) % 1` always yields 0. This resets the global starting index for the entire cluster to zero, effectively breaking the fairness heuristic for subsequent scheduling cycles.

It is recommended to use a CompareAndSwap loop for atomicity and to skip the update when `allNodes <= 1`.
```go
if allNodes > 1 {
	for {
		old := lastProcessedNodeIndex.Load()
		newVal := (old + int64(processedNodes)) % int64(allNodes)
		if lastProcessedNodeIndex.CompareAndSwap(old, newVal) {
			break
		}
	}
}
```
I think it's ok to overwrite currently, different workers will have the same numsToFind value, so the scanning speed of nodes is the same. Even if their processing speeds differ, which might cause the fastest worker to overwrite, at least all nodes will be scanned once. However, CAS might cause some nodes to skip the scan.
@JesseStutler: GitHub didn't allow me to request PR reviews from the following users: JBinin, qi-min. Note that only volcano-sh members and repo collaborators can review this PR, and authors cannot review their own PRs. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Pull request overview
This PR addresses agent-scheduler panics seen with multiple scheduling workers by removing shared mutable scheduling state across workers and hardening a shared scheduler util against concurrent access.
Changes:
- Make `lastProcessedNodeIndex` atomic and update the predicate helper to use atomic loads/stores, avoiding cross-worker data races.
- Refactor the agent-scheduler `allocate` action to stop storing a framework pointer on the shared Action instance; pass `fwk` through the call stack instead.
- Improve the agent-scheduler test harness to support multi-worker frameworks and add a concurrent scheduling stress test.
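The atomic conversion in the first bullet can be sketched like this (a simplified stand-in, not the real helper in `pkg/scheduler/util`). Note that a separate `Load` followed by `Store` is free of data races at the memory level, but it is still not one atomic read-modify-write unit, which is exactly what the review comments on this PR discuss:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// The shared cursor becomes an atomic.Int64 instead of a plain int.
var lastProcessedNodeIndex atomic.Int64

// advance moves the shared cursor past the nodes processed in one cycle,
// wrapping around the node list. Load+Store here is race-free but not a
// single atomic update; concurrent workers can still overwrite each other.
func advance(processedNodes, allNodes int) int64 {
	newIndex := (lastProcessedNodeIndex.Load() + int64(processedNodes)) % int64(allNodes)
	lastProcessedNodeIndex.Store(newIndex)
	return newIndex
}

func main() {
	fmt.Println(advance(3, 10)) // 3
	fmt.Println(advance(9, 10)) // 2
}
```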
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/scheduler/util/scheduler_helper.go | Switch lastProcessedNodeIndex to atomic.Int64 for safe cross-worker access. |
| pkg/scheduler/util/predicate_helper.go | Use atomic load/store for lastProcessedNodeIndex when iterating nodes. |
| pkg/agentscheduler/uthelper/helper.go | Update test harness to build multiple frameworks and centralize plugin/option init. |
| pkg/agentscheduler/cache/cache_mock.go | Initialize taskCache in mock cache to match real cache behavior. |
| pkg/agentscheduler/actions/allocate/allocate.go | Remove shared fwk field from Action; thread framework through execution to avoid worker interference. |
| pkg/agentscheduler/actions/allocate/allocate_test.go | Add concurrent multi-worker scheduling test. |
| pkg/agentscheduler/actions/allocate/allocate_shard_test.go | Adapt shard test to updated test framework helper API. |
…ycleState Signed-off-by: JesseStutler <[email protected]>
Force-pushed from 474194d to 0c7bae1
Signed-off-by: JesseStutler <[email protected]>
Force-pushed from 0c7bae1 to 3ba4b06
```diff
 //processedNodes := int(numFoundNodes) + len(filteredNodesStatuses) + len(failedPredicateMap)
-lastProcessedNodeIndex = (lastProcessedNodeIndex + int(processedNodes)) % allNodes
+newIndex := int64((startIndex + int(processedNodes)) % allNodes)
+lastProcessedNodeIndex.Store(newIndex)
```
`lastProcessedNodeIndex` might be overwritten by a different worker, because workers run predicates on nodes concurrently and the `processedNodes` count differs in each worker's cycle.
In that case, is it better to keep `lastProcessedNodeIndex` in worker scope and count the index separately?
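A worker-scoped cursor, as suggested, might look like the following sketch. All names are hypothetical; starting offsets are staggered so workers begin scanning at different points of the node list rather than sharing one global index:

```go
package main

import "fmt"

// worker keeps its own iteration cursor instead of touching a shared
// atomic; no cross-worker synchronization is needed for the heuristic.
type worker struct {
	id       int
	cursor   int // worker-local lastProcessedNodeIndex
	allNodes int
}

// newWorker staggers each worker's starting offset across the node list.
func newWorker(id, totalWorkers, allNodes int) *worker {
	return &worker{id: id, cursor: (allNodes / totalWorkers) * id, allNodes: allNodes}
}

// advance moves this worker's cursor past the nodes it processed.
func (w *worker) advance(processed int) int {
	w.cursor = (w.cursor + processed) % w.allNodes
	return w.cursor
}

func main() {
	w0, w1 := newWorker(0, 2, 10), newWorker(1, 2, 10)
	fmt.Println(w0.advance(3), w1.advance(3)) // 3 8
}
```

The trade-off versus the shared atomic is that per-worker cursors can revisit nodes another worker already scanned, but they cannot lose or reset each other's progress.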
What type of PR is this?
/kind bug
/priority high
What this PR does / why we need it:
As #5146 mentioned, when multiple workers are enabled, optimistic parallel scheduling can lead to concurrent map read/write conflicts and cycleState overwriting issues. This is mainly because the Action is a singleton with a single framework field, while different workers have their own independent frameworks, causing concurrency conflicts when they work simultaneously: #5147 (comment)
Besides, `lastProcessedNodeIndex` is a plain int variable, which will also conflict in the multi-worker scenario, so it should likewise be updated to use an atomic type.

Which issue(s) this PR fixes:
Fixes #5146 #5147
Special notes for your reviewer:
Validation:
Test Yamls:
After creating 1000 replicas, the scheduler no longer hit panic errors.
Does this PR introduce a user-facing change?