Fix queue status conflicts in multi-scheduler by adding per-scheduler allocation tracking #5135

rishisulakhe wants to merge 8 commits into volcano-sh:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: (none). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Add SchedulerAllocation type and SchedulerAllocations field to QueueStatus Signed-off-by: Rishi Prasad Sulakhe <rishiprasadsulakhe@gmail.com>
Force-pushed from 7737dcc to 1dcee55.
Pull request overview
This PR addresses multi-scheduler queue status update conflicts in Volcano by introducing per-scheduler allocation reporting in QueueStatus, switching schedulers to Server-Side Apply (SSA) status updates, and having the queue controller aggregate those per-scheduler values back into the existing status.allocated field for backward compatibility.
Changes:
- Add `QueueStatus.SchedulerAllocations` (map keyed by scheduler identity) plus the `SchedulerAllocation` type in the API.
- Update scheduler session close logic to patch only the current scheduler's allocation entry via SSA with a per-scheduler `FieldManager`.
- Update queue controller to aggregate active scheduler entries into `status.allocated` and garbage-collect stale entries; add tests for aggregation logic.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| staging/src/volcano.sh/apis/pkg/client/applyconfiguration/utils.go | Registers apply-configuration kind lookup for SchedulerAllocation. |
| staging/src/volcano.sh/apis/pkg/client/applyconfiguration/scheduling/v1beta1/schedulerallocation.go | New generated apply-configuration type for SSA updates of scheduler allocation entries. |
| staging/src/volcano.sh/apis/pkg/client/applyconfiguration/scheduling/v1beta1/queuestatus.go | Extends queue status apply-config to include schedulerAllocations and builder method. |
| staging/src/volcano.sh/apis/pkg/apis/scheduling/zz_generated.deepcopy.go | Adds deep-copy support for SchedulerAllocations and SchedulerAllocation (internal API). |
| staging/src/volcano.sh/apis/pkg/apis/scheduling/v1beta1/zz_generated.deepcopy.go | Adds deep-copy support for SchedulerAllocations and SchedulerAllocation (v1beta1 API). |
| staging/src/volcano.sh/apis/pkg/apis/scheduling/v1beta1/zz_generated.conversion.go | Adds conversions for SchedulerAllocation and maps between v1beta1 and internal types. |
| staging/src/volcano.sh/apis/pkg/apis/scheduling/v1beta1/types.go | Introduces QueueStatus.SchedulerAllocations and SchedulerAllocation in v1beta1 API. |
| staging/src/volcano.sh/apis/pkg/apis/scheduling/types.go | Introduces QueueStatus.SchedulerAllocations and SchedulerAllocation in internal API. |
| pkg/scheduler/framework/session.go | Switches queue status update on session close to SSA patch of per-scheduler allocation entry. |
| pkg/scheduler/cache/interface.go | Adds cache API for scheduler identity and patching scheduler allocation via SSA. |
| pkg/scheduler/cache/cache.go | Implements scheduler identity and SSA-based PatchSchedulerAllocation. |
| pkg/scheduler/cache/cache_test.go | Adds unit test coverage for SchedulerIdentity. |
| pkg/controllers/queue/queue_controller_action.go | Aggregates SchedulerAllocations into status.allocated and garbage-collects stale entries. |
| pkg/controllers/queue/queue_controller_test.go | Adds tests for aggregation behavior and stale entry exclusion. |
```go
if sc.c != nil && sc.schedulerPodName != "" {
	return sc.schedulerPodName
}
return sc.schedulerNames[0]
```
SchedulerIdentity can panic if sc.schedulerNames is empty (sc.schedulerNames[0]). Even if current callers always pass at least one name, this is an exported Cache method and is used in multiple contexts (including tests/mocks). Please add a defensive check (e.g., return sc.schedulerPodName if set, otherwise return "" or a well-defined default / error) to avoid potential panics.
Suggested change:
```diff
-return sc.schedulerNames[0]
+if len(sc.schedulerNames) > 0 {
+	return sc.schedulerNames[0]
+}
+return ""
```
```go
	assert.True(t, ok, "expected resource %s in Allocated", resName)
	assert.True(t, expected.Equal(actual),
		"resource %s: expected %s, got %s", resName, expected.String(), actual.String())
}
```
This test doesn't assert anything when ExpectAllocated is empty (the loop over tc.ExpectAllocated is a no-op), so it won't catch regressions where item.Status.Allocated incorrectly retains old values. Please add an assertion on the overall map length/content (e.g., assert.Empty(t, item.Status.Allocated) or assert.Equal(t, tc.ExpectAllocated, item.Status.Allocated)).
Suggested change:
```diff
 	}
 }
+assert.Equal(t, tc.ExpectAllocated, item.Status.Allocated)
```
```go
if stateChanged || allocatedChanged {
	queueStatusApply := v1beta1apply.QueueStatus().WithState(queueStatus.State)
	if len(aggregatedAllocated) > 0 {
```
When the aggregated allocation is empty (e.g., no active SchedulerAllocations or all entries stale), this code skips calling WithAllocated, so a previously-populated status.allocated will never be cleared. That can leave stale Allocated values indefinitely and also makes allocatedChanged true without actually updating the field. Consider always applying the Allocated field (including empty/nil) when allocatedChanged is true so the controller can clear it correctly.
Suggested change:
```diff
-	if len(aggregatedAllocated) > 0 {
+	if allocatedChanged {
```
```go
var staleKeys []string
staleTTL := 5 * time.Minute
now := time.Now()
```
staleTTL := 5 * time.Minute is a hard-coded policy value embedded in the controller loop. Please consider promoting this to a named constant (or controller config/flag) so it's easier to discover, document, and adjust without editing logic.
```go
patchBytes, err := json.Marshal(patch)
if err == nil {
	_, patchErr := c.vcClient.SchedulingV1beta1().Queues().Patch(
		context.TODO(), queue.Name, types.MergePatchType, patchBytes, metav1.PatchOptions{}, "status")
	if patchErr != nil {
		klog.Errorf("Failed to clean up stale scheduler allocations for queue %s: %v", queue.Name, patchErr)
	}
}
```
The merge-patch cleanup path ignores json.Marshal errors (it just silently skips cleanup). Please handle the marshal error explicitly (at least log it) so failures to GC stale schedulerAllocations entries are visible during operations.
Code Review
This pull request introduces a per-scheduler allocation tracking system within the Queue status to support multi-scheduler and hash-ring deployments. It implements logic to aggregate individual scheduler allocations into a total Allocated field, adds a garbage collection mechanism for stale entries using a 5-minute TTL, and migrates status updates to use Server-Side Apply (SSA). The review feedback identifies a bug where the total allocation is not cleared when it becomes empty and suggests improvements for error handling during JSON marshaling, context propagation instead of using context.TODO(), and more rigorous unit testing to verify resource list sizes and clearing behavior.
```go
queueStatusApply := v1beta1apply.QueueStatus().WithState(queueStatus.State)
if len(aggregatedAllocated) > 0 {
	queueStatusApply = queueStatusApply.WithAllocated(aggregatedAllocated)
}
```
The logic for updating status.allocated is flawed. The check if len(aggregatedAllocated) > 0 prevents clearing status.allocated when it becomes empty (e.g., when all scheduler allocations expire). This can lead to an inaccurate status.allocated field that retains stale values. The WithAllocated method should be called whenever allocatedChanged is true.
Suggested change:
```go
queueStatusApply := v1beta1apply.QueueStatus().WithState(queueStatus.State).WithAllocated(aggregatedAllocated)
```

```go
// Aggregate per-scheduler allocations into total Allocated and garbage-collect stale entries
aggregatedAllocated := v1.ResourceList{}
var staleKeys []string
staleTTL := 5 * time.Minute
```
```go
patchBytes, err := json.Marshal(patch)
if err == nil {
	_, patchErr := c.vcClient.SchedulingV1beta1().Queues().Patch(
		context.TODO(), queue.Name, types.MergePatchType, patchBytes, metav1.PatchOptions{}, "status")
	if patchErr != nil {
		klog.Errorf("Failed to clean up stale scheduler allocations for queue %s: %v", queue.Name, patchErr)
	}
}
```
The error from json.Marshal is silently ignored. If marshaling fails, the patch for cleaning up stale scheduler allocations will not be applied, and no error will be logged. This can lead to stale entries never being removed. The error should be handled by logging it.
Suggested change:
```go
patchBytes, err := json.Marshal(patch)
if err != nil {
	klog.Errorf("Failed to marshal patch for stale scheduler allocations in queue %s: %v", queue.Name, err)
} else {
	_, patchErr := c.vcClient.SchedulingV1beta1().Queues().Patch(
		context.TODO(), queue.Name, types.MergePatchType, patchBytes, metav1.PatchOptions{}, "status")
	if patchErr != nil {
		klog.Errorf("Failed to clean up stale scheduler allocations for queue %s: %v", queue.Name, patchErr)
	}
}
```

```go
Name: "Empty scheduler allocations produces empty Allocated",
queue: &schedulingv1beta1.Queue{
	ObjectMeta: metav1.ObjectMeta{
		Name: "root",
	},
	Status: schedulingv1beta1.QueueStatus{
		State: schedulingv1beta1.QueueStateOpen,
	},
},
ExpectAllocated: v1.ResourceList{},
},
```
The test case "Empty scheduler allocations produces empty Allocated" is not fully testing the scenario it's named after. It would pass even with a bug where an empty aggregatedAllocated fails to clear a pre-existing Status.Allocated. To make this test more robust, the initial queue.Status.Allocated should be non-empty, and the test should assert that it becomes empty.
Suggested change:
```go
Name: "Empty scheduler allocations produces empty Allocated",
queue: &schedulingv1beta1.Queue{
	ObjectMeta: metav1.ObjectMeta{
		Name: "root",
	},
	Status: schedulingv1beta1.QueueStatus{
		State: schedulingv1beta1.QueueStateOpen,
		Allocated: v1.ResourceList{
			v1.ResourceCPU: resource.MustParse("1"),
		},
	},
},
ExpectAllocated: v1.ResourceList{},
```

```go
for resName, expected := range tc.ExpectAllocated {
	actual, ok := item.Status.Allocated[resName]
	assert.True(t, ok, "expected resource %s in Allocated", resName)
	assert.True(t, expected.Equal(actual),
		"resource %s: expected %s, got %s", resName, expected.String(), actual.String())
}
```
The current assertion only checks for the presence and value of expected resources. It does not verify if there are unexpected extra resources in item.Status.Allocated. To make the test more robust, you should also assert that the number of resource types in the actual and expected Allocated lists is the same.
Suggested change:
```go
assert.Len(t, item.Status.Allocated, len(tc.ExpectAllocated), "mismatched number of resource types in Allocated")
for resName, expected := range tc.ExpectAllocated {
	actual, ok := item.Status.Allocated[resName]
	assert.True(t, ok, "expected resource %s in Allocated", resName)
	assert.True(t, expected.Equal(actual),
		"resource %s: expected %s, got %s", resName, expected.String(), actual.String())
}
```
```go
fieldManager := "volcano-scheduler-" + identity
_, err := sc.vcClient.SchedulingV1beta1().Queues().ApplyStatus(
	context.TODO(), queueApply, metav1.ApplyOptions{FieldManager: fieldManager})
```
Using context.TODO() is generally discouraged as it provides an empty, non-cancellable context which can lead to issues with request tracing, cancellation, and deadline propagation. It would be better to pass a context.Context from the caller. Please consider adding a context.Context parameter to PatchSchedulerAllocation and passing it to the ApplyStatus call. This would require updating the Cache interface and the call site in pkg/scheduler/framework/session.go.
```go
if err := ssn.cache.PatchSchedulerAllocation(
	string(queueID), identity, queueAllocated); err != nil {
	klog.Errorf("failed to patch scheduler allocation for queue <%s>: %s",
		ssn.Queues[queueID].Name, err.Error())
}
```
Using context.TODO() is generally discouraged. It would be better to pass a context.Context down from the updateQueueStatus function's caller. Since PatchSchedulerAllocation is being modified to accept a context, this call site should be updated to pass a meaningful context instead of context.TODO().
Suggested change:
```diff
 if err := ssn.cache.PatchSchedulerAllocation(
-	string(queueID), identity, queueAllocated); err != nil {
+	context.TODO(), string(queueID), identity, queueAllocated); err != nil {
 	klog.Errorf("failed to patch scheduler allocation for queue <%s>: %s",
 		ssn.Queues[queueID].Name, err.Error())
 }
```
Hi @hajnalmt @JesseStutler, could you please review this?
What type of PR is this?
/kind bug
Which issue(s) this PR fixes:
Fixes #5084 (Solution 2)
What this PR does / why we need it:
When running multiple schedulers in Volcano v1.14.0+, each scheduler computes `Queue.Status.Allocated` from only its own local view, then does a full `UpdateStatus()` call. This causes two issues:
This happens both in the Agent Scheduler + Volcano Scheduler setup and in hash-ring based multi-scheduler deployments.
What changed and why
New API field: `QueueStatus.SchedulerAllocations`

Added a `map[string]SchedulerAllocation` field where each key is a scheduler's identity (the scheduler name, or the pod name for hash-ring deployments). Each entry holds that scheduler's allocated resources and a `LastUpdateTime` timestamp. This lets multiple schedulers write to the same queue without stepping on each other.

Scheduler side

Instead of calling `UpdateStatus()` (which replaces the entire status), each scheduler now calls `ApplyStatus()` (Server-Side Apply) with a unique `FieldManager` per scheduler instance. This means scheduler-A writing its entry never conflicts with scheduler-B writing its entry; Kubernetes SSA handles this natively.

Queue controller side

The queue controller's `syncQueue()` now sums all active `SchedulerAllocations` entries into the existing `Status.Allocated` field. This keeps backward compatibility: any tool reading `Status.Allocated` still sees the correct total. Entries older than 5 minutes are considered stale and are excluded from the sum, then garbage-collected via merge patch.

Backward compatibility

The existing `Status.Allocated` field is preserved and works exactly as before. Single-scheduler deployments see no behavior change: there is just one entry in the map, and the total equals that entry.