fix(capacity): avoid double-counting inqueue resources for jobs with tasks in Binding state#5130
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical bug in the capacity scheduler plugin that led to inaccurate resource accounting. Previously, jobs with tasks in the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
7d1f39b to
b642eba
Compare
There was a problem hiding this comment.
Code Review
This pull request addresses a double-counting bug in the capacity plugin that could lead to queues appearing over-capacity and blocking schedulable jobs. The fix involves deducting already-allocated task resources from minResources when calculating inqueue resources for jobs with tasks in the Binding state. This correction is applied in two locations within the capacity plugin: buildQueueAttrs and buildHierarchicalQueueAttrs. Additionally, a new test case, TestNoDoubleCountingForInqueueJobWithBindingTasks, is added to verify the fix.
| // calculate inqueue resource for inqueue jobs | ||
| // deduct already-allocated task resources from minResources to avoid double-counting: | ||
| // tasks in Allocated/Binding state are already tracked in attr.allocated (via AllocatedStatus), | ||
| // but the PodGroup stays Inqueue until tasks reach Running/Bound (ScheduledStatus). | ||
| // Without this deduction, the same resources appear in both attr.allocated and attr.inqueue. |
There was a problem hiding this comment.
There was a problem hiding this comment.
Please move the comment above the if statement. 👍
| // same double-counting fix as buildQueueAttrs: deduct already-allocated resources | ||
| // so tasks in Allocated/Binding state are not counted in both attr.allocated and attr.inqueue. |
There was a problem hiding this comment.
Yes, please move this comment above the if too.
There was a problem hiding this comment.
Pull request overview
Fixes capacity plugin queue accounting to prevent double-counting resources for jobs whose tasks have been allocated (incl. Binding) while the PodGroup phase is still Inqueue, which can incorrectly make queues appear over-capacity and block other jobs.
Changes:
- Update
PodGroupInqueuehandling inbuildQueueAttrsandbuildHierarchicalQueueAttrsto compute inqueue resources viautil.GetInqueueResource(job, job.Allocated)(deducting already-allocated resources). - Add a regression test intended to cover the double-counting scenario in the capacity plugin.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/scheduler/plugins/capacity/capacity.go | Prevents double-counting by subtracting already-allocated resources when computing attr.inqueue for PodGroupInqueue jobs (both flat + hierarchical paths). |
| pkg/scheduler/plugins/capacity/capacity_test.go | Adds a regression test to ensure another job can still be enqueued/bound when a prior job’s resources are already allocated. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hey @JesseStutler @hajnalmt @hzxuzhonghu, this is the follow-up @hajnalmt mentioned on #5100, same fix ported to the capacity plugin. Two spots instead of one since the pattern shows up in both |
b642eba to
c9828cf
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cycle1 := uthelper.WaitForBinds(t, binder.Channel, 3, 5*time.Second) | ||
| for _, pod := range []string{"ns1/pA1", "ns1/pA2", "ns1/pA3"} { |
There was a problem hiding this comment.
uthelper.WaitForBinds is referenced here, but there is no WaitForBinds function in pkg/scheduler/uthelper (repo-wide search only finds these call sites). This will fail to compile. Consider adding a local helper (like waitForBinds in plugins/proportion/proportion_test.go) or switching to an existing helper in this package (if available).
| // same double-counting fix as buildQueueAttrs: deduct already-allocated resources | ||
| // so tasks in Allocated/Binding state are not counted in both attr.allocated and attr.inqueue. | ||
| if job.PodGroup.Status.Phase == scheduling.PodGroupInqueue { | ||
| attr.inqueue.Add(job.DeductSchGatedResources(job.GetMinResources())) | ||
| if job.PodGroup.Spec.MinResources != nil { | ||
| inqueued := util.GetInqueueResource(job, job.Allocated) | ||
| attr.inqueue.Add(job.DeductSchGatedResources(inqueued)) | ||
| } |
There was a problem hiding this comment.
This change also updates hierarchical accounting in buildHierarchicalQueueAttrs, but the new regression test only exercises the non-hierarchical path (buildQueueAttrs). Adding a hierarchical variant of this regression (enable EnabledHierarchy in tiers and use a parent/leaf queue) would help ensure the same double-counting bug can’t reappear in hierarchical mode.
c9828cf to
68c3935
Compare
hajnalmt
left a comment
There was a problem hiding this comment.
Thanks!
I had minor comments only.
| // calculate inqueue resource for inqueue jobs | ||
| // deduct already-allocated task resources from minResources to avoid double-counting: | ||
| // tasks in Allocated/Binding state are already tracked in attr.allocated (via AllocatedStatus), | ||
| // but the PodGroup stays Inqueue until tasks reach Running/Bound (ScheduledStatus). | ||
| // Without this deduction, the same resources appear in both attr.allocated and attr.inqueue. |
There was a problem hiding this comment.
Please move the comment above the if statement. 👍
| // same double-counting fix as buildQueueAttrs: deduct already-allocated resources | ||
| // so tasks in Allocated/Binding state are not counted in both attr.allocated and attr.inqueue. |
There was a problem hiding this comment.
Yes, please move this comment above the if too.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
14e5225 to
c2baa2a
Compare
There was a problem hiding this comment.
Thanks,
Please check gofmt. The CI is failing
hack/verify-gofmt.sh
diff ./pkg/scheduler/plugins/proportion/proportion_test.go.orig ./pkg/scheduler/plugins/proportion/proportion_test.go
--- ./pkg/scheduler/plugins/proportion/proportion_test.go.orig
+++ ./pkg/scheduler/plugins/proportion/proportion_test.go
@@ -47,7 +47,6 @@
"volcano.sh/volcano/pkg/scheduler/util"
)
-
func TestMain(m *testing.M) {
options.Default()
os.Exit(m.Run())
make: *** [Makefile:255: verify] Error 1
Error: Process completed with exit code 2.
Also one of your commit message is not proper. You cannot have ats(@) in commit messages. This is not valid:
Also drop the @mention from the capacity test comment.
c2baa2a to
04f943f
Compare
…tasks in Binding state Signed-off-by: Aman-Cool <aman017102007@gmail.com>
The helper was copy-pasted across proportion and capacity plugin tests. Move it to uthelper.WaitForBinds so both packages share the same implementation. Also drop the mention from the capacity test comment. Signed-off-by: Aman-Cool <aman017102007@gmail.com>
Move the double-counting explanation comments to sit directly above the MinResources guard inside the PodGroupInqueue block, as requested in review. Signed-off-by: Aman-Cool <aman017102007@gmail.com>
04f943f to
428d7b8
Compare
|
hey @hajnalmt, sorry for pinging you before the CI checks were completed. |
What type of PR is this?
Bug fix
What this PR does / why we need it:
Follow-up to #5100, same double-counting bug, same fix, but in the capacity plugin. @hajnalmt flagged this during review of that PR.
When a job's tasks land in
Binding, they show up inattr.allocated(becauseBindingis part ofAllocatedStatus). But thePodGroupstaysInqueueuntil those tasks actually reachRunning, so thePodGroupInqueuebranch was also adding the fullminResourcestoattr.inqueue. Same resources counted twice, queue looks over-capacity, other jobs pile up inPendingfor no reason.The
PodGroupRunningbranch already does this right, it callsutil.GetInqueueResource(job, job.Allocated)which subtracts what's already allocated before adding to inqueue. Just needed to do the same forPodGroupInqueue. The capacity plugin has this pattern in two places (buildQueueAttrsandbuildHierarchicalQueueAttrs), both fixed here.Which issue(s) this PR fixes:
Fixes #5099
Special notes for your reviewer:
Both spots in the capacity plugin are fixed,
buildQueueAttrsandbuildHierarchicalQueueAttrs. Logic is identical to what landed in #5100 for proportion, just applied here.Does this PR introduce a user-facing change?
Fix capacity plugin double-counting inqueue resources for jobs with tasks in
Bindingstate, which could cause queues to appear over-capacity and block schedulable jobs.