Skip to content

fix(capacity): avoid double-counting inqueue resources for jobs with tasks in Binding state#5130

Open
Aman-Cool wants to merge 3 commits intovolcano-sh:masterfrom
Aman-Cool:fix/capacity-inqueue-double-counting
Open

fix(capacity): avoid double-counting inqueue resources for jobs with tasks in Binding state#5130
Aman-Cool wants to merge 3 commits intovolcano-sh:masterfrom
Aman-Cool:fix/capacity-inqueue-double-counting

Conversation

@Aman-Cool
Copy link
Copy Markdown
Contributor

What type of PR is this?
Bug fix

What this PR does / why we need it:

Follow-up to #5100, same double-counting bug, same fix, but in the capacity plugin. @hajnalmt flagged this during review of that PR.

When a job's tasks land in Binding, they show up in attr.allocated (because Binding is part of AllocatedStatus). But the PodGroup stays Inqueue until those tasks actually reach Running, so the PodGroupInqueue branch was also adding the full minResources to attr.inqueue. Same resources counted twice, queue looks over-capacity, other jobs pile up in Pending for no reason.

The PodGroupRunning branch already does this right, it calls util.GetInqueueResource(job, job.Allocated) which subtracts what's already allocated before adding to inqueue. Just needed to do the same for PodGroupInqueue. The capacity plugin has this pattern in two places (buildQueueAttrs and buildHierarchicalQueueAttrs), both fixed here.

Which issue(s) this PR fixes:
Fixes #5099

Special notes for your reviewer:
Both spots in the capacity plugin are fixed, buildQueueAttrs and buildHierarchicalQueueAttrs. Logic is identical to what landed in #5100 for proportion, just applied here.

Does this PR introduce a user-facing change?

Fix capacity plugin double-counting inqueue resources for jobs with tasks in Binding state, which could cause queues to appear over-capacity and block schedulable jobs.

Copilot AI review requested due to automatic review settings March 23, 2026 21:06
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the capacity scheduler plugin that led to inaccurate resource accounting. Previously, jobs with tasks in the Binding state would have their resources double-counted, causing queues to appear over-capacity and unnecessarily blocking other schedulable jobs. The fix ensures that already-allocated resources are properly subtracted when calculating in-queue resources, thereby providing a more accurate representation of available capacity and improving job scheduling efficiency.

Highlights

  • Bug Fix: Addressed a double-counting bug in the capacity plugin where in-queue resources were incorrectly calculated for jobs with tasks in the Binding state.
  • Resource Calculation Logic: Modified buildQueueAttrs and buildHierarchicalQueueAttrs functions to use util.GetInqueueResource to deduct already-allocated task resources, preventing resources from being counted in both attr.allocated and attr.inqueue.
  • Test Coverage: Introduced a new regression test (TestNoDoubleCountingForInqueueJobWithBindingTasks) to validate the fix, ensuring that jobs are correctly enqueued when capacity is available.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 23, 2026
@Aman-Cool Aman-Cool changed the title fix(capacity): avoid double-counting inqueue resources for jobs with … fix(capacity): avoid double-counting inqueue resources for jobs with tasks in Binding state Mar 23, 2026
@Aman-Cool Aman-Cool force-pushed the fix/capacity-inqueue-double-counting branch from 7d1f39b to b642eba Compare March 23, 2026 21:08
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a double-counting bug in the capacity plugin that could lead to queues appearing over-capacity and blocking schedulable jobs. The fix involves deducting already-allocated task resources from minResources when calculating inqueue resources for jobs with tasks in the Binding state. This correction is applied in two locations within the capacity plugin: buildQueueAttrs and buildHierarchicalQueueAttrs. Additionally, a new test case, TestNoDoubleCountingForInqueueJobWithBindingTasks, is added to verify the fix.

Comment on lines +511 to +515
// calculate inqueue resource for inqueue jobs
// deduct already-allocated task resources from minResources to avoid double-counting:
// tasks in Allocated/Binding state are already tracked in attr.allocated (via AllocatedStatus),
// but the PodGroup stays Inqueue until tasks reach Running/Bound (ScheduledStatus).
// Without this deduction, the same resources appear in both attr.allocated and attr.inqueue.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment is helpful in explaining the rationale behind the code change, specifically addressing the double-counting issue. However, it might be more effective to move this comment block directly above the if statement on line 516 to provide immediate context for the conditional logic.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the comment above the if statement. 👍

Comment on lines +649 to +650
// same double-counting fix as buildQueueAttrs: deduct already-allocated resources
// so tasks in Allocated/Binding state are not counted in both attr.allocated and attr.inqueue.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment explains the purpose of the fix. Consider moving this comment block directly above the if statement on line 651 to provide immediate context for the conditional logic.

Copy link
Copy Markdown
Member

@hajnalmt hajnalmt Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please move this comment above the if too.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes capacity plugin queue accounting to prevent double-counting resources for jobs whose tasks have been allocated (incl. Binding) while the PodGroup phase is still Inqueue, which can incorrectly make queues appear over-capacity and block other jobs.

Changes:

  • Update PodGroupInqueue handling in buildQueueAttrs and buildHierarchicalQueueAttrs to compute inqueue resources via util.GetInqueueResource(job, job.Allocated) (deducting already-allocated resources).
  • Add a regression test intended to cover the double-counting scenario in the capacity plugin.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
pkg/scheduler/plugins/capacity/capacity.go Prevents double-counting by subtracting already-allocated resources when computing attr.inqueue for PodGroupInqueue jobs (both flat + hierarchical paths).
pkg/scheduler/plugins/capacity/capacity_test.go Adds a regression test to ensure another job can still be enqueued/bound when a prior job’s resources are already allocated.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Aman-Cool
Copy link
Copy Markdown
Contributor Author

Hey @JesseStutler @hajnalmt @hzxuzhonghu, this is the follow-up @hajnalmt mentioned on #5100, same fix ported to the capacity plugin. Two spots instead of one since the pattern shows up in both buildQueueAttrs and buildHierarchicalQueueAttrs. Regression test included, all existing tests still pass.

@Aman-Cool Aman-Cool force-pushed the fix/capacity-inqueue-double-counting branch from b642eba to c9828cf Compare March 23, 2026 21:30
@Aman-Cool Aman-Cool requested a review from Copilot March 23, 2026 21:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1160 to +1161
cycle1 := uthelper.WaitForBinds(t, binder.Channel, 3, 5*time.Second)
for _, pod := range []string{"ns1/pA1", "ns1/pA2", "ns1/pA3"} {
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uthelper.WaitForBinds is referenced here, but there is no WaitForBinds function in pkg/scheduler/uthelper (repo-wide search only finds these call sites). This will fail to compile. Consider adding a local helper (like waitForBinds in plugins/proportion/proportion_test.go) or switching to an existing helper in this package (if available).

Copilot uses AI. Check for mistakes.
Comment on lines +649 to +655
// same double-counting fix as buildQueueAttrs: deduct already-allocated resources
// so tasks in Allocated/Binding state are not counted in both attr.allocated and attr.inqueue.
if job.PodGroup.Status.Phase == scheduling.PodGroupInqueue {
attr.inqueue.Add(job.DeductSchGatedResources(job.GetMinResources()))
if job.PodGroup.Spec.MinResources != nil {
inqueued := util.GetInqueueResource(job, job.Allocated)
attr.inqueue.Add(job.DeductSchGatedResources(inqueued))
}
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change also updates hierarchical accounting in buildHierarchicalQueueAttrs, but the new regression test only exercises the non-hierarchical path (buildQueueAttrs). Adding a hierarchical variant of this regression (enable EnabledHierarchy in tiers and use a parent/leaf queue) would help ensure the same double-counting bug can’t reappear in hierarchical mode.

Copilot uses AI. Check for mistakes.
@Aman-Cool Aman-Cool force-pushed the fix/capacity-inqueue-double-counting branch from c9828cf to 68c3935 Compare March 23, 2026 21:45
Copy link
Copy Markdown
Member

@hajnalmt hajnalmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
I had minor comments only.

Comment on lines +511 to +515
// calculate inqueue resource for inqueue jobs
// deduct already-allocated task resources from minResources to avoid double-counting:
// tasks in Allocated/Binding state are already tracked in attr.allocated (via AllocatedStatus),
// but the PodGroup stays Inqueue until tasks reach Running/Bound (ScheduledStatus).
// Without this deduction, the same resources appear in both attr.allocated and attr.inqueue.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the comment above the if statement. 👍

Comment on lines +649 to +650
// same double-counting fix as buildQueueAttrs: deduct already-allocated resources
// so tasks in Allocated/Binding state are not counted in both attr.allocated and attr.inqueue.
Copy link
Copy Markdown
Member

@hajnalmt hajnalmt Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please move this comment above the if too.

@volcano-sh-bot volcano-sh-bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 30, 2026
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign thor-wl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Aman-Cool Aman-Cool force-pushed the fix/capacity-inqueue-double-counting branch 2 times, most recently from 14e5225 to c2baa2a Compare March 30, 2026 20:13
@Aman-Cool Aman-Cool requested a review from hajnalmt March 30, 2026 20:14
Copy link
Copy Markdown
Member

@hajnalmt hajnalmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks,
Please check gofmt. The CI is failing

hack/verify-gofmt.sh
diff ./pkg/scheduler/plugins/proportion/proportion_test.go.orig ./pkg/scheduler/plugins/proportion/proportion_test.go
--- ./pkg/scheduler/plugins/proportion/proportion_test.go.orig
+++ ./pkg/scheduler/plugins/proportion/proportion_test.go
@@ -47,7 +47,6 @@
 	"volcano.sh/volcano/pkg/scheduler/util"
 )
 
-
 func TestMain(m *testing.M) {
 	options.Default()
 	os.Exit(m.Run())
make: *** [Makefile:255: verify] Error 1
Error: Process completed with exit code 2.

Also one of your commit message is not proper. You cannot have ats(@) in commit messages. This is not valid:
Also drop the @mention from the capacity test comment.

@Aman-Cool Aman-Cool force-pushed the fix/capacity-inqueue-double-counting branch from c2baa2a to 04f943f Compare April 1, 2026 02:33
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Apr 1, 2026
…tasks in Binding state

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
The helper was copy-pasted across proportion and capacity plugin tests.
Move it to uthelper.WaitForBinds so both packages share the same
implementation. Also drop the mention from the capacity test comment.

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
Move the double-counting explanation comments to sit directly above the
MinResources guard inside the PodGroupInqueue block, as requested in review.

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool Aman-Cool force-pushed the fix/capacity-inqueue-double-counting branch from 04f943f to 428d7b8 Compare April 1, 2026 02:34
@Aman-Cool
Copy link
Copy Markdown
Contributor Author

hey @hajnalmt, sorry for pinging you before the CI checks were completed.
removed the cursed "@" from commit messages😅

@Aman-Cool Aman-Cool requested a review from hajnalmt April 1, 2026 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

proportion plugin double-counts resources for jobs with tasks stuck in Binding, starves queue

4 participants