fix: nodes_alloc stats overcount#197

Open
cgetzen wants to merge 1 commit into SchedMD:master from taichi-dev:fix/node-alloc-stats-overcount

Conversation

cgetzen commented Apr 8, 2026

Problem

slurm_user_jobs_nodes_alloc and slurm_jobs_nodes_alloc report inflated node counts when a user runs multiple sub-node-sized jobs that share physical nodes.

Root cause

nodes_alloc is computed by summing total_nodes across each of the user's individual running jobs. When multiple jobs fit on the same physical node (e.g. jobs requesting only a fraction of a node's CPUs), each job contributes its own total_nodes = 1 to the sum, even though they all share the same node.

Solution

Track node allocation with a bitmap: OR each job's node_bitmap into an accumulator, then, after all jobs have been processed, count the set bits to get the actual number of unique nodes in use.

Previously, nodes_alloc was computed by summing each job's node count,
which overcounts nodes shared by multiple jobs. Fix this by accumulating
a node_bitmap across jobs and using bit_set_count() to get the true
number of unique allocated nodes.

Changelog: Fix nodes_alloc statistics overcount caused by shared nodes being counted multiple times across jobs.
Signed-off-by: Charlie Getzen <charliegetzen@meshy.ai>
cgetzen force-pushed the fix/node-alloc-stats-overcount branch from dbc42f0 to 3e800a8 on April 8, 2026 19:47
msteed (Contributor) commented Apr 14, 2026

Hi @cgetzen - thanks for the report and the PR. I can reproduce the failure. I've reviewed the other statistics too and it seems like these are the only two affected in this way.

I'd like to put your changes through our QA process, which involves tracking the issue in our bugzilla and creating an MR in our private gitlab instance. We'll keep your commit as-is but the QA process may result in other changes that go in as part of the complete fix. If your change is merged you'll see that reflected in the public github repo and it will ship in an upcoming release of slurm.

Does this sound okay?

cgetzen (Author) commented Apr 14, 2026

Sure! I learned about the bug process after I created the PR.

https://support.schedmd.com/show_bug.cgi?id=24994 is the corresponding bug report.

Feel free to close/re-create/modify the report or PR to conform to the correct process.

Thanks! 🫡
