fix: nodes_alloc stats overcount #197

Open
cgetzen wants to merge 1 commit into SchedMD:master from taichi-dev:fix/node-alloc-stats-overcount

cgetzen commented Apr 8, 2026

Problem

slurm_user_jobs_nodes_alloc and slurm_jobs_nodes_alloc report inflated node counts when a user runs multiple sub-node-sized jobs that share physical nodes.

Root cause

nodes_alloc is computed by summing total_nodes across each of the user's running jobs. When multiple jobs fit on the same physical node (e.g. jobs that each request only a fraction of a node's CPUs), every job contributes its own total_nodes = 1 to the sum, even though they all share that one node.
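
To make the overcount concrete, here is a minimal sketch of the summing approach described above. It is not the Slurm code: `struct toy_job` and `summed_nodes_alloc()` are hypothetical names used only for illustration, and each job is reduced to the one field that matters here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a running job; not Slurm's job record. */
struct toy_job {
    uint32_t total_nodes; /* number of nodes this job occupies */
};

/* Sum each job's node count, as the pre-fix statistic is described. */
uint32_t summed_nodes_alloc(const struct toy_job *jobs, size_t njobs)
{
    assert(jobs != NULL || njobs == 0);

    uint32_t sum = 0;
    for (size_t i = 0; i < njobs; i++)
        sum += jobs[i].total_nodes; /* shared nodes are counted once per job */
    return sum;
}
```

For three quarter-node jobs that all land on the same node, each with `total_nodes = 1`, this returns 3 even though only one physical node is in use.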

Solution

Track node allocation with a bitmap instead of a running sum: OR each job's node_bitmap into an accumulated bitmap. After all jobs have been processed, count the set bits to get the actual number of unique nodes in use.
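
The idea can be sketched in a few lines of standalone C. This is an illustration under simplifying assumptions, not the patch itself: a single `uint64_t` stands in for Slurm's node bitmap (so at most 64 nodes), and `struct toy_job`, `node_bit()`, `count_set_bits()`, and `unique_nodes_alloc()` are hypothetical names. In the actual change the final count comes from bit_set_count().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64 /* assumption: the cluster fits in one 64-bit mask */

/* Hypothetical stand-in for a running job; bit i set means node i is used. */
struct toy_job {
    uint64_t node_mask;
};

/* Build a mask with only the bit for node idx set. */
static uint64_t node_bit(unsigned idx)
{
    assert(idx < MAX_NODES);
    return UINT64_C(1) << idx;
}

/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned count_set_bits(uint64_t mask)
{
    unsigned count = 0;
    while (mask) {
        mask &= mask - 1;
        count++;
    }
    return count;
}

/* OR every job's nodes together, then count each node exactly once. */
unsigned unique_nodes_alloc(const struct toy_job *jobs, size_t njobs)
{
    uint64_t seen = 0;
    for (size_t i = 0; i < njobs; i++)
        seen |= jobs[i].node_mask;
    return count_set_bits(seen);
}
```

With two jobs on node 0 and a third job spanning nodes 0 and 1, `unique_nodes_alloc()` returns 2, where summing per-job node counts would give 4.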

Previously, nodes_alloc was computed by summing each job's node count,
which overcounts nodes shared by multiple jobs. Fix this by accumulating
a node_bitmap across jobs and using bit_set_count() to get the true
number of unique allocated nodes.

Changelog: Fix nodes_alloc statistics overcount caused by shared nodes being counted multiple times across jobs.
Signed-off-by: Charlie Getzen <[email protected]>
cgetzen force-pushed the fix/node-alloc-stats-overcount branch from dbc42f0 to 3e800a8 on April 8, 2026 19:47