fix: nodes_alloc stats overcount #197
Previously, nodes_alloc was computed by summing each job's node count, which overcounts nodes shared by multiple jobs. Fix this by accumulating a node_bitmap across jobs and using bit_set_count() to get the true number of unique allocated nodes.

Changelog: Fix nodes_alloc statistics overcount caused by shared nodes being counted multiple times across jobs.
Signed-off-by: Charlie Getzen <charliegetzen@meshy.ai>
dbc42f0 to 3e800a8
Hi @cgetzen - thanks for the report and the PR. I can reproduce the failure. I've reviewed the other statistics too and it seems like these are the only two affected in this way. I'd like to put your changes through our QA process, which involves tracking the issue in our bugzilla and creating an MR in our private gitlab instance. We'll keep your commit as-is but the QA process may result in other changes that go in as part of the complete fix. If your change is merged you'll see that reflected in the public github repo and it will ship in an upcoming release of slurm. Does this sound okay?
Sure! I learned about the bug process after I created the PR. https://support.schedmd.com/show_bug.cgi?id=24994 is the corresponding bug report. Feel free to close/re-create/modify the report or PR to conform to the correct process. Thanks! 🫡
Problem
slurm_user_jobs_nodes_alloc and slurm_jobs_nodes_alloc report inflated node counts when a user runs multiple sub-node-sized jobs that share physical nodes.
Root cause
nodes_alloc is computed by summing total_nodes across each individual running job for the user. When multiple jobs fit on the same physical node (e.g. jobs requesting only a fraction of a node's CPUs), each job contributes its own total_nodes = 1 to the sum, even though they all share the same node.
Solution
Track node allocation using a bitmap: OR each job's node_bitmap into an accumulator. After all jobs have been processed, count the set bits to get the actual number of unique nodes in use.
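The difference between the two counting strategies can be sketched in plain C. This is a minimal illustration, not the code from the commit: it assumes a cluster of at most 64 nodes so that a single uint64_t can stand in for Slurm's bitmap type, and struct job, count_set_bits(), nodes_alloc_by_sum() and nodes_alloc_by_bitmap() are hypothetical names invented for the example (count_set_bits() plays the role that bit_set_count() has in the PR).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a job record (not Slurm's real struct). */
struct job {
    uint64_t node_bitmap;  /* bit i set => the job runs on node i */
    unsigned total_nodes;  /* number of nodes in this job's allocation */
};

/* Count the set bits in a 64-bit word (stand-in for bit_set_count()). */
static unsigned count_set_bits(uint64_t bits)
{
    unsigned n = 0;
    while (bits) {
        bits &= bits - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Old behaviour: sum per-job node counts, so a node shared by
 * k jobs is counted k times. */
static unsigned nodes_alloc_by_sum(const struct job *jobs, size_t njobs)
{
    assert(jobs != NULL || njobs == 0);
    unsigned total = 0;
    for (size_t i = 0; i < njobs; i++)
        total += jobs[i].total_nodes;
    return total;
}

/* Fixed behaviour: OR every job's bitmap into one accumulator,
 * then count the set bits, so each node is counted once. */
static unsigned nodes_alloc_by_bitmap(const struct job *jobs, size_t njobs)
{
    assert(jobs != NULL || njobs == 0);
    uint64_t allocated = 0;
    for (size_t i = 0; i < njobs; i++)
        allocated |= jobs[i].node_bitmap;
    return count_set_bits(allocated);
}
```

For example, with three single-node jobs that all run on node 0 and a fourth job on node 1, nodes_alloc_by_sum() returns 4 while nodes_alloc_by_bitmap() returns 2, which matches the number of physical nodes actually in use.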