Previously, nodes_alloc was computed by summing each job's node count, which overcounts nodes shared by multiple jobs. Fix this by accumulating a node_bitmap across jobs and using bit_set_count() to get the true number of unique allocated nodes.

Changelog: Fix nodes_alloc statistics overcount caused by shared nodes being counted multiple times across jobs.
Signed-off-by: Charlie Getzen <[email protected]>
dbc42f0 to 3e800a8
Problem
slurm_user_jobs_nodes_alloc and slurm_jobs_nodes_alloc report inflated node counts when a user runs multiple sub-node-sized jobs that share physical nodes.
Root cause
nodes_alloc is computed by summing total_nodes across each individual running job for the user. When multiple jobs fit on the same physical node (e.g. jobs requesting only a fraction of a node's CPUs), each job contributes its own total_nodes = 1 to the sum, even though they all share the same node.
Solution
Track node allocation with a bitmap: OR each job's node_bitmap into an accumulator. After all jobs have been processed, count the set bits to get the actual number of unique nodes in use.
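
The difference between the two counting strategies can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions, not the code from this pull request: node_set_t, toy_job_t, toy_job_init and the nodes_alloc_by_* functions are hypothetical names, and the fixed-size bitmap is a toy stand-in for Slurm's own bitstring type rather than its real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_NODES 128u                       /* hypothetical cluster size */
#define SET_WORDS ((CLUSTER_NODES + 63u) / 64u)

/* Toy fixed-size bitmap; a stand-in for Slurm's bitmap type, not its real API. */
typedef struct {
    uint64_t words[SET_WORDS];
} node_set_t;

/* Toy job record holding only the two fields the description mentions. */
typedef struct {
    uint32_t total_nodes;
    node_set_t node_bitmap;
} toy_job_t;

/* Mark one node index as allocated. */
static void node_set_add(node_set_t *set, unsigned node)
{
    assert(node < CLUSTER_NODES);
    set->words[node / 64u] |= UINT64_C(1) << (node % 64u);
}

/* dst |= src, word by word. */
static void node_set_or(node_set_t *dst, const node_set_t *src)
{
    for (size_t i = 0; i < SET_WORDS; i++)
        dst->words[i] |= src->words[i];
}

/* Count set bits by clearing the lowest set bit until the word is zero. */
static unsigned node_set_count(const node_set_t *set)
{
    unsigned count = 0;
    for (size_t i = 0; i < SET_WORDS; i++) {
        for (uint64_t w = set->words[i]; w != 0; w &= w - 1)
            count++;
    }
    return count;
}

/* Hypothetical helper: build a job that runs on the listed node indices. */
void toy_job_init(toy_job_t *job, const unsigned *nodes, size_t n_nodes)
{
    memset(job, 0, sizeof(*job));
    for (size_t i = 0; i < n_nodes; i++)
        node_set_add(&job->node_bitmap, nodes[i]);
    job->total_nodes = node_set_count(&job->node_bitmap);
}

/* Old behaviour: sum per-job node counts; shared nodes are counted repeatedly. */
unsigned nodes_alloc_by_sum(const toy_job_t *jobs, size_t n_jobs)
{
    unsigned total = 0;
    for (size_t i = 0; i < n_jobs; i++)
        total += jobs[i].total_nodes;
    return total;
}

/* New behaviour: OR every job's bitmap into one accumulator, then count bits. */
unsigned nodes_alloc_by_bitmap(const toy_job_t *jobs, size_t n_jobs)
{
    node_set_t seen;
    memset(&seen, 0, sizeof(seen));
    for (size_t i = 0; i < n_jobs; i++)
        node_set_or(&seen, &jobs[i].node_bitmap);
    return node_set_count(&seen);
}
```

For three jobs placed on nodes {5}, {5} and {5, 9}, nodes_alloc_by_sum returns 4 while nodes_alloc_by_bitmap returns 2, which matches the overcount described above.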