feat: add per-pod GPU SM utilization metrics for time-slicing workloads #638
Open

zbennett10 wants to merge 7 commits into NVIDIA:main
Conversation
Closes NVIDIA#587

## Problem

When CUDA time-slicing is active (multiple pods sharing one physical GPU), `dcgm_fi_dev_gpu_util` reports aggregate device utilization — you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.

## Solution

Add an opt-in `ProcessPodCollector` that attributes SM utilization to individual pods by joining:

1. NVML `nvmlDeviceGetProcessUtilization()` — per-PID SM utilization from the driver
2. The kubelet pod-resources gRPC API — maps GPU UUID → (pod, namespace, container)
3. `/proc/<pid>/cgroup` — links NVML PIDs back to container identities

## New metric

```
dcgm_fi_dev_sm_util_per_pod{
  gpu="0", uuid="GPU-abc123",
  pod="synapse-proxy-...", namespace="prod", container="proxy"
} 42
```

## Enabling

```
--enable-per-pod-gpu-util=true
# or: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true
```

Requires `hostPID: true` (auto-set when using the GPU Operator integration).

## Files changed

- `internal/pkg/collector/process_pod_collector.go` — new collector
- `internal/pkg/collector/process_pod_collector_test.go` — unit tests (10 cases, no GPU needed)
- `internal/pkg/collector/collector_factory.go` — register the new collector
- `internal/pkg/appconfig/types.go` — `EnablePerPodGPUUtil` flag
- `internal/pkg/counters/const.go` — `DCGM_EXP_SM_UTIL_PER_POD` counter name
- `pkg/cmd/app.go` — `--enable-per-pod-gpu-util` CLI flag
- `docs/per-pod-gpu-metrics.md` — usage documentation
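The three-way join in the solution above can be sketched in a few lines of Go. This is a minimal illustration, not the PR's actual code: `podInfo`, `containerIDFromCgroup`, and the fake maps are hypothetical names, and real cgroup parsing varies by container runtime and cgroup driver.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// podInfo is a hypothetical stand-in for the (pod, namespace, container)
// tuple derived from the kubelet pod-resources API.
type podInfo struct {
	Pod, Namespace, Container string
}

// containerIDRe extracts a 64-hex-char container ID from a cgroup path.
// Common cgroupfs/systemd layouts embed the ID this way; real parsing
// needs to handle more variants.
var containerIDRe = regexp.MustCompile(`[0-9a-f]{64}`)

// containerIDFromCgroup pulls the container ID out of one /proc/<pid>/cgroup line.
func containerIDFromCgroup(line string) (string, bool) {
	m := containerIDRe.FindString(line)
	return m, m != ""
}

func main() {
	// Fake per-PID SM utilization, shaped like nvmlDeviceGetProcessUtilization samples.
	smUtilByPID := map[int]uint32{4242: 42}

	// Fake /proc/<pid>/cgroup contents for that PID.
	cgroupByPID := map[int]string{
		4242: "0::/kubepods/burstable/pod1234/" + strings.Repeat("ab", 32),
	}

	// Fake container-ID -> pod identity mapping from the pod-resources API.
	podByContainerID := map[string]podInfo{
		strings.Repeat("ab", 32): {Pod: "synapse-proxy-0", Namespace: "prod", Container: "proxy"},
	}

	// Join: PID -> container ID -> pod identity -> labeled gauge sample.
	for pid, util := range smUtilByPID {
		id, ok := containerIDFromCgroup(cgroupByPID[pid])
		if !ok {
			continue // PID not running in a container
		}
		p := podByContainerID[id]
		fmt.Printf("dcgm_fi_dev_sm_util_per_pod{pod=%q,namespace=%q,container=%q} %d\n",
			p.Pod, p.Namespace, p.Container, util)
	}
}
```

Running the sketch prints one labeled sample, `dcgm_fi_dev_sm_util_per_pod{pod="synapse-proxy-0",namespace="prod",container="proxy"} 42`, which mirrors the metric shape shown above.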
Run TestProcessPodCollector tests in CI to validate the new process_pod_collector.go on ubuntu-latest (Linux/x86_64 — the only supported build platform for this CGo-dependent project). Also upgrade checkout/setup-go action versions and add apt gcc dep.
…atch

Three compilation errors:

1. Remove the stdlib `os` import from `collector_factory.go` — the collector package already has `var os osinterface.OS` in `variables.go` for testable `os.Exit` calls; the stdlib `os` import conflicts with it. (My previous session incorrectly added the stdlib import.)
2. Alias stdlib `os` to `stdos` in `process_pod_collector.go` — `ReadFile` is the only stdlib `os` usage; the package-level `os` variable must not be shadowed.
3. Add `...grpc.CallOption` to `podResourcesClient.List()` — kubelet's `PodResourcesListerClient.List()` includes a variadic `CallOption`; our interface must match to satisfy the type.
… CI tests

- Remove `go:generate mockgen` directives for `nvmlDevice`/`nvmlLib`/`podResourcesClient` — mockgen cannot generate mocks for unexported interfaces from external packages, causing compilation errors. Tests use hand-coded fakes in `process_pod_collector_test.go` instead.
- Scope the 'unit-tests' CI job to `TestProcessPodCollector` only, excluding packages that require `libdcgm.so.4` / `libnvidia-ml` (`integration_test`, `nvmlprovider`, `server`), which are pre-existing DCGM-only failures unrelated to this PR.

Signed-off-by: Zachary Bennett <[email protected]>
`DCGM_EXP_SM_UTIL_PER_POD` is a synthetic NVML-driven metric and does not correspond to a real DCGM field, so it cannot appear in the stock `dcp-metrics-included.csv`. Previously `NewProcessPodCollector` would fail with:

    collector 'DCGM_EXP_SM_UTIL_PER_POD' cannot be initialized; err: counter not found in counter list

Fix: define the counter inline with sensible defaults. Allow the user to override via their metrics CSV if they want custom help text.

Signed-off-by: Zachary Bennett <[email protected]>
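The "inline default with CSV override" pattern from this commit can be sketched as follows. All names here (`counter`, `resolveCounter`, `defaultPerPodCounter`, the field layout) are hypothetical stand-ins for the exporter's real counter-list types, chosen only to show the fallback logic.

```go
package main

import "fmt"

// counter loosely mirrors one entry of the exporter's counter list.
type counter struct {
	FieldName, PromType, Help string
}

// defaultPerPodCounter is a built-in definition for the synthetic
// DCGM_EXP_SM_UTIL_PER_POD metric, used when the user's CSV omits it.
var defaultPerPodCounter = counter{
	FieldName: "DCGM_EXP_SM_UTIL_PER_POD",
	PromType:  "gauge",
	Help:      "Per-pod SM utilization derived from NVML process samples.",
}

// resolveCounter returns the user's CSV entry when present (e.g. to carry
// custom help text), else the inline default, so the collector can always
// be initialized.
func resolveCounter(csv map[string]counter) counter {
	if c, ok := csv[defaultPerPodCounter.FieldName]; ok {
		return c
	}
	return defaultPerPodCounter
}

func main() {
	// No CSV entry at all: fall back to the built-in default.
	fmt.Println(resolveCounter(nil).Help)

	// A CSV entry for the counter wins over the default.
	csv := map[string]counter{"DCGM_EXP_SM_UTIL_PER_POD": {Help: "custom help"}}
	fmt.Println(resolveCounter(csv).Help)
}
```

The key design choice is that a missing CSV entry is no longer an initialization error for this one synthetic counter, while every user-supplied definition still takes precedence.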
Add GetName() to nvmlDevice interface and populate GPU, UUID (label key),
GPUDevice, and GPUModelName fields on the ProcessPodCollector metric.
Previously the Prometheus output had empty label keys (="GPU-UUID...")
and empty gpu/device/modelName fields. Now emits:
DCGM_EXP_SM_UTIL_PER_POD{gpu="0",UUID="GPU-...",device="nvidia0",
modelName="NVIDIA A10G",...,pod="...",namespace="...",container="..."} 42
Update fakeNVMLDevice test double with modelName field and GetName() stub.
Update TestProcessPodCollector_EmitsMetricForSinglePod assertions.
Signed-off-by: Zachary Bennett <[email protected]>
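The label population this commit describes can be illustrated with a small sketch. The interface shape, `deviceLabels`, and `fakeNVMLDevice` are assumptions for illustration (the real `nvmlDevice` interface and test double live in the collector package); only the label keys (`gpu`, `UUID`, `device`, `modelName`) come from the commit message above.

```go
package main

import "fmt"

// nvmlDevice sketches the collector's device abstraction; GetName() is the
// method this commit adds so the modelName label can be filled in.
type nvmlDevice interface {
	GetUUID() (string, error)
	GetName() (string, error)
}

// fakeNVMLDevice is a hand-coded test double, analogous to the one updated
// in process_pod_collector_test.go.
type fakeNVMLDevice struct{ uuid, modelName string }

func (d fakeNVMLDevice) GetUUID() (string, error) { return d.uuid, nil }
func (d fakeNVMLDevice) GetName() (string, error) { return d.modelName, nil }

// deviceLabels builds the per-device label set that was previously emitted
// with empty keys and values.
func deviceLabels(idx int, dev nvmlDevice) map[string]string {
	uuid, _ := dev.GetUUID()
	name, _ := dev.GetName()
	return map[string]string{
		"gpu":       fmt.Sprintf("%d", idx),
		"UUID":      uuid,
		"device":    fmt.Sprintf("nvidia%d", idx),
		"modelName": name,
	}
}

func main() {
	l := deviceLabels(0, fakeNVMLDevice{uuid: "GPU-abc123", modelName: "NVIDIA A10G"})
	fmt.Println(l["gpu"], l["UUID"], l["device"], l["modelName"])
}
```

With these labels populated, the exported sample matches the shape shown in the commit message rather than emitting empty `gpu`/`device`/`modelName` values.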
Closes #587

## Summary

When CUDA time-slicing is active, multiple pods share a single physical GPU. Standard DCGM per-device metrics (`dcgm_fi_dev_gpu_util`) report aggregate utilization for the whole device — you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.

This PR adds an opt-in `ProcessPodCollector` that attributes GPU SM utilization to individual pods by joining:

- `nvmlDeviceGetProcessUtilization()` — per-PID SM utilization sampled directly from the CUDA driver
- the kubelet pod-resources gRPC API — maps GPU UUIDs to `(pod, namespace, container)` tuples
- `/proc/<pid>/cgroup` — links NVML PIDs back to container identities

## New metric

One gauge is emitted per `(pod, namespace, container, gpu_uuid)` tuple. The value is the NVML SM utilization percentage (0–100).

## Enabling

### Standalone DaemonSet

### With GPU Operator (v24.x+) ClusterPolicy

See the companion PR in NVIDIA/gpu-operator that wires this through `ClusterPolicy` -> NVIDIA/gpu-operator#2178

## Test plan

- Unit tests (`go test ./internal/pkg/collector/... -run TestProcessPodCollector -v`) — 10 test cases, all PASS, no GPU required
- `go build ./...` on ubuntu-latest (CI)
- `dcgm_fi_dev_sm_util_per_pod` emitted with correct `pod`/`namespace`/`container` labels when time-slicing is active