feat: add per-pod GPU SM utilization metrics for time-slicing workloads #638
Open

zbennett10 wants to merge 7 commits into NVIDIA:main
Conversation
Closes NVIDIA#587

## Problem

When CUDA time-slicing is active (multiple pods sharing one physical GPU), `dcgm_fi_dev_gpu_util` reports aggregate device utilization — you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.

## Solution

Add an opt-in `ProcessPodCollector` that attributes SM utilization to individual pods by joining:

1. NVML `nvmlDeviceGetProcessUtilization()` — per-PID SM utilization from the driver
2. The kubelet pod-resources gRPC API — maps GPU UUID → (pod, namespace, container)
3. `/proc/<pid>/cgroup` — links NVML PIDs back to container identities

## New metric

```
dcgm_fi_dev_sm_util_per_pod{
  gpu="0", uuid="GPU-abc123",
  pod="synapse-proxy-...", namespace="prod", container="proxy"
} 42
```

## Enabling

```
--enable-per-pod-gpu-util=true
# or: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true
```

Requires `hostPID: true` (auto-set when using the GPU Operator integration).

## Files changed

- `internal/pkg/collector/process_pod_collector.go` — new collector
- `internal/pkg/collector/process_pod_collector_test.go` — unit tests (10 cases, no GPU needed)
- `internal/pkg/collector/collector_factory.go` — register the new collector
- `internal/pkg/appconfig/types.go` — `EnablePerPodGPUUtil` flag
- `internal/pkg/counters/const.go` — `DCGM_EXP_SM_UTIL_PER_POD` counter name
- `pkg/cmd/app.go` — `--enable-per-pod-gpu-util` CLI flag
- `docs/per-pod-gpu-metrics.md` — usage documentation
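The three-way join in the solution above can be sketched in a few lines of Go. This is a minimal illustration, not the PR's actual code: `podInfo`, `containerIDFromCgroup`, and the fake maps are hypothetical names, and real cgroup parsing varies by container runtime and cgroup driver.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// podInfo is a hypothetical stand-in for the (pod, namespace, container)
// tuple derived from the kubelet pod-resources API.
type podInfo struct {
	Pod, Namespace, Container string
}

// containerIDRe extracts a 64-hex-char container ID from a cgroup path.
// Common cgroupfs/systemd layouts embed the ID this way; real parsing
// needs to handle more variants.
var containerIDRe = regexp.MustCompile(`[0-9a-f]{64}`)

// containerIDFromCgroup pulls the container ID out of one /proc/<pid>/cgroup line.
func containerIDFromCgroup(line string) (string, bool) {
	m := containerIDRe.FindString(line)
	return m, m != ""
}

func main() {
	// Fake per-PID SM utilization, shaped like nvmlDeviceGetProcessUtilization samples.
	smUtilByPID := map[int]uint32{4242: 42}

	// Fake /proc/<pid>/cgroup contents for that PID.
	cgroupByPID := map[int]string{
		4242: "0::/kubepods/burstable/pod1234/" + strings.Repeat("ab", 32),
	}

	// Fake container-ID -> pod identity mapping from the pod-resources API.
	podByContainerID := map[string]podInfo{
		strings.Repeat("ab", 32): {Pod: "synapse-proxy-0", Namespace: "prod", Container: "proxy"},
	}

	// Join: PID -> container ID -> pod identity -> labeled gauge sample.
	for pid, util := range smUtilByPID {
		id, ok := containerIDFromCgroup(cgroupByPID[pid])
		if !ok {
			continue // PID not running in a container
		}
		p := podByContainerID[id]
		fmt.Printf("dcgm_fi_dev_sm_util_per_pod{pod=%q,namespace=%q,container=%q} %d\n",
			p.Pod, p.Namespace, p.Container, util)
	}
}
```

Running the sketch prints one labeled sample, `dcgm_fi_dev_sm_util_per_pod{pod="synapse-proxy-0",namespace="prod",container="proxy"} 42`, which mirrors the metric shape shown above.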
Run TestProcessPodCollector tests in CI to validate the new process_pod_collector.go on ubuntu-latest (Linux/x86_64 — the only supported build platform for this CGo-dependent project). Also upgrade checkout/setup-go action versions and add apt gcc dep.
…atch

Three compilation errors:

1. Remove the stdlib `os` import from `collector_factory.go` — the collector package already has `var os osinterface.OS` in `variables.go` for testable `os.Exit` calls; the stdlib `os` import conflicts with it. (My previous session incorrectly added the stdlib import.)
2. Alias stdlib `os` to `stdos` in `process_pod_collector.go` — `ReadFile` is the only stdlib `os` usage; the package-level `os` variable must not be shadowed.
3. Add `...grpc.CallOption` to `podResourcesClient.List()` — kubelet's `PodResourcesListerClient.List()` includes a variadic `CallOption`; our interface must match to satisfy the type.
… CI tests

- Remove `go:generate mockgen` directives for `nvmlDevice`/`nvmlLib`/`podResourcesClient` — mockgen cannot generate mocks for unexported interfaces from external packages, causing compilation errors. Tests use hand-coded fakes in `process_pod_collector_test.go` instead.
- Scope the 'unit-tests' CI job to `TestProcessPodCollector` only, excluding packages that require `libdcgm.so.4` / `libnvidia-ml` (`integration_test`, `nvmlprovider`, `server`), which are pre-existing DCGM-only failures unrelated to this PR.

Signed-off-by: Zachary Bennett <[email protected]>
`DCGM_EXP_SM_UTIL_PER_POD` is a synthetic NVML-driven metric and does not correspond to a real DCGM field, so it cannot appear in the stock `dcp-metrics-included.csv`. Previously `NewProcessPodCollector` would fail with:

    collector 'DCGM_EXP_SM_UTIL_PER_POD' cannot be initialized; err: counter not found in counter list

Fix: define the counter inline with sensible defaults. Allow the user to override via their metrics CSV if they want custom help text.

Signed-off-by: Zachary Bennett <[email protected]>
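The "inline default with CSV override" pattern from this commit can be sketched as follows. All names here (`counter`, `resolveCounter`, `defaultPerPodCounter`, the field layout) are hypothetical stand-ins for the exporter's real counter-list types, chosen only to show the fallback logic.

```go
package main

import "fmt"

// counter loosely mirrors one entry of the exporter's counter list.
type counter struct {
	FieldName, PromType, Help string
}

// defaultPerPodCounter is a built-in definition for the synthetic
// DCGM_EXP_SM_UTIL_PER_POD metric, used when the user's CSV omits it.
var defaultPerPodCounter = counter{
	FieldName: "DCGM_EXP_SM_UTIL_PER_POD",
	PromType:  "gauge",
	Help:      "Per-pod SM utilization derived from NVML process samples.",
}

// resolveCounter returns the user's CSV entry when present (e.g. to carry
// custom help text), else the inline default, so the collector can always
// be initialized.
func resolveCounter(csv map[string]counter) counter {
	if c, ok := csv[defaultPerPodCounter.FieldName]; ok {
		return c
	}
	return defaultPerPodCounter
}

func main() {
	// No CSV entry at all: fall back to the built-in default.
	fmt.Println(resolveCounter(nil).Help)

	// A CSV entry for the counter wins over the default.
	csv := map[string]counter{"DCGM_EXP_SM_UTIL_PER_POD": {Help: "custom help"}}
	fmt.Println(resolveCounter(csv).Help)
}
```

The key design choice is that a missing CSV entry is no longer an initialization error for this one synthetic counter, while every user-supplied definition still takes precedence.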
Add GetName() to nvmlDevice interface and populate GPU, UUID (label key),
GPUDevice, and GPUModelName fields on the ProcessPodCollector metric.
Previously the Prometheus output had empty label keys (="GPU-UUID...")
and empty gpu/device/modelName fields. Now emits:
DCGM_EXP_SM_UTIL_PER_POD{gpu="0",UUID="GPU-...",device="nvidia0",
modelName="NVIDIA A10G",...,pod="...",namespace="...",container="..."} 42
Update fakeNVMLDevice test double with modelName field and GetName() stub.
Update TestProcessPodCollector_EmitsMetricForSinglePod assertions.
Signed-off-by: Zachary Bennett <[email protected]>
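The label population this commit describes can be illustrated with a small sketch. The interface shape, `deviceLabels`, and `fakeNVMLDevice` are assumptions for illustration (the real `nvmlDevice` interface and test double live in the collector package); only the label keys (`gpu`, `UUID`, `device`, `modelName`) come from the commit message above.

```go
package main

import "fmt"

// nvmlDevice sketches the collector's device abstraction; GetName() is the
// method this commit adds so the modelName label can be filled in.
type nvmlDevice interface {
	GetUUID() (string, error)
	GetName() (string, error)
}

// fakeNVMLDevice is a hand-coded test double, analogous to the one updated
// in process_pod_collector_test.go.
type fakeNVMLDevice struct{ uuid, modelName string }

func (d fakeNVMLDevice) GetUUID() (string, error) { return d.uuid, nil }
func (d fakeNVMLDevice) GetName() (string, error) { return d.modelName, nil }

// deviceLabels builds the per-device label set that was previously emitted
// with empty keys and values.
func deviceLabels(idx int, dev nvmlDevice) map[string]string {
	uuid, _ := dev.GetUUID()
	name, _ := dev.GetName()
	return map[string]string{
		"gpu":       fmt.Sprintf("%d", idx),
		"UUID":      uuid,
		"device":    fmt.Sprintf("nvidia%d", idx),
		"modelName": name,
	}
}

func main() {
	l := deviceLabels(0, fakeNVMLDevice{uuid: "GPU-abc123", modelName: "NVIDIA A10G"})
	fmt.Println(l["gpu"], l["UUID"], l["device"], l["modelName"])
}
```

With these labels populated, the exported sample matches the shape shown in the commit message rather than emitting empty `gpu`/`device`/`modelName` values.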
Closes #587

## Summary

When CUDA time-slicing is active, multiple pods share a single physical GPU. Standard DCGM per-device metrics (`dcgm_fi_dev_gpu_util`) report aggregate utilization for the whole device — you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.

This PR adds an opt-in `ProcessPodCollector` that attributes GPU SM utilization to individual pods by joining:

- `nvmlDeviceGetProcessUtilization()` — per-PID SM utilization sampled directly from the CUDA driver
- the kubelet pod-resources gRPC API — maps GPU UUIDs to `(pod, namespace, container)` tuples
- `/proc/<pid>/cgroup` — links NVML PIDs back to container identities

## New metric

One gauge is emitted per `(pod, namespace, container, gpu_uuid)` tuple. The value is the NVML SM utilization percentage (0–100).

## Enabling

### Standalone DaemonSet

### With GPU Operator (v24.x+) ClusterPolicy

See the companion PR in NVIDIA/gpu-operator that wires this through `ClusterPolicy` -> NVIDIA/gpu-operator#2178

## Test plan

- Unit tests (`go test ./internal/pkg/collector/... -run TestProcessPodCollector -v`) — 10 test cases, all PASS, no GPU required
- `go build ./...` on ubuntu-latest (CI)
- `dcgm_fi_dev_sm_util_per_pod` emitted with correct `pod`/`namespace`/`container` labels when time-slicing is active