feat(dcgm-exporter): expose per-pod GPU util config in ClusterPolicy #2178
Open · zbennett10 wants to merge 1 commit into NVIDIA:main
Wires the dcgm-exporter per-pod GPU utilization feature
(NVIDIA/dcgm-exporter#<PR>) into the ClusterPolicy CRD so GPU Operator
users can enable it with a single field instead of manually patching
DaemonSet args.
## What changes
ClusterPolicy gets a new `spec.dcgmExporter.perPodGPUUtil` stanza:

```yaml
spec:
  dcgmExporter:
    perPodGPUUtil:
      enabled: true
      podResourcesSocketPath: /var/lib/kubelet/pod-resources/kubelet.sock
```
When enabled, the operator automatically:
- Sets DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true env var
- Mounts /var/lib/kubelet/pod-resources/ as a read-only hostPath volume
- Sets hostPID: true (required to resolve /proc/<pid>/cgroup)
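For illustration, the resulting dcgm-exporter DaemonSet pod spec would carry fields along these lines. This is a sketch assembled from the three bullets above; the exact volume and mount names the operator generates are assumptions:

```yaml
spec:
  template:
    spec:
      hostPID: true
      containers:
        - name: dcgm-exporter
          env:
            - name: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL
              value: "true"
          volumeMounts:
            - name: pod-resources            # volume name is an assumption
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
```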
## Why
Time-slicing is configured via ClusterPolicy (`spec.devicePlugin.config`), but there was no equivalent ClusterPolicy lever to restore the per-pod GPU observability that time-slicing takes away. This closes that gap.
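For context, a hedged sketch of how time-slicing is typically enabled through `spec.devicePlugin.config` (the ConfigMap name and key below are illustrative, not from this PR):

```yaml
spec:
  devicePlugin:
    config:
      name: time-slicing-config   # ConfigMap holding the sharing configuration
      default: any                # default key within that ConfigMap; illustrative
```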
See: NVIDIA/dcgm-exporter#587
## Files changed
- api/nvidia/v1/clusterpolicy_types.go — DCGMExporterPerPodGPUUtilConfig
struct, PerPodGPUUtil field on DCGMExporterSpec, helper methods, constant
- api/nvidia/v1/zz_generated.deepcopy.go — deep copy for new struct
- controllers/object_controls.go — wire perPodGPUUtil into DaemonSet spec
- docs/dcgm-exporter-per-pod-gpu-metrics.md — usage + cost model
Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
## Summary

Adds `spec.dcgmExporter.perPodGPUUtil` to the `ClusterPolicy` CRD, enabling per-pod GPU SM utilization metrics when CUDA time-slicing is active.
This is the GPU Operator half of a two-part contribution that closes
NVIDIA/dcgm-exporter#638
(issue: NVIDIA/dcgm-exporter#587).
## The problem

With GPU time-slicing, `dcgm_fi_dev_gpu_util` reports only aggregate device utilization: you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.
## The fix

dcgm-exporter PR #638 adds an opt-in collector that combines NVML per-process utilization with the kubelet pod-resources gRPC API to emit `dcgm_fi_dev_sm_util_per_pod` per `(pod, namespace, container, gpu_uuid)` tuple. This PR wires that feature through `ClusterPolicy` so users can enable it without hand-editing the dcgm-exporter DaemonSet.
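A hedged example of what one emitted series might look like (all label values below are hypothetical, not taken from the PR):

```
dcgm_fi_dev_sm_util_per_pod{pod="inference-0",namespace="ml",container="server",gpu_uuid="GPU-..."} 37
```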
## What GPU Operator does automatically when `enabled: true`

- Sets the `DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true` env var in the dcgm-exporter DaemonSet
- Mounts `/var/lib/kubelet/pod-resources/` as a read-only `hostPath` volume
- Sets `hostPID: true` on the DaemonSet so dcgm-exporter can resolve `/proc/<pid>/cgroup`

## Compatibility
## Security considerations

Enabling `perPodGPUUtil` grants dcgm-exporter:

- read-only access to `/var/lib/kubelet/pod-resources/` (lists all GPU-using pods)
- host PID namespace visibility (to resolve `/proc/<pid>/cgroup`)

These are the same permissions used by other node-level monitoring agents.