
feat(dcgm-exporter): expose per-pod GPU util config in ClusterPolicy#2178

Open
zbennett10 wants to merge 1 commit into NVIDIA:main from zbennett10:feat/dcgm-exporter-per-pod-gpu-util

Conversation


@zbennett10 zbennett10 commented Mar 2, 2026

Summary

Adds spec.dcgmExporter.perPodGPUUtil to the ClusterPolicy CRD, enabling
per-pod GPU SM utilization metrics when CUDA time-slicing is active.

This is the GPU Operator half of a two-part contribution that closes
NVIDIA/dcgm-exporter#638
(issue: NVIDIA/dcgm-exporter#587).

The problem

With GPU time-slicing, dcgm_fi_dev_gpu_util reports only aggregate device
utilization; you cannot tell how much of the GPU each pod (proxy, embeddings,
inference) is consuming.

The fix

dcgm-exporter PR #638 adds an opt-in collector that uses NVML per-process
utilization + kubelet pod-resources gRPC to emit dcgm_fi_dev_sm_util_per_pod
per (pod, namespace, container, gpu_uuid) tuple.
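Once enabled, the new series can be consumed like any other dcgm-exporter metric. As a sketch, a Prometheus recording rule that aggregates it per pod might look like the following (the rule and group names are illustrative; only the metric name and labels come from this PR):

```yaml
# Illustrative recording rule; rule/group names are hypothetical,
# the metric and its (namespace, pod) labels are from this PR.
groups:
  - name: gpu-per-pod-util
    rules:
      - record: namespace_pod:gpu_sm_util:sum
        expr: sum by (namespace, pod) (dcgm_fi_dev_sm_util_per_pod)
```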

This PR wires that feature through ClusterPolicy so users can enable it
without hand-editing the dcgm-exporter DaemonSet:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  dcgmExporter:
    perPodGPUUtil:
      enabled: true
      # podResourcesSocketPath defaults to /var/lib/kubelet/pod-resources/kubelet.sock

What GPU Operator does automatically when enabled: true

  1. Sets DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true env var in the dcgm-exporter DaemonSet
  2. Mounts /var/lib/kubelet/pod-resources/ as a read-only hostPath volume
  3. Sets hostPID: true on the DaemonSet so dcgm-exporter can resolve /proc/<pid>/cgroup
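For reference, a sketch of the DaemonSet fields the operator would render for the three steps above (the volume name is hypothetical; the env var, paths, and hostPID setting are from this PR):

```yaml
# Sketch of the rendered dcgm-exporter DaemonSet changes.
spec:
  template:
    spec:
      hostPID: true
      containers:
        - name: dcgm-exporter
          env:
            - name: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL
              value: "true"
          volumeMounts:
            - name: pod-resources   # hypothetical volume name
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
```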

Compatibility

| GPU Operator | dcgm-exporter | Feature available |
| --- | --- | --- |
| < v24.x | any | No |
| ≥ v24.x | < v3.4.0 | Field accepted but no-op |
| ≥ v24.x | ≥ v3.4.0 | Yes |

Security considerations

Enabling perPodGPUUtil grants dcgm-exporter:

  • Read access to /var/lib/kubelet/pod-resources/ (lists all GPU-using pods)
  • Host PID namespace access (to read /proc/<pid>/cgroup)

These are the same permissions used by other node-level monitoring agents.

@copy-pr-bot

copy-pr-bot bot commented Mar 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@zbennett10 zbennett10 marked this pull request as draft March 2, 2026 12:46
@zbennett10 zbennett10 marked this pull request as ready for review March 2, 2026 14:46
…licing

Wires the dcgm-exporter per-pod GPU utilization feature
(NVIDIA/dcgm-exporter#638) into the ClusterPolicy CRD so GPU Operator
users can enable it with a single field instead of manually patching
DaemonSet args.

## What changes

ClusterPolicy gets a new `spec.dcgmExporter.perPodGPUUtil` stanza:

    spec:
      dcgmExporter:
        perPodGPUUtil:
          enabled: true
          podResourcesSocketPath: /var/lib/kubelet/pod-resources/kubelet.sock

When enabled, the operator automatically:
- Sets DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true env var
- Mounts /var/lib/kubelet/pod-resources/ as a read-only hostPath volume
- Sets hostPID: true (required to resolve /proc/<pid>/cgroup)

## Why

Time-slicing is configured via ClusterPolicy (spec.devicePlugin.config)
but the resulting loss of per-pod GPU observability had no equivalent
ClusterPolicy lever to restore it. This closes that gap.

See: NVIDIA/dcgm-exporter#587

## Files changed

- api/nvidia/v1/clusterpolicy_types.go — DCGMExporterPerPodGPUUtilConfig
  struct, PerPodGPUUtil field on DCGMExporterSpec, helper methods, constant
- api/nvidia/v1/zz_generated.deepcopy.go — deep copy for new struct
- controllers/object_controls.go — wire perPodGPUUtil into DaemonSet spec
- docs/dcgm-exporter-per-pod-gpu-metrics.md — usage + cost model
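A rough sketch of the new type in clusterpolicy_types.go, following the optional-pointer-field pattern common in the GPU Operator API (field and method names here are assumptions based on this description, not the merged code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DCGMExporterPerPodGPUUtilConfig sketches the new ClusterPolicy stanza.
// Names are assumed from the PR description, not copied from the repo.
type DCGMExporterPerPodGPUUtilConfig struct {
	// Enabled turns on the per-pod GPU SM utilization collector.
	Enabled *bool `json:"enabled,omitempty"`
	// PodResourcesSocketPath overrides the kubelet pod-resources socket.
	PodResourcesSocketPath string `json:"podResourcesSocketPath,omitempty"`
}

// IsEnabled treats a nil config or nil/false field as disabled.
func (c *DCGMExporterPerPodGPUUtilConfig) IsEnabled() bool {
	return c != nil && c.Enabled != nil && *c.Enabled
}

// SocketPath falls back to the documented default when unset.
func (c *DCGMExporterPerPodGPUUtilConfig) SocketPath() string {
	if c == nil || c.PodResourcesSocketPath == "" {
		return "/var/lib/kubelet/pod-resources/kubelet.sock"
	}
	return c.PodResourcesSocketPath
}

func main() {
	enabled := true
	cfg := &DCGMExporterPerPodGPUUtilConfig{Enabled: &enabled}
	out, _ := json.Marshal(cfg)
	fmt.Println(string(out))
	fmt.Println(cfg.IsEnabled(), cfg.SocketPath())
}
```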

Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
