feat(dcgm-exporter): expose per-pod GPU util config in ClusterPolicy #2178
Open · zbennett10 wants to merge 1 commit into NVIDIA:main
Wires the dcgm-exporter per-pod GPU utilization feature
(NVIDIA/dcgm-exporter#<PR>) into the ClusterPolicy CRD so GPU Operator
users can enable it with a single field instead of manually patching
DaemonSet args.
## What changes
ClusterPolicy gets a new `spec.dcgmExporter.perPodGPUUtil` stanza:

```yaml
spec:
  dcgmExporter:
    perPodGPUUtil:
      enabled: true
      podResourcesSocketPath: /var/lib/kubelet/pod-resources/kubelet.sock
```
When enabled, the operator automatically:
- Sets DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true env var
- Mounts /var/lib/kubelet/pod-resources/ as a read-only hostPath volume
- Sets hostPID: true (required to resolve /proc/<pid>/cgroup)
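For illustration, the resulting dcgm-exporter DaemonSet pod spec would carry fields along these lines. This is a sketch assembled from the three bullets above; the exact volume and mount names the operator generates are assumptions:

```yaml
spec:
  template:
    spec:
      hostPID: true
      containers:
        - name: dcgm-exporter
          env:
            - name: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL
              value: "true"
          volumeMounts:
            - name: pod-resources            # volume name is an assumption
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
```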
## Why
Time-slicing is configured via ClusterPolicy (`spec.devicePlugin.config`), but there was no equivalent ClusterPolicy lever to restore the per-pod GPU observability that time-slicing takes away. This closes that gap.
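For context, a hedged sketch of how time-slicing is typically enabled through `spec.devicePlugin.config` (the ConfigMap name and key below are illustrative, not from this PR):

```yaml
spec:
  devicePlugin:
    config:
      name: time-slicing-config   # ConfigMap holding the sharing configuration
      default: any                # default key within that ConfigMap; illustrative
```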
See: NVIDIA/dcgm-exporter#587
## Files changed
- api/nvidia/v1/clusterpolicy_types.go — DCGMExporterPerPodGPUUtilConfig
struct, PerPodGPUUtil field on DCGMExporterSpec, helper methods, constant
- api/nvidia/v1/zz_generated.deepcopy.go — deep copy for new struct
- controllers/object_controls.go — wire perPodGPUUtil into DaemonSet spec
- docs/dcgm-exporter-per-pod-gpu-metrics.md — usage + cost model
Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
## Summary

Adds `spec.dcgmExporter.perPodGPUUtil` to the `ClusterPolicy` CRD, enabling per-pod GPU SM utilization metrics when CUDA time-slicing is active.
This is the GPU Operator half of a two-part contribution that closes
NVIDIA/dcgm-exporter#638
(issue: NVIDIA/dcgm-exporter#587).
## The problem

With GPU time-slicing, `dcgm_fi_dev_gpu_util` reports only aggregate device utilization: you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.
## The fix

dcgm-exporter PR #638 adds an opt-in collector that combines NVML per-process utilization with the kubelet pod-resources gRPC API to emit `dcgm_fi_dev_sm_util_per_pod` per `(pod, namespace, container, gpu_uuid)` tuple. This PR wires that feature through `ClusterPolicy` so users can enable it without hand-editing the dcgm-exporter DaemonSet.
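A hedged example of what one emitted series might look like (all label values below are hypothetical, not taken from the PR):

```
dcgm_fi_dev_sm_util_per_pod{pod="inference-0",namespace="ml",container="server",gpu_uuid="GPU-..."} 37
```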
## What GPU Operator does automatically when `enabled: true`

- Sets the `DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true` env var in the dcgm-exporter DaemonSet
- Mounts `/var/lib/kubelet/pod-resources/` as a read-only `hostPath` volume
- Sets `hostPID: true` on the DaemonSet so dcgm-exporter can resolve `/proc/<pid>/cgroup`

## Compatibility
## Security considerations

Enabling `perPodGPUUtil` grants dcgm-exporter:

- read-only access to `/var/lib/kubelet/pod-resources/` (lists all GPU-using pods)
- host PID namespace visibility (to resolve `/proc/<pid>/cgroup`)

These are the same permissions used by other node-level monitoring agents.