Commit 40b0d16

[dcgm][dcgm-exporter] add liveness and readiness probes
This commit adds liveness and readiness probes to the dcgm and dcgm-exporter operands. Adding probes to the DCGM pods ensures that these pods aren't marked as "Ready" until DCGM is actually ready to serve traffic. The DCGM-Exporter probes have been taken from the default probes configured in the Helm chart of the NVIDIA/dcgm-exporter project.

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
1 parent 434b9c4 commit 40b0d16
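
To make the two probe types in this commit concrete, here is a minimal Python sketch of what each check amounts to. It is an approximation for illustration, not the kubelet's implementation: a `tcpSocket` probe passes when a TCP connection can be opened, and an `httpGet` probe passes when the response status is at least 200 and below 400. The function names `tcp_probe` and `http_probe` are hypothetical, and the 1-second timeout mirrors the Kubernetes default for `timeoutSeconds`.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Approximate a tcpSocket probe: success means the TCP connect completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout all count as failure.
        return False


def http_probe(host: str, port: int, path: str, timeout: float = 1.0) -> bool:
    """Approximate an httpGet probe: success means 200 <= status < 400."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses.
        return False
```

Under these assumptions, the dcgm-exporter probe corresponds to `http_probe(pod_ip, 9400, "/health")` and the dcgm probe to `tcp_probe(pod_ip, 5555)`, where `pod_ip` stands for the address of the pod being checked.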

2 files changed (+19 -0 lines changed)

assets/state-dcgm-exporter/0800_daemonset.yaml

Lines changed: 11 additions & 0 deletions
@@ -52,6 +52,17 @@ spec:
         ports:
         - name: "metrics"
           containerPort: 9400
+        livenessProbe:
+          httpGet:
+            port: 9400
+            path: /health
+          initialDelaySeconds: 45
+          periodSeconds: 5
+        readinessProbe:
+          httpGet:
+            port: 9400
+            path: /health
+          initialDelaySeconds: 45
         volumeMounts:
         - name: "pod-gpu-resources"
           readOnly: true

assets/state-dcgm/0400_dcgm.yml

Lines changed: 8 additions & 0 deletions
@@ -43,6 +43,14 @@ spec:
         ports:
         - name: "dcgm"
           containerPort: 5555
+        livenessProbe:
+          tcpSocket:
+            port: 5555
+          initialDelaySeconds: 15
+        readinessProbe:
+          tcpSocket:
+            port: 5555
+          initialDelaySeconds: 15
       volumes:
       - name: run-nvidia
         hostPath:
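
The probe settings in both files leave several fields unset, so the Kubernetes defaults apply (`periodSeconds: 10`, `failureThreshold: 3`, `successThreshold: 1`). The sketch below works out rough lower bounds on when a pod could first become Ready, and when a failing liveness probe could first trigger a restart. The `Probe` class and its method names are hypothetical, and the arithmetic ignores kubelet scheduling jitter and probe timeouts, so real timings will be somewhat later.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    initial_delay_seconds: int = 0   # Kubernetes default
    period_seconds: int = 10         # Kubernetes default
    failure_threshold: int = 3       # Kubernetes default
    success_threshold: int = 1       # Kubernetes default

    def earliest_ready(self) -> int:
        """Lower bound (seconds after container start) for readiness to pass."""
        return self.initial_delay_seconds + (self.success_threshold - 1) * self.period_seconds

    def earliest_restart(self) -> int:
        """Lower bound for a liveness-triggered restart if every probe fails."""
        return self.initial_delay_seconds + (self.failure_threshold - 1) * self.period_seconds


# Values taken from the diffs above; unset fields fall back to the defaults.
exporter_liveness = Probe(initial_delay_seconds=45, period_seconds=5)
exporter_readiness = Probe(initial_delay_seconds=45)
dcgm_probe = Probe(initial_delay_seconds=15)  # same settings for liveness and readiness

print(exporter_readiness.earliest_ready())   # 45
print(exporter_liveness.earliest_restart())  # 55
print(dcgm_probe.earliest_ready())           # 15
print(dcgm_probe.earliest_restart())         # 35
```

Read this way, a dcgm-exporter pod cannot report Ready before about 45 seconds after its container starts, while a dcgm pod can do so after about 15 seconds.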
