[feat]: validate additional enabled drivers #2014
Merged
Conversation
Contributor (Author): /ok to test 4457fca
cdesiniotis reviewed Jan 8, 2026
cdesiniotis reviewed Jan 8, 2026
Contributor (Author): /ok to test 1fefa07
cdesiniotis reviewed Jan 13, 2026
cdesiniotis (Contributor) left a comment: One minor comment, but otherwise lgtm!
Force-pushed from 7b28c83 to 0456f2b (Compare)
Contributor (Author): /ok to test 0456f2b
cdesiniotis approved these changes Jan 13, 2026
Contributor: @tariq1890 requesting your review on this.
tariq1890 reviewed Jan 13, 2026
tariq1890 reviewed Jan 13, 2026
tariq1890 reviewed Jan 13, 2026
Contributor: Thanks for the detailed description @rahulait! Can we also add test cases to ensure the overall coverage doesn't drop?
Force-pushed from f851b75 to fb3d97e (Compare)
Changes include:
* Store the additional enabled drivers on the nodes themselves, so that the container toolkit and validation pods can check which drivers are enabled on each node.
* Remove nvidia-fs and gdrcopy from driver validation; fix tests.

Signed-off-by: Rahul Sharma <[email protected]>
Force-pushed from fb3d97e to b430ca0 (Compare)
Contributor (Author): /ok to test b430ca0
tariq1890 approved these changes Jan 14, 2026
cdesiniotis approved these changes Jan 15, 2026
Dependencies
Depends on: NVIDIA/k8s-device-plugin#1550
Description
Problem
GPU Operator supports deploying multiple driver versions within a single Kubernetes cluster through the use of multiple NvidiaDriver custom resources (CRs). However, despite supporting multiple driver instances, the GPU Operator currently deploys only a single, cluster-wide NVIDIA Container Toolkit DaemonSet and a single NVIDIA Device Plugin DaemonSet.
This architecture introduces a limitation when different NvidiaDriver CRs enable different driver-dependent features, such as GPUDirect Storage (GDS), GDRCopy, or other optional components. Because the Container Toolkit and Device Plugin are deployed once per cluster and configured uniformly, they cannot be tailored to account for feature differences across driver instances. As a result, nodes running drivers with differing enabled features cannot be supported correctly or independently.
Proposed solution
During reconciliation in the GPU Operator, we will inject additional driver-enablement environment variables into the nvidia-driver container based on the ClusterPolicy or NvidiaDriver CR selected for the node. The driver container will then persist these variables to the host filesystem on which it runs.
With this mechanism, each node will record a node-local view of enabled additional drivers, accurately reflecting the features configured for that node via its ClusterPolicy or NvidiaDriver CR.
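As a rough illustration of the reconciliation step described above, the sketch below derives driver-enablement environment variables from a CR's feature flags. The struct fields, variable names, and values are illustrative assumptions, not the operator's actual types or env var names.

```go
package main

import "fmt"

// driverSpec is a hypothetical stand-in for the feature flags carried by a
// ClusterPolicy or NvidiaDriver CR selected for a node.
type driverSpec struct {
	GDSEnabled     bool
	GDRCopyEnabled bool
}

// additionalDriverEnvs returns the env vars that would be injected into the
// nvidia-driver container; the driver container would then persist these to
// the host filesystem so node-local components can read them.
func additionalDriverEnvs(s driverSpec) map[string]string {
	envs := map[string]string{}
	if s.GDSEnabled {
		envs["GDS_ENABLED"] = "true" // assumed name, for illustration only
	}
	if s.GDRCopyEnabled {
		envs["GDRCOPY_ENABLED"] = "true" // assumed name, for illustration only
	}
	return envs
}

func main() {
	fmt.Println(additionalDriverEnvs(driverSpec{GDSEnabled: true}))
}
```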
We are also updating the GPU Operator's driver validation logic so that it waits for all enabled drivers to be installed before proceeding.
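A minimal sketch of that wait condition, assuming the validator has a node-local list of enabled drivers and a way to check which kernel modules are loaded (both represented here as plain Go values rather than real host lookups):

```go
package main

import "fmt"

// allDriversReady reports whether every driver enabled on the node is
// installed. "enabled" stands in for the node-local record written by the
// driver container; "loaded" stands in for a check of loaded kernel modules.
func allDriversReady(enabled []string, loaded map[string]bool) bool {
	for _, d := range enabled {
		if !loaded[d] {
			return false // validation keeps waiting until every driver is up
		}
	}
	return true
}

func main() {
	enabled := []string{"nvidia", "nvidia_fs"}
	loaded := map[string]bool{"nvidia": true} // nvidia_fs not yet installed
	fmt.Println(allDriversReady(enabled, loaded))
}
```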
The NVIDIA device plugin is already resilient to missing devices or drivers and does not crash if a particular device is not present on a node. We are now updating the device plugin to always attempt discovery for all supported devices and driver features.
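The discovery behavior described above can be sketched as follows: probe every supported feature, and on a node where a driver is absent, log and skip that feature rather than failing. The feature names and `discover` helper are hypothetical stand-ins for the plugin's actual probes.

```go
package main

import (
	"errors"
	"fmt"
)

// discover simulates a per-feature device probe; a missing driver yields an
// error instead of a crash (illustrative only, not the plugin's real API).
func discover(feature string, present map[string]bool) error {
	if !present[feature] {
		return errors.New(feature + ": driver not present on this node")
	}
	return nil
}

func main() {
	present := map[string]bool{"gpu": true} // e.g. gdrcopy driver absent here
	for _, f := range []string{"gpu", "gdrcopy"} {
		if err := discover(f, present); err != nil {
			fmt.Printf("skipping %s: %v\n", f, err) // log and continue
			continue
		}
		fmt.Printf("advertising %s devices\n", f)
	}
}
```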
Checklist
* make lint
* make validate-generated-assets
* make validate-modules

Testing
* make coverage

Test details:
Manual testing was done to validate the changes.
To test with ClusterPolicy, the following values.yaml was used:
Pods after install:
Testing with the NvidiaDriver CR:
values.yaml file:
The NvidiaDriver CRD was installed using:
Status after install:
CDI was enabled and disabled in both tests to confirm the changes work with and without CDI.