[feat]: validate additional enabled drivers #2014
Merged
Conversation
Contributor (Author): /ok to test 4457fca
cdesiniotis reviewed Jan 8, 2026
cdesiniotis reviewed Jan 8, 2026
Contributor (Author): /ok to test 1fefa07
cdesiniotis reviewed Jan 13, 2026
cdesiniotis (Contributor) left a comment: One minor comment, but otherwise lgtm!
Force-pushed from 7b28c83 to 0456f2b (Compare)
Contributor (Author): /ok to test 0456f2b
cdesiniotis approved these changes Jan 13, 2026
Contributor: @tariq1890 requesting your review on this.
tariq1890 reviewed Jan 13, 2026
tariq1890 reviewed Jan 13, 2026
tariq1890 reviewed Jan 13, 2026
Contributor: Thanks for the detailed description @rahulait! Can we also add test cases to ensure the overall coverage doesn't drop?
Force-pushed from f851b75 to fb3d97e (Compare)
Changes include:
* Store the additional enabled drivers on the nodes themselves, so that the container toolkit and validation pods can check which drivers are enabled on each node.
* Remove nvidia-fs and gdrcopy from driver validation; fix tests.

Signed-off-by: Rahul Sharma <[email protected]>
Force-pushed from fb3d97e to b430ca0 (Compare)
Contributor (Author): /ok to test b430ca0
tariq1890 approved these changes Jan 14, 2026
cdesiniotis approved these changes Jan 15, 2026
Dependencies
Depends on: NVIDIA/k8s-device-plugin#1550
Description
Problem
GPU Operator supports deploying multiple driver versions within a single Kubernetes cluster through the use of multiple NvidiaDriver custom resources (CRs). However, despite supporting multiple driver instances, the GPU Operator currently deploys only a single, cluster-wide NVIDIA Container Toolkit DaemonSet and a single NVIDIA Device Plugin DaemonSet.
This architecture introduces a limitation when different NvidiaDriver CRs enable different driver-dependent features, such as GPUDirect Storage (GDS), GDRCopy, or other optional components. Because the Container Toolkit and Device Plugin are deployed once per cluster and configured uniformly, they cannot be tailored to account for feature differences across driver instances. As a result, nodes running drivers with differing enabled features cannot be supported correctly or independently.
Proposed solution
During reconciliation in the GPU Operator, we will inject additional driver-enablement environment variables into the nvidia-driver container based on the ClusterPolicy or NvidiaDriver CR selected for the node. The driver container will then persist these variables to the host filesystem on which it runs.
With this mechanism, each node will record a node-local view of enabled additional drivers, accurately reflecting the features configured for that node via its ClusterPolicy or NvidiaDriver CR.
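As a rough illustration of the reconciliation step described above, the sketch below derives driver-enablement environment variables from a CR's feature flags. The struct fields, variable names, and values are illustrative assumptions, not the operator's actual types or env var names.

```go
package main

import "fmt"

// driverSpec is a hypothetical stand-in for the feature flags carried by a
// ClusterPolicy or NvidiaDriver CR selected for a node.
type driverSpec struct {
	GDSEnabled     bool
	GDRCopyEnabled bool
}

// additionalDriverEnvs returns the env vars that would be injected into the
// nvidia-driver container; the driver container would then persist these to
// the host filesystem so node-local components can read them.
func additionalDriverEnvs(s driverSpec) map[string]string {
	envs := map[string]string{}
	if s.GDSEnabled {
		envs["GDS_ENABLED"] = "true" // assumed name, for illustration only
	}
	if s.GDRCopyEnabled {
		envs["GDRCOPY_ENABLED"] = "true" // assumed name, for illustration only
	}
	return envs
}

func main() {
	fmt.Println(additionalDriverEnvs(driverSpec{GDSEnabled: true}))
}
```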
We are also updating the GPU Operator's driver validation logic so that it waits for all enabled drivers to be installed before proceeding.
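A minimal sketch of that wait condition, assuming the validator has a node-local list of enabled drivers and a way to check which kernel modules are loaded (both represented here as plain Go values rather than real host lookups):

```go
package main

import "fmt"

// allDriversReady reports whether every driver enabled on the node is
// installed. "enabled" stands in for the node-local record written by the
// driver container; "loaded" stands in for a check of loaded kernel modules.
func allDriversReady(enabled []string, loaded map[string]bool) bool {
	for _, d := range enabled {
		if !loaded[d] {
			return false // validation keeps waiting until every driver is up
		}
	}
	return true
}

func main() {
	enabled := []string{"nvidia", "nvidia_fs"}
	loaded := map[string]bool{"nvidia": true} // nvidia_fs not yet installed
	fmt.Println(allDriversReady(enabled, loaded))
}
```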
The NVIDIA device plugin is already resilient to missing devices or drivers and does not crash if a particular device is not present on a node. We are now updating the device plugin to always attempt discovery for all supported devices and driver features.
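The discovery behavior described above can be sketched as follows: probe every supported feature, and on a node where a driver is absent, log and skip that feature rather than failing. The feature names and `discover` helper are hypothetical stand-ins for the plugin's actual probes.

```go
package main

import (
	"errors"
	"fmt"
)

// discover simulates a per-feature device probe; a missing driver yields an
// error instead of a crash (illustrative only, not the plugin's real API).
func discover(feature string, present map[string]bool) error {
	if !present[feature] {
		return errors.New(feature + ": driver not present on this node")
	}
	return nil
}

func main() {
	present := map[string]bool{"gpu": true} // e.g. gdrcopy driver absent here
	for _, f := range []string{"gpu", "gdrcopy"} {
		if err := discover(f, present); err != nil {
			fmt.Printf("skipping %s: %v\n", f, err) // log and continue
			continue
		}
		fmt.Printf("advertising %s devices\n", f)
	}
}
```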
Checklist
* make lint
* make validate-generated-assets
* make validate-modules

Testing
* make coverage

Test details:
Manual testing was done to validate the changes.
To test with ClusterPolicy, the following values.yaml was used:
Pods after install:
Testing with the NvidiaDriver CR:
values.yaml file:
The NvidiaDriver CRD was installed using:
Status after install:
CDI was enabled and disabled in both tests to confirm the changes work with and without CDI.