Fabric manager Shared NVSwitch virt model support #166
mresvanis wants to merge 7 commits into NVIDIA:master
Conversation
…hen FM enabled

Extract NUMA-based device selection into a standalone preferDevicesByNUMA method. When a partition manager is active, GetPreferredAllocation now delegates to it for FM-aware selection with NUMA locality; otherwise it falls back to the original NUMA-only logic. Add comprehensive tests for the FM-aware path covering partition matching, NUMA tie-breaking, error cases, unavailable GPUs, and must-include device ordering.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
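The NUMA-only fallback described above can be sketched as follows. This is a simplified illustration, not the PR's implementation: the `device` type and the tie-breaking policy (prefer the most-populated NUMA node among available devices) are assumptions for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// device is a minimal stand-in for the plugin's device type; only the
// fields needed for NUMA-based selection are modeled here.
type device struct {
	ID   string
	NUMA int
}

// preferDevicesByNUMA sketches the standalone NUMA-only fallback: pick
// `size` devices, preferring those on the NUMA node with the most
// available devices (a simplified locality heuristic).
func preferDevicesByNUMA(available []device, size int) []string {
	counts := map[int]int{}
	for _, d := range available {
		counts[d.NUMA]++
	}
	sorted := append([]device(nil), available...)
	sort.SliceStable(sorted, func(i, j int) bool {
		// Devices on better-populated NUMA nodes come first.
		return counts[sorted[i].NUMA] > counts[sorted[j].NUMA]
	})
	ids := []string{}
	for _, d := range sorted {
		if len(ids) == size {
			break
		}
		ids = append(ids, d.ID)
	}
	return ids
}

func main() {
	devs := []device{{"gpu0", 0}, {"gpu1", 1}, {"gpu2", 1}}
	fmt.Println(preferDevicesByNUMA(devs, 2)) // [gpu1 gpu2]: the NUMA-1 pair wins
}
```

In the real change, this path is only reached when no partition manager is active; otherwise GetPreferredAllocation delegates to it.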
When the fabric manager is enabled, the Allocate handler now activates partitions for the requested device IDs before returning the allocation response, failing the request if the connection is lost or activation errors out.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
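The activate-before-respond flow in that commit can be sketched as below. The `partitionActivator` interface and `allocate` signature are assumptions for illustration; only the method name `ActivateForDevices` comes from the PR discussion.

```go
package main

import (
	"errors"
	"fmt"
)

// partitionActivator models the subset of the partition manager that the
// Allocate path needs; the real interface in the PR may differ.
type partitionActivator interface {
	ActivateForDevices(deviceIDs []string) error
}

// allocate sketches the FM-aware Allocate flow: activate the fabric
// partition for the requested devices first, and fail the whole request
// if activation errors out.
func allocate(pm partitionActivator, deviceIDs []string) error {
	if pm != nil {
		if err := pm.ActivateForDevices(deviceIDs); err != nil {
			return fmt.Errorf("fabric partition activation failed: %w", err)
		}
	}
	// ... build and return the normal allocation response here ...
	return nil
}

// fakePM is a test double for the sketch above.
type fakePM struct{ fail bool }

func (f fakePM) ActivateForDevices([]string) error {
	if f.fail {
		return errors.New("no matching partition")
	}
	return nil
}

func main() {
	fmt.Println(allocate(fakePM{fail: false}, []string{"gpu0", "gpu1"})) // <nil>
	fmt.Println(allocate(fakePM{fail: true}, []string{"gpu0"}) != nil)   // true
}
```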
@mresvanis I am interested in reviewing this PR; once it is ready, please ping me.
    log.Printf("Fabric partition activated successfully for devices: %v", allDeviceIDs)
}
In Allocate, all device IDs from all ContainerRequests are aggregated and passed in a single call to ActivateForDevices, which requires an exact partition-size match against that full list. Because FM activation runs once on the union of all container requests, it can reject valid multi-container pod allocations: if a pod has multiple GPU-consuming containers (or uses request splitting), allocation can fail even when each container's assignment is individually valid. Union-based activation may therefore be over-constraining for multi-container allocate requests.
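The concern can be made concrete with a toy model. In this hypothetical sketch, `fakePM` enforces the exact-size matching described in the comment (the valid partition sizes are made up), and per-container activation is shown as one possible alternative to the union:

```go
package main

import "fmt"

// fakePM models ActivateForDevices with an exact partition-size match,
// mirroring the constraint described in the review comment. The set of
// valid partition sizes is hypothetical.
type fakePM struct{ validSizes map[int]bool }

func (f fakePM) ActivateForDevices(ids []string) error {
	if !f.validSizes[len(ids)] {
		return fmt.Errorf("no partition of size %d", len(ids))
	}
	return nil
}

// activateUnion mirrors the current behavior: one call on all device IDs.
func activateUnion(pm fakePM, requests [][]string) error {
	var all []string
	for _, r := range requests {
		all = append(all, r...)
	}
	return pm.ActivateForDevices(all)
}

// activatePerContainer activates each container's devices separately.
func activatePerContainer(pm fakePM, requests [][]string) error {
	for _, r := range requests {
		if err := pm.ActivateForDevices(r); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	pm := fakePM{validSizes: map[int]bool{1: true, 2: true, 4: true}}
	// Containers requesting 2 and 1 GPUs: the union (size 3) matches no
	// partition, but each container's request individually does.
	reqs := [][]string{{"g0", "g1"}, {"g2"}}
	fmt.Println(activateUnion(pm, reqs) != nil)        // true
	fmt.Println(activatePerContainer(pm, reqs) == nil) // true
}
```

Whether per-container activation is correct for the Shared NVSwitch model depends on FM partition semantics, which is exactly the design question the comment raises.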
pciToModule, moduleToPCI, mapErr := fabricmanager.LoadPCIModuleMapping(pciModuleMappingPath)
if mapErr != nil {
    log.Printf("WARNING: Failed to load PCI module mapping: %v", mapErr)
    log.Print("Falling back to legacy device plugin mode")
This looks like an FM connection lifecycle leak on the startup fallback path: if the connection succeeds but the mapping load fails, the connection is never closed.
On this mapping-failure path, fmClient is already connected but is never disconnected.
Below, dpi.partitionManager is only assigned in the mapping-success branch, and Stop() only disconnects via dpi.partitionManager. So if connect succeeds but the mapping load fails, the client handle/socket stays open with no retained reference to close it.
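One way to close the leak is to disconnect the client on the mapping-failure path before falling back. A minimal sketch under assumed names: `fmClientHandle`, `Disconnect`, `loadMapping`, and `startup` are placeholders standing in for the PR's actual client API and startup flow.

```go
package main

import (
	"errors"
	"fmt"
)

// fmClientHandle stands in for the fabric manager client; Disconnect is a
// placeholder for however the real client releases its socket.
type fmClientHandle struct{ open bool }

func (c *fmClientHandle) Disconnect() { c.open = false }

// loadMapping stands in for fabricmanager.LoadPCIModuleMapping; it is
// forced to fail here to exercise the fallback path.
func loadMapping(path string) (map[string]int, error) {
	return nil, errors.New("mapping file not found")
}

// startup sketches the corrected fallback: if the client connected but the
// mapping load fails, disconnect before falling back to legacy mode, so no
// open handle is left without a retained reference.
func startup(path string) (*fmClientHandle, bool) {
	client := &fmClientHandle{open: true} // connect succeeded
	if _, err := loadMapping(path); err != nil {
		fmt.Printf("WARNING: Failed to load PCI module mapping: %v\n", err)
		client.Disconnect() // avoid leaking the connected client
		return nil, false   // legacy device plugin mode
	}
	return client, true
}

func main() {
	_, fmMode := startup("example-mapping-path")
	fmt.Println(fmMode) // false: fell back to legacy mode with the client closed
}
```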
Summary
This PR adds the following changes:
Related NVIDIA GPU Operator changes: NVIDIA/gpu-driver-container#538 and NVIDIA/gpu-operator#2045
Changes
This change adds optional Fabric Manager support behind the ENABLE_FABRIC_MANAGER environment variable (disabled by default). When enabled, the device plugin:
New packages:
Test plan