
Fabric manager Shared NVSwitch virt model support #166

Draft
mresvanis wants to merge 7 commits into NVIDIA:master from mresvanis:fabric-manager-support

Conversation

@mresvanis

Summary

This PR adds the following changes:

  • Add NVIDIA Fabric Manager integration for multi-GPU NVSwitch-based systems (e.g., DGX/HGX), enabling automatic fabric partition management during device allocation
  • Introduce CGO bindings for libnvfm and a partition manager that coordinates GPU grouping via NVLink fabric partitions
  • Refactor GetPreferredAllocation to prefer devices belonging to the same fabric partition when FM is enabled, falling back to NUMA-based selection otherwise

Related NVIDIA GPU Operator changes: NVIDIA/gpu-driver-container#538 and NVIDIA/gpu-operator#2045

Changes

This change adds optional Fabric Manager support behind the ENABLE_FABRIC_MANAGER environment variable (disabled by default). When enabled, the device plugin:

  1. Connects to the FM daemon over a Unix socket at startup
  2. Uses a PCI-to-module mapping to resolve GPU physical IDs to FM module IDs
  3. Selects preferred allocations that align with FM partition boundaries and NUMA locality
  4. Activates the appropriate fabric partition during Allocate, ensuring NVLink connectivity between allocated GPUs
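Step 2 above resolves each GPU's PCI address to a Fabric Manager module ID before any partition decision is made. A minimal sketch of that lookup, assuming a simple in-memory map keyed by lowercase PCI BDF (the PR's actual mapping file format and types are not shown here):

```go
package main

import (
	"fmt"
	"strings"
)

// pciModuleMapping is an illustrative stand-in for the structure loaded by
// LoadPCIModuleMapping: PCI bus/device/function -> FM module ID.
type pciModuleMapping map[string]uint32

// resolveModuleID normalizes the PCI address before lookup so that
// "0000:3B:00.0" and "0000:3b:00.0" resolve to the same module.
func resolveModuleID(m pciModuleMapping, pciAddr string) (uint32, bool) {
	id, ok := m[strings.ToLower(pciAddr)]
	return id, ok
}

func main() {
	mapping := pciModuleMapping{
		"0000:3b:00.0": 0,
		"0000:5e:00.0": 1,
	}
	if id, ok := resolveModuleID(mapping, "0000:3B:00.0"); ok {
		fmt.Printf("module ID: %d\n", id)
	}
}
```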

New packages:

  • pkg/nvfm -- CGO bindings for the libnvfm shared library
  • pkg/fabricmanager -- High-level FM client, partition manager, and PCI module mapping utilities
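One way to picture the layering of the two packages: pkg/nvfm wraps the raw libnvfm calls, while pkg/fabricmanager exposes a small client interface that the partition manager consumes, which also makes the manager testable with a fake. All names below (FMClient, Partition, ActivatePartition) are illustrative, not the PR's actual API:

```go
package main

import "fmt"

// Partition is a hypothetical view of an NVLink fabric partition: an ID plus
// the FM module IDs of the GPUs it connects.
type Partition struct {
	ID   uint32
	GPUs []uint32
}

// FMClient is the narrow surface the partition manager would depend on,
// backed in production by the CGO bindings in pkg/nvfm.
type FMClient interface {
	ListPartitions() ([]Partition, error)
	ActivatePartition(id uint32) error
}

// fakeClient records activations so unit tests can assert on them without
// a real Fabric Manager daemon.
type fakeClient struct{ activated []uint32 }

func (f *fakeClient) ListPartitions() ([]Partition, error) {
	return []Partition{{ID: 7, GPUs: []uint32{0, 1}}}, nil
}

func (f *fakeClient) ActivatePartition(id uint32) error {
	f.activated = append(f.activated, id)
	return nil
}

func main() {
	c := &fakeClient{}
	parts, _ := c.ListPartitions()
	_ = c.ActivatePartition(parts[0].ID)
	fmt.Println("activated partitions:", c.activated)
}
```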

Test plan

  • Unit tests added for pkg/fabricmanager (client, partition manager) and pkg/device_plugin
  • Verify device plugin starts and operates normally with ENABLE_FABRIC_MANAGER=false (default)
  • Verify FM partition activation on an NVSwitch node with ENABLE_FABRIC_MANAGER=true

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
…hen FM enabled

Extract NUMA-based device selection into a standalone preferDevicesByNUMA
method. When a partition manager is active, GetPreferredAllocation now
delegates to it for FM-aware selection with NUMA locality; otherwise it
falls back to the original NUMA-only logic. Add comprehensive tests for
the FM-aware path covering partition matching, NUMA tie-breaking, error
cases, unavailable GPUs, and must-include device ordering.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
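The selection policy this commit describes can be sketched as follows: try to satisfy the request from a single fabric partition, and fall back to NUMA-only grouping when no partition fits. The device and partition shapes here are assumptions for illustration, not the PR's types:

```go
package main

import "fmt"

// device is a stand-in for the plugin's device record.
type device struct {
	id        string
	numaNode  int
	partition int // -1 when the device is not in any FM partition
}

// preferDevices picks `size` devices, preferring a set that all belong to
// one fabric partition and falling back to NUMA grouping (the legacy path).
func preferDevices(avail []device, size int) []string {
	byPart := map[int][]string{}
	for _, d := range avail {
		if d.partition >= 0 {
			byPart[d.partition] = append(byPart[d.partition], d.id)
		}
	}
	for _, ids := range byPart {
		if len(ids) >= size {
			return ids[:size]
		}
	}
	// Fallback: NUMA-only selection, mirroring preferDevicesByNUMA.
	byNUMA := map[int][]string{}
	for _, d := range avail {
		byNUMA[d.numaNode] = append(byNUMA[d.numaNode], d.id)
	}
	for _, ids := range byNUMA {
		if len(ids) >= size {
			return ids[:size]
		}
	}
	return nil
}

func main() {
	avail := []device{{"gpu0", 0, 1}, {"gpu1", 1, 1}, {"gpu2", 0, -1}}
	fmt.Println(preferDevices(avail, 2))
}
```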
When the fabric manager is enabled, the Allocate handler now
activates partitions for the requested device IDs before
returning the allocation response, failing the request if the
connection is lost or activation errors out.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
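The fail-the-request behavior described in this commit can be sketched as a guard at the top of the Allocate handler; activation failure surfaces as an error rather than returning a response with broken NVLink connectivity. The activator interface and error strings are illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// activator is a stand-in for the partition manager's activation surface.
type activator interface {
	ActivateForDevices(ids []string) error
}

type stubActivator struct{ fail bool }

func (s stubActivator) ActivateForDevices(ids []string) error {
	if s.fail {
		return errors.New("fabric manager connection lost")
	}
	return nil
}

// allocate activates the fabric partition before building the response;
// a nil activator models the ENABLE_FABRIC_MANAGER=false path.
func allocate(a activator, deviceIDs []string) error {
	if a != nil {
		if err := a.ActivateForDevices(deviceIDs); err != nil {
			return fmt.Errorf("fabric partition activation failed: %w", err)
		}
	}
	// ... build and return the AllocateResponse here ...
	return nil
}

func main() {
	err := allocate(stubActivator{fail: true}, []string{"gpu0", "gpu1"})
	fmt.Println("allocate error:", err)
}
```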
@copy-pr-bot

copy-pr-bot bot commented Feb 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@alaypatel07

@mresvanis I am interested in reviewing this PR, once it is ready please ping me.


log.Printf("Fabric partition activated successfully for devices: %v", allDeviceIDs)
}


In Allocate, all device IDs from all ContainerRequests are aggregated and passed in one call to ActivateForDevices, and ActivateForDevices requires an exact partition-size match against that full list. FM activation is therefore done once on the union of all container requests, which can reject valid multi-container pod allocations. I am wondering whether union-based activation is over-constraining here: if a pod has multiple GPU-consuming containers (or uses request splitting), allocation can fail even though each container's assignment is valid individually.
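A toy illustration of this concern, with made-up valid partition sizes: union-based activation needs one partition matching the full union, while per-container activation only needs each container's request to match a partition:

```go
package main

import "fmt"

// Hypothetical set of valid fabric partition sizes on some topology.
var partitionSizes = map[int]bool{2: true, 4: true, 8: true}

// unionActivationOK models the PR's current behavior: one activation call
// over the union of all container requests.
func unionActivationOK(containerReqs [][]string) bool {
	union := 0
	for _, r := range containerReqs {
		union += len(r)
	}
	return partitionSizes[union]
}

// perContainerActivationOK models the alternative: activate per container.
func perContainerActivationOK(containerReqs [][]string) bool {
	for _, r := range containerReqs {
		if !partitionSizes[len(r)] {
			return false
		}
	}
	return true
}

func main() {
	// One 2-GPU container plus one 4-GPU container: union size is 6, which
	// matches no partition, yet each container matches one individually.
	reqs := [][]string{{"g0", "g1"}, {"g2", "g3", "g4", "g5"}}
	fmt.Println("union ok:", unionActivationOK(reqs))
	fmt.Println("per-container ok:", perContainerActivationOK(reqs))
}
```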

pciToModule, moduleToPCI, mapErr := fabricmanager.LoadPCIModuleMapping(pciModuleMappingPath)
if mapErr != nil {
log.Printf("WARNING: Failed to load PCI module mapping: %v", mapErr)
log.Print("Falling back to legacy device plugin mode")

This looks like an FM connection lifecycle leak on the startup fallback path: if the connect succeeds but the mapping load fails, the connection is never closed. On this mapping-failure path, fmClient is already connected but is never disconnected. Below, dpi.partitionManager is only assigned in the mapping-success branch, and Stop() only disconnects via dpi.partitionManager, so when the connect succeeds and the mapping fails, the client handle/socket stays open with no retained reference to close it.
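One possible shape of the fix, as a sketch with stand-in types (fmClient, loadMapping, and setupFabricManager are illustrative, not the PR's code): disconnect the already-open client on the mapping-failure path before falling back to legacy mode:

```go
package main

import (
	"errors"
	"fmt"
)

// fmClient stands in for the connected Fabric Manager client.
type fmClient struct{ connected bool }

func (c *fmClient) Disconnect() { c.connected = false }

// loadMapping stands in for fabricmanager.LoadPCIModuleMapping; it always
// fails here to exercise the fallback path.
func loadMapping(path string) (map[string]uint32, error) {
	return nil, errors.New("mapping file not found")
}

// setupFabricManager returns true only when FM mode fully initializes.
// The fix is the Disconnect call on the mapping-failure branch.
func setupFabricManager(c *fmClient, mappingPath string) bool {
	if _, err := loadMapping(mappingPath); err != nil {
		fmt.Printf("WARNING: Failed to load PCI module mapping: %v\n", err)
		c.Disconnect() // release the already-open FM connection before fallback
		fmt.Println("Falling back to legacy device plugin mode")
		return false
	}
	return true
}

func main() {
	c := &fmClient{connected: true} // pretend Connect() already succeeded
	setupFabricManager(c, "some-mapping-path")
	fmt.Println("connected after fallback:", c.connected)
}
```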

