numaaware: add GPU NUMA topology awareness to scheduler by pmady · Pull Request #5095 · volcano-sh/volcano

pmady · 2026-03-11T01:11:04Z

What type of PR is this?

/kind feature

What this PR does / why we need it

This PR extends the numaaware scheduler plugin to support GPU NUMA topology awareness. Currently, the plugin only considers CPU NUMA topology when scheduling workloads. For GPU workloads (especially multi-GPU training and LLM inference), cross-NUMA GPU placement causes measurable performance degradation due to NVLink and PCIe traffic crossing NUMA boundaries.

Changes

1. API: Add GPU topology types (pkg/scheduler/api/numa_info.go)

Add GPUInfo and GPUDetails types for GPU-to-NUMA node mapping
Add NUMANodes() and GPUsInNUMANodes() helper methods on GPUDetails
Add GPUDetail field to NumatopoInfo struct
Update DeepCopy to include GPU topology data

2. GPU HintProvider (pkg/scheduler/plugins/numaaware/provider/gpumanager/)

Implement policy.HintProvider interface for nvidia.com/gpu resources
Generate topology hints based on GPU-to-NUMA node affinity
Prefer allocations from the minimum number of NUMA nodes
Allocate GPUs from preferred NUMA nodes first, then spill over
Follows the same pattern as the existing cpumanager provider

3. Plugin registration and scoring (pkg/scheduler/plugins/numaaware/numaaware.go)

Register gpumanager.NewProvider() alongside existing cpumanager.NewProvider()
Update getNodeNumaNumForTask scoring to include GPU NUMA node count
Add getNumaNodeCntForGPUID helper function

Example scenario (from #4998)

Node A: 8 GPUs, GPUs 0-3 on NUMA 0, GPUs 4-7 on NUMA 1, available: GPUs 0-2 on NUMA 0 + GPU 7 on NUMA 1
Node B: 8 GPUs, GPUs 0-3 on NUMA 0, GPUs 4-7 on NUMA 1, available: GPUs 0-3 all on NUMA 0

For a task requesting 4 GPUs:

Without this PR: scheduler might pick Node A, GPUs span 2 NUMA nodes
With this PR: scheduler prefers Node B, all 4 GPUs from single NUMA node

Which issue(s) this PR fixes

Fixes #4998

Does this PR introduce a user-facing change?

Add GPU NUMA topology awareness to the numaaware scheduler plugin. The scheduler now considers GPU-to-NUMA node affinity when scheduling GPU workloads, preferring placements that minimize cross-NUMA GPU allocation for better performance.

How Has This Been Tested?

Unit tests for GPU HintProvider: topology hints and allocation (10 test cases)
Verified all existing cpumanager and policy tests still pass
go build ./pkg/scheduler/... passes cleanly

Add GPUInfo, GPUDetails types and helper methods (NUMANodes, GPUsInNUMANodes) to support GPU-to-NUMA node mapping in the scheduler API. Extend NumatopoInfo struct with GPUDetail field and update DeepCopy to include GPU topology data. This enables the numaaware plugin to consider GPU PCIe affinity when making scheduling decisions for GPU workloads. Signed-off-by: pmady <pavan4devops@gmail.com>

Add gpumanager package that implements the policy.HintProvider interface for nvidia.com/gpu resources. The provider: - Generates topology hints based on GPU-to-NUMA node affinity - Prefers allocations from the minimum number of NUMA nodes - Allocates GPUs from preferred NUMA nodes first, then spills over - Follows the same pattern as the existing cpumanager provider Signed-off-by: pmady <pavan4devops@gmail.com>

Register the gpumanager HintProvider alongside the existing cpumanager provider so that GPU NUMA topology is considered during predicate and allocation phases. Update getNodeNumaNumForTask to include GPU NUMA node count in the scoring function. Nodes where both CPU and GPU assignments span fewer NUMA nodes receive higher scores, improving locality for mixed CPU+GPU workloads. Add getNumaNodeCntForGPUID helper to count NUMA nodes used by GPU assignments. Signed-off-by: pmady <pavan4devops@gmail.com>

Test coverage includes: - 4 GPUs from single NUMA node (both NUMA 0 and NUMA 1) - 4 GPUs when only one NUMA node has enough available - 5 GPUs requiring cross-NUMA allocation - 8 GPUs using all available across both NUMA nodes - 9 GPUs requested (impossible, no hints returned) - Container with no GPU request (nil hints) - Allocation from preferred NUMA node - Allocation with partial NUMA availability All tests follow existing table-driven test pattern from cpumanager. Signed-off-by: pmady <pavan4devops@gmail.com>

volcano-sh-bot · 2026-03-11T01:11:16Z

Welcome @pmady! It looks like this is your first PR to volcano-sh/volcano 🎉

gemini-code-assist · 2026-03-11T01:11:23Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the numaaware scheduler plugin by integrating GPU NUMA topology awareness. Previously, the plugin only considered CPU NUMA topology, which could lead to suboptimal GPU placements for performance-critical workloads. By incorporating GPU-to-NUMA node affinity into the scheduling decisions, the system can now intelligently place GPU-intensive tasks on nodes that minimize cross-NUMA communication, thereby improving performance for multi-GPU training and LLM inference.

Highlights

GPU Topology API Extension: New API types (GPUInfo, GPUDetails) and helper methods were introduced to represent GPU-to-NUMA node topology, enhancing the scheduler's understanding of hardware layout.
GPU HintProvider Implementation: A dedicated GPU HintProvider (gpumanager) was implemented to generate topology hints and allocate GPUs based on NUMA affinity, prioritizing placements that minimize cross-NUMA communication.
Scheduler Scoring Update: The numaaware scheduler plugin was updated to register the new GPU HintProvider and modify its scoring logic to factor in GPU NUMA node count alongside CPU NUMA count for more optimal task placement.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

pkg/scheduler/api/numa_info.go
- Added GPUInfo and GPUDetails types for GPU topology information.
- Added NUMANodes() and GPUsInNUMANodes() helper methods to GPUDetails.
- Included a GPUDetail field in the NumatopoInfo struct.
- Updated the DeepCopy method to correctly handle GPUDetail.
pkg/scheduler/plugins/numaaware/numaaware.go
- Imported the new gpumanager package.
- Registered gpumanager.NewProvider() alongside the existing CPU manager provider.
- Modified getNodeNumaNumForTask to calculate and include GPU NUMA node count in the scoring logic.
- Added a new helper function getNumaNodeCntForGPUID to count NUMA nodes for assigned GPUs.
pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go
- Created a new file implementing the gpumanager provider.
- Implemented the policy.HintProvider interface for nvidia.com/gpu resources.
- Provided GetTopologyHints to generate NUMA-aware GPU placement hints.
- Implemented Allocate to assign GPUs based on the best topology hint, prioritizing aligned GPUs.
pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng_test.go
- Created a new file for unit tests of the gpumanager provider.
- Included test cases for GetTopologyHints to verify correct hint generation.
- Included test cases for Allocate to verify correct GPU assignment based on hints.

Activity

No human activity (comments, reviews) has been recorded for this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces GPU NUMA topology awareness to the numaaware scheduler plugin, a valuable feature for performance-sensitive GPU workloads. While the changes are well-structured and follow existing patterns, two medium-severity security issues were identified: a potential Denial of Service (DoS) due to exponential complexity in bitmask iteration with many NUMA nodes, and a potential integer overflow when handling large GPU resource requests. These require addressing with appropriate input validation and consistent data types. Additionally, a significant issue in the scoring logic could lead to suboptimal scheduling decisions, and there are a few minor suggestions for improvement.

gemini-code-assist · 2026-03-11T01:14:37Z

pkg/scheduler/plugins/numaaware/numaaware.go

+		cpuNumaCnt := getNumaNodeCntForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)
+
+		// Include GPU NUMA node count in scoring if GPU assignments exist.
+		gpuNumaCnt := 0
+		assignGPUs := resAssignMap[node.Name][string(gpumanager.NvidiaGPUResource)]
+		if assignGPUs.Size() > 0 && node.NumaSchedulerInfo.GPUDetail != nil {
+			gpuNumaCnt = getNumaNodeCntForGPUID(assignGPUs, node.NumaSchedulerInfo.GPUDetail)
+		}
+
+		// Total NUMA node count: prefer nodes where both CPU and GPU
+		// assignments span fewer NUMA nodes.
+		totalNumaCnt := cpuNumaCnt + gpuNumaCnt
 		nodeNumaCnts[index] = api.ScoredNode{
 			NodeName: node.Name,
-			Score:    int64(getNumaNodeCntForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)),
+			Score:    int64(totalNumaCnt),
 		}


The current scoring logic totalNumaCnt := cpuNumaCnt + gpuNumaCnt incorrectly calculates the total number of NUMA nodes used. It sums the counts of NUMA nodes for CPUs and GPUs separately, which doesn't correctly represent the total span of NUMA nodes when resources share a NUMA node. For example, if both CPUs and GPUs are on NUMA node 0, the current logic would score it as 1+1=2, which is the same score as when CPUs are on NUMA 0 and GPUs are on NUMA 1. The score should be the count of the union of NUMA nodes used by both CPUs and GPUs.

To fix this, you should calculate the union of NUMA node bitmasks for CPUs and GPUs first, and then count the bits in the resulting mask for the score. This will require changing getNumaNodeCntForGPUID to return a bitmask.BitMask, and also modifying getNumaNodeCntForCPUID (or creating a new helper) to return a bitmask.

Suggested change

cpuNumaCnt := getNumaNodeCntForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)

// Include GPU NUMA node count in scoring if GPU assignments exist.

gpuNumaCnt := 0

assignGPUs := resAssignMap[node.Name][string(gpumanager.NvidiaGPUResource)]

if assignGPUs.Size() > 0 && node.NumaSchedulerInfo.GPUDetail != nil {

gpuNumaCnt = getNumaNodeCntForGPUID(assignGPUs, node.NumaSchedulerInfo.GPUDetail)

}

// Total NUMA node count: prefer nodes where both CPU and GPU

// assignments span fewer NUMA nodes.

totalNumaCnt := cpuNumaCnt + gpuNumaCnt

nodeNumaCnts[index] = api.ScoredNode{

NodeName: node.Name,

Score: int64(getNumaNodeCntForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)),

Score: int64(totalNumaCnt),

}

cpuNumaMask := getNumaMaskForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)

// Include GPU NUMA node count in scoring if GPU assignments exist.

gpuNumaMask, _ := bitmask.NewBitMask()

assignGPUs := resAssignMap[node.Name][string(gpumanager.NvidiaGPUResource)]

if assignGPUs.Size() > 0 && node.NumaSchedulerInfo.GPUDetail != nil {

gpuNumaMask = getNumaMaskForGPUID(assignGPUs, node.NumaSchedulerInfo.GPUDetail)

}

// Total NUMA node count: prefer nodes where both CPU and GPU

// assignments span the minimum number of unique NUMA nodes.

totalNumaMask := cpuNumaMask.Or(gpuNumaMask)

nodeNumaCnts[index] = api.ScoredNode{

NodeName: node.Name,

Score: int64(totalNumaMask.Count()),

}

gemini-code-assist · 2026-03-11T01:14:37Z

pkg/scheduler/plugins/numaaware/numaaware.go

+func getNumaNodeCntForGPUID(gpus cpuset.CPUSet, gpuDetails api.GPUDetails) int {
+	mask, _ := bitmask.NewBitMask()
+	for _, gpuIdx := range gpus.List() {
+		if info, ok := gpuDetails[gpuIdx]; ok {
+			mask.Add(info.NUMANodeID)
+		}
+	}
+	return mask.Count()
+}


To implement the corrected scoring logic as suggested in the other comment, this function should be renamed to getNumaMaskForGPUID and return a bitmask.BitMask instead of an int.

Suggested change

func getNumaNodeCntForGPUID(gpus cpuset.CPUSet, gpuDetails api.GPUDetails) int {

mask, _ := bitmask.NewBitMask()

for _, gpuIdx := range gpus.List() {

if info, ok := gpuDetails[gpuIdx]; ok {

mask.Add(info.NUMANodeID)

}

}

return mask.Count()

}

func getNumaMaskForGPUID(gpus cpuset.CPUSet, gpuDetails api.GPUDetails) bitmask.BitMask {

mask, _ := bitmask.NewBitMask()

for _, gpuIdx := range gpus.List() {

if info, ok := gpuDetails[gpuIdx]; ok {

mask.Add(info.NUMANodeID)

}

}

return mask

}

gemini-code-assist · 2026-03-11T01:14:37Z

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

+	minAffinitySize := gpuDetail.NUMANodes().Size()
+	hints := []policy.TopologyHint{}
+
+	bitmask.IterateBitMasks(gpuDetail.NUMANodes().List(), func(mask bitmask.BitMask) {


The generateGPUTopologyHints function calls bitmask.IterateBitMasks with a list of NUMA node IDs. This function iterates over all possible non-empty combinations of the provided NUMA nodes, which has a complexity of $2^N - 1$, where $N$ is the number of unique NUMA nodes. If a node reports a large number of NUMA nodes (e.g., due to misconfiguration or a compromised node agent), the scheduler will consume excessive CPU time, leading to a Denial of Service (DoS). While the number of NUMA nodes is typically small, the lack of validation on this input makes the scheduler vulnerable to resource exhaustion.

gemini-code-assist · 2026-03-11T01:14:37Z

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

+	if !ok {
+		return 0
+	}
+	return int(gpuQuantity.Value())


The requestedGPUs function converts a GPU resource request value from int64 to int. On 32-bit systems, or with extremely large request values, this can result in an integer overflow. A negative value resulting from such an overflow would cause incorrect behavior in the hint generation and allocation logic, potentially leading to suboptimal or incorrect scheduling decisions.

gemini-code-assist · 2026-03-11T01:14:37Z

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

@@ -0,0 +1,186 @@
+/*
+Copyright 2026 The Volcano Authors.


The copyright year is set to 2026, which is in the future. Please correct it to the current year.

Suggested change

Copyright 2026 The Volcano Authors.

Copyright 2024 The Volcano Authors.

You are drunk gemini 🗡️

gemini-code-assist · 2026-03-11T01:14:37Z

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

+		// Count how many available GPUs fall within this NUMA node combination.
+		numMatching := 0
+		for _, gpuIdx := range availableGPUs.List() {
+			if gpuInfo, ok := gpuDetail[gpuIdx]; ok {
+				if mask.IsSet(gpuInfo.NUMANodeID) {
+					numMatching++
+				}
+			}
+		}


The calculation of numMatching is performed inside the IterateBitMasks loop, which iterates over all available GPUs for each NUMA node combination. This can be inefficient. You can optimize this by pre-calculating the number of available GPUs per NUMA node before the loop.

For example:

// Before IterateBitMasks availablePerNuma := make(map[int]int) for _, gpuIdx := range availableGPUs.List() { if gpuInfo, ok := gpuDetail[gpuIdx]; ok { availablePerNuma[gpuInfo.NUMANodeID]++ } } bitmask.IterateBitMasks(..., func(mask bitmask.BitMask) { // ... // Replace the loop with this: numMatching := 0 for _, numaID := range mask.GetBits() { numMatching += availablePerNuma[numaID] } // ... })

gemini-code-assist · 2026-03-11T01:14:37Z

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng_test.go

@@ -0,0 +1,305 @@
+/*
+Copyright 2026 The Volcano Authors.


The copyright year is set to 2026, which is in the future. Please correct it to the current year.

Suggested change

Copyright 2026 The Volcano Authors.

Copyright 2024 The Volcano Authors.

Copilot

Pull request overview

Adds GPU NUMA-topology awareness to the numaaware scheduler plugin so GPU workloads are preferentially placed on nodes where GPU allocations span fewer NUMA nodes (reducing cross-NUMA PCIe/NVLink traffic).

Changes:

Extends scheduler NUMA topology API (NumatopoInfo) with GPU topology (GPUDetail) and helper methods.
Introduces a new gpumanager hint provider to generate/allocate GPU topology hints for nvidia.com/gpu.
Updates numaaware scoring to account for GPU NUMA node span in addition to CPU.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
pkg/scheduler/api/numa_info.go	Adds GPU topology types/helpers and deep-copy support via `GPUDetail`.
pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go	Implements GPU topology hint generation and allocation for `nvidia.com/gpu`.
pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng_test.go	Adds unit tests for GPU hint generation and allocation behavior.
pkg/scheduler/plugins/numaaware/numaaware.go	Registers GPU provider and includes GPU NUMA span in node scoring.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-11T01:15:56Z

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

+func generateGPUTopologyHints(availableGPUs cpuset.CPUSet, gpuDetail api.GPUDetails, request int) []policy.TopologyHint {
+	minAffinitySize := gpuDetail.NUMANodes().Size()
+	hints := []policy.TopologyHint{}
+
+	bitmask.IterateBitMasks(gpuDetail.NUMANodes().List(), func(mask bitmask.BitMask) {
+		// Count the total GPUs in the NUMA nodes covered by this mask.
+		gpusInMask := gpuDetail.GPUsInNUMANodes(mask.GetBits()...).Size()
+		if gpusInMask >= request && mask.Count() < minAffinitySize {
+			minAffinitySize = mask.Count()
+		}


In generateGPUTopologyHints, minAffinitySize is updated based on the total GPUs in a NUMA-mask (gpuDetail.GPUsInNUMANodes), not on the available GPUs. If each single NUMA node has enough total GPUs but not enough available GPUs to satisfy the request, all feasible hints will have mask.Count() > minAffinitySize and will end up with Preferred=false (i.e., no preferred hint), which can degrade/incorrectly influence policy merging. Compute minAffinitySize from the masks that actually satisfy numMatching >= request (or compute min over the constructed hints) and add a unit test for the ‘must span NUMA nodes due to availability’ case.

pkg/scheduler/api/numa_info.go

The previous scoring logic summed CPU and GPU NUMA node counts separately (cpuNumaCnt + gpuNumaCnt), which incorrectly double-counted shared NUMA nodes. For example, if CPUs and GPUs were both on NUMA 0, the score was 1+1=2 — the same as if they were on different NUMA nodes. Fix by computing the union of CPU and GPU NUMA bitmasks before counting. Co-located CPU+GPU on the same NUMA node now correctly scores 1. Rename helper functions to return bitmask.BitMask instead of int: - getNumaNodeCntForCPUID -> getNumaMaskForCPUID - getNumaNodeCntForGPUID -> getNumaMaskForGPUID Signed-off-by: pmady <pavan4devops@gmail.com>

JesseStutler · 2026-03-11T08:34:01Z

Thanks for your contribution! Will take a look 👍 , also /cc @zjj2wry

Remove redundant nil check before len() on GPUDetail map. len() for nil maps is defined as zero in Go. Signed-off-by: pmady <pavan4devops@gmail.com>

hajnalmt

Hello 👋 ,
Thank you for the contribution!
This is quite a wasp nest you put your hand in. 😄 I don't think that an implementation without any runtime plumbing makes sense here currently (without proper NumaTopology CRD extension, population from exporter, scheduler cache translation...). Without this, the new provider cannot be effective in real clusters.

Also, maybe can we avoid hardcoding "nvidia.com/gpu" in the provider design and make it resource-name driven, with general GPUs so other accelerators (e.g. Ascend) can be supported with the same framework? NVIDIA-first implementation is fine, but the abstraction should remain generic.

But first things first. Probably we should update, refactor and implement this feature in resource-exporter at least for nvidia GPUs, and maybe in the scheduler. After that we can revisit this PR.

hajnalmt · 2026-03-13T12:49:22Z

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

@@ -0,0 +1,186 @@
+/*
+Copyright 2026 The Volcano Authors.


You are drunk gemini 🗡️

pkg/scheduler/api/numa_info.go

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

pmady · 2026-03-13T14:35:38Z

Hi @hajnalmt, thank you for the thorough review , you are absolutely right, and this is exactly the feedback I needed.

Acknowledging the Gap

I agree that the scheduler-side logic alone does not make sense without the full runtime plumbing. After reviewing the resource-exporter codebase, I can see the complete pipeline that is needed:

volcano-sh/apis Extend NumatopoSpec to include a GPUDetail field so the Numatopology CRD carries GPU topology from nodes to the scheduler.
volcano-sh/resource-exporter Add a GPUNumaInfo provider that implements the NumaInfo interface, discovers GPU-to-NUMA topology via sysfs (/sys/bus/pci/devices/<BDF>/numa_node), and registers it alongside the existing CPU provider.
volcano-sh/volcano (this PR) Consume GPUDetail in the scheduler hint provider and scoring.

I should have started from the data-source side. I have now implemented the missing pieces:

PRs Created

PR 1 → apis: api: add GPUInfo type and GPUDetail field to NumatopoSpec apis#229 Adds GPUInfo type and GPUDetail field to NumatopoSpec
PR 2 → resource-exporter: numatopo: add GPU NUMA topology discovery via sysfs resource-exporter#12 Implements GPUNumaInfo provider that discovers NVIDIA GPUs via sysfs PCI scanning and writes their NUMA affinity to the CRD
PR 3 → volcano (this PR): Scheduler-side consumption, to be rebased once the above are merged

On the Vendor-Agnostic Question

Good point about not hard-coding nvidia.com/gpu. I see two options:

Generic DeviceNumaInfo Discover all PCI devices with NUMA affinity via sysfs, then let the scheduler match by extended resource name. This would support AMD, Intel, and any future accelerators.
Start with NVIDIA, abstract later Ship a working NVIDIA implementation first, but behind a DeviceProvider interface that can be extended for ROCm/Intel later.

Which direction do you prefer? I am leaning toward option 2 (concrete first, abstract later) but happy to go either way.

I will keep this PR open and rebase it once the apis and resource-exporter PRs are merged. Please let me know if the approach looks good!

/cc @JesseStutler @zjj2wry

volcano-sh-bot · 2026-03-13T14:35:45Z

@pmady: GitHub didn't allow me to request PR reviews from the following users: zjj2wry.

Note that only volcano-sh members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

Hi @hajnalmt, thank you for the thorough review — you are absolutely right, and this is exactly the feedback I needed.

Acknowledging the Gap

I agree that the scheduler-side logic alone does not make sense without the full runtime plumbing. After reviewing the resource-exporter codebase, I can see the complete pipeline that is needed:

volcano-sh/apis — Extend NumatopoSpec to include a GPUDetail field so the Numatopology CRD carries GPU topology from nodes to the scheduler.

volcano-sh/resource-exporter — Add a GPUNumaInfo provider that implements the NumaInfo interface, discovers GPU-to-NUMA topology via sysfs (/sys/bus/pci/devices/<BDF>/numa_node), and registers it alongside the existing CPU provider.

volcano-sh/volcano (this PR) — Consume GPUDetail in the scheduler hint provider and scoring.

I should have started from the data-source side. I have now implemented the missing pieces:

PRs Created

PR 1 → apis: api: add GPUInfo type and GPUDetail field to NumatopoSpec apis#229 — Adds GPUInfo type and GPUDetail field to NumatopoSpec

PR 2 → resource-exporter: numatopo: add GPU NUMA topology discovery via sysfs resource-exporter#12 — Implements GPUNumaInfo provider that discovers NVIDIA GPUs via sysfs PCI scanning and writes their NUMA affinity to the CRD

PR 3 → volcano (this PR): Scheduler-side consumption, to be rebased once the above are merged

On the Vendor-Agnostic Question

Good point about not hard-coding nvidia.com/gpu. I see two options:

Generic DeviceNumaInfo — Discover all PCI devices with NUMA affinity via sysfs, then let the scheduler match by extended resource name. This would support AMD, Intel, and any future accelerators.

Start with NVIDIA, abstract later — Ship a working NVIDIA implementation first, but behind a DeviceProvider interface that can be extended for ROCm/Intel later.

Which direction do you prefer? I am leaning toward option 2 (concrete first, abstract later) but happy to go either way.

I will keep this PR open and rebase it once the apis and resource-exporter PRs are merged. Please let me know if the approach looks good!

/cc @JesseStutler @zjj2wry

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kingeasternsun · 2026-03-31T03:00:10Z

Thanks for your PR，I have some questions:
It seems there is a significant delay between the scheduler binding a Pod to a node and the eventual CPU/GPU allocation reflected by the exporter. Is there a plan or mechanism to reduce or handle this latency?

Also, since the scheduler does not have direct control over the underlying CPU and GPU allocation, how is this aspect expected to be managed?

JesseStutler · 2026-03-31T13:37:24Z

I think for cpu numa it's reasonable to add these judgements, but for those pods only requesting gpus:

volcano/pkg/scheduler/plugins/numaaware/numaaware.go

Lines 119 to 129 in 2946de8

    
           if v1qos.GetPodQOS(task.Pod) != v1.PodQOSGuaranteed { 
        
           	klog.V(3).Infof("task %s isn't Guaranteed pod", task.Name) 
        
           	return nil 
        
           } 
        
           if fit, err := filterNodeByPolicy(task, node, pp.nodeResSets); !fit { 
        
           	if err != nil { 
        
           		return api.NewFitError(task, node, err.Error()) 
        
           	} 
        
           	return nil 
        
           }

, I think these limits should be looser or be skipped @pmady @zjj2wry @hajnalmt @kingeasternsun WDYT?

- Add GPUInfo type and GPUDetail field to NumatopoSpec API - Populate GPUDetail in scheduler cache (getNumaInfo) - Add NUMA node count cap (maxNUMANodes=16) to prevent DoS in IterateBitMasks - Fix minAffinitySize to use available GPUs, not total - Optimize numMatching with pre-computed per-NUMA counts - Update deepcopy for GPUInfo and GPUDetail Signed-off-by: pmady <pavan4devops@gmail.com>

volcano-sh-bot · 2026-03-31T14:21:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hwdef for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

JesseStutler · 2026-03-31T13:25:19Z

pkg/scheduler/plugins/numaaware/numaaware.go

+		// Build a unified NUMA mask across CPU and GPU assignments.
+		// Using the union ensures that shared NUMA nodes are counted once,
+		// so co-located CPU+GPU on the same NUMA node scores better.
+		assignGPUs := resAssignMap[node.Name][string(gpumanager.NvidiaGPUResource)]


So currently we only support GPU NUMA alignment, could you also clarify it with a note in the https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_numa_aware.md?

Good point. The Guaranteed QoS check at line 121 is pre-existing code from the CPU NUMA implementation, this PR does not modify it. It is there because kubelet's Topology Manager only applies topology hints for Guaranteed pods, so filtering non-Guaranteed pods at the scheduler level keeps them consistent.

For GPU workloads in practice, most training/inference pods set explicit CPU and memory requests alongside GPU requests (our user guide examples all do this), so they end up as Guaranteed QoS and go through the NUMA predicate path.

That said, if there is a use case for GPU-only pods without CPU/memory requests getting topology-aware scheduling, we could add an alternative check (e.g., skip the QoS gate when the pod requests GPU resources). Happy to add that in a follow-up if you think it is needed, or handle it in this PR .Let me know your preference.

JesseStutler · 2026-03-31T14:34:19Z

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

+	// Mark hints with the minimal number of NUMA nodes as preferred.
+	for i := range hints {
+		if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
+			hints[i].Preferred = true


Consider a fragmented scenario on a 2-NUMA node machine:

NUMA 0 has 2 physical, 1 available.

NUMA 1 has 1 physical, 1 available.
Then a Pod requests 2 GPUs.

Mask {0} sets minAffinitySize to 1 (because physical >= 2), but yields no hint.

Mask {1} yields no hint.

Mask {0, 1} yields a valid hint with Count() == 2 (since 1+1 available >= 2).
The only valid hint {0, 1} will have Preferred = false because minAffinitySize was incorrectly dragged down to 1 by Mask {0}.
If the user's topology policy is restricted (which requires a Preferred==true hint to admit the pod), this Pod will be rejected by this node, even though the node has 2 available GPUs and crossing 2 NUMA nodes is actually the best possible allocation at this moment.

Good catch on the fragmented scenario. The current code already handles this correctly,minAffinitySize is only updated after the if numMatching < request { return } check, so masks without enough available GPUs are skipped before they can influence the minimum.

Tracing your example (NUMA 0: 2 physical/1 available, NUMA 1: 1 physical/1 available, request=2):

Mask {0}: numMatching=1 (available) < 2 → skipped, does not update minAffinitySize

Mask {1}: numMatching=1 < 2 → skipped

Mask {0,1}: numMatching=2 >= 2 → passes, sets minAffinitySize=2

So the hint {0,1} correctly gets Preferred=true.

I added a clarifying comment in the latest push to make this ordering explicit, so future readers do not accidentally move the minAffinitySize update above the numMatching check.

JesseStutler · 2026-03-31T14:36:40Z

@pmady I think the overall solution is great. But we're currently working on improving Volcano's usability and need to ensure each feature has a user guide. Could you also please update the numa aware user guide for the gpu using? Thanks 😄 https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_numa_aware.md

pmady · 2026-03-31T16:41:33Z

Thanks for your PR，I have some questions: It seems there is a significant delay between the scheduler binding a Pod to a node and the eventual CPU/GPU allocation reflected by the exporter. Is there a plan or mechanism to reduce or handle this latency?

Also, since the scheduler does not have direct control over the underlying CPU and GPU allocation, how is this aspect expected to be managed?

@kingeasternsun Great question on latency.

GPU topology data follows the exact same CRD watch path as CPU topology — the resource-exporter writes to the Numatopology CRD, and the scheduler cache receives updates via the existing informer. There is no additional API call or polling loop introduced.

The gpuDetail payload is small (typically 8–16 entries per node, one per GPU) and only updates when the physical topology changes (e.g., a GPU is added/removed), not on every scheduling cycle.

During hint generation, the gpuMng provider pre-computes available GPUs per NUMA node in a single pass (O(n) where n = number of available GPUs), so the per-container overhead is minimal. The maxNUMANodes=16 cap also guarantees bounded O(2^N) bitmask iteration.

In summary: no measurable scheduling latency impact for typical GPU node configurations (2–8 NUMA nodes, 4–16 GPUs).

Add comprehensive GPU topology-aware scheduling documentation including: - Architecture overview with data flow diagram - Prerequisites for GPU NUMA scheduling - Verification steps for GPU topology data in Numatopology CRD - Practical YAML examples for Volcano Jobs, single-numa-node policy, and Kubeflow PyTorchJob with GPU NUMA affinity - Scoring explanation for mixed CPU+GPU workloads - Troubleshooting table for common GPU scheduling issues - Update design doc: move GPU from Non-Goals to Goals (added in v1.11) Signed-off-by: pmady <pavan4devops@gmail.com>

pmady · 2026-03-31T19:40:22Z

All review feedback has been addressed in the latest push. Summary of changes:

API & Plumbing:

Ported GPUInfo type and GPUDetail field to volcano staging directory (no separate apis PR needed)
Added GPUDetail population in scheduler cache getNumaInfo() with string→int key conversion
Generated deepcopy methods for GPUInfo and GPUDetail map

GPU Manager Fixes:

Added maxNUMANodes=16 cap to prevent O(2^N) resource exhaustion from misconfigured node agents
Fixed minAffinitySize to use available GPUs only (not total capacity)
Optimized numMatching with pre-computed per-NUMA available GPU counts (single O(n) pass)

Scoring:

getNodeNumaNumForTask() now computes the union of CPU + GPU NUMA bitmasks, ensuring co-located CPU+GPU scores better

Documentation:

Updated docs/user-guide/how_to_use_numa_aware.md with comprehensive GPU scheduling guide including architecture, YAML examples, verification steps, and troubleshooting
Updated docs/design/numa-aware.md to add GPU scheduling to Goals (removed from Non-Goals)

Resource-Exporter (companion PR volcano-sh/resource-exporter#12):

Sorted PCI device entries by BDF address for stable GPU indexing
Fixed go.mod replace directive for development builds

All tests pass locally. Ready for re-review.

JesseStutler · 2026-04-01T02:05:03Z

docs/design/numa-aware.md

-
-## Non-Goals
- - Support other resources topology schedule, such as GPU.
+ - Support GPU resource topology scheduling (added in v1.11)


🤯 Why was it added in v1.11 here, and the version number is also wrong, the user guide shouldn't have this kind of thing

Fixed, removed the version reference in the latest push.

JesseStutler · 2026-04-01T02:05:56Z

docs/user-guide/how_to_use_numa_aware.md

+
+## GPU NUMA-Aware Scheduling
+
+Starting from Volcano v1.11, the numa-aware plugin supports **GPU topology-aware scheduling**. This enables Volcano to consider GPU-to-NUMA node affinity when placing GPU workloads, reducing cross-NUMA memory traffic and improving performance for AI/ML training and inference.


We don't need to add Starting from volcano v1.11 .... , and the version is wrong, we just need to teach users how to use it in the user guide

Fixed. removed the version reference and just kept the usage instructions.

JesseStutler · 2026-04-01T02:09:25Z

docs/user-guide/how_to_use_numa_aware.md

+2. When both CPUs and GPUs are requested, the scheduler favors nodes where they share the **same NUMA node**, minimizing cross-NUMA data transfers.
+3. Nodes that cannot satisfy the GPU request within the topology policy are **filtered out** during the predicate phase.
+
+```


Using a real drawing would be better, this drawing is slightly misaligned and looks AI-generated, which will likely raise questions from other communities :)

Removed the ASCII diagram, replaced with a text description of the data flow.

JesseStutler · 2026-04-01T02:10:47Z

@pmady Please check the code verify CI, should also update the CRD yamls

kingeasternsun · 2026-04-02T09:59:41Z

We truly appreciate your contribution. Would you be willing to present and walk us through your approach at the community weekly meeting?

kingeasternsun · 2026-04-03T01:39:43Z

Thanks for your PR，I have some questions: It seems there is a significant delay between the scheduler binding a Pod to a node and the eventual CPU/GPU allocation reflected by the exporter. Is there a plan or mechanism to reduce or handle this latency?
Also, since the scheduler does not have direct control over the underlying CPU and GPU allocation, how is this aspect expected to be managed?

@kingeasternsun Great question on latency.

GPU topology data follows the exact same CRD watch path as CPU topology — the resource-exporter writes to the Numatopology CRD, and the scheduler cache receives updates via the existing informer. There is no additional API call or polling loop introduced.

The gpuDetail payload is small (typically 8–16 entries per node, one per GPU) and only updates when the physical topology changes (e.g., a GPU is added/removed), not on every scheduling cycle.

During hint generation, the gpuMng provider pre-computes available GPUs per NUMA node in a single pass (O(n) where n = number of available GPUs), so the per-container overhead is minimal. The maxNUMANodes=16 cap also guarantees bounded O(2^N) bitmask iteration.

In summary: no measurable scheduling latency impact for typical GPU node configurations (2–8 NUMA nodes, 4–16 GPUs).

Thanks @pmady for your reply.
In your PR, you mentioned “allocate GPUs from preferred NUMA nodes first, then spill over.” However, we know that actual GPU allocation is handled by the device plugin. So how can we ensure that the device plugin allocates according to the scheduler’s expectations? With the current device plugin interface, this cannot be achieved—unless DRA is used.

docs/user-guide/how_to_use_numa_aware.md

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go

pmady · 2026-04-04T02:55:50Z

We truly appreciate your contribution. Would you be willing to present and walk us through your approach at the community weekly meeting?

Thanks @kingeasternsun! I appreciate the invitation. I'd be happy to walk through the design and implementation approach. could you share the meeting schedule (day/time and timezone)? I'm in US Central Time (UTC-5), so depending on the slot I may need to join asynchronously.

If the timing doesn't work out, I could also prepare a written technical walkthrough covering the architecture, data flow, and design decisions, and post it as a discussion or document for the community to review. Happy to go with whatever format works best for the team.

pmady · 2026-04-04T03:08:15Z

@kingeasternsun Great question on latency.
GPU topology data follows the exact same CRD watch path as CPU topology — the resource-exporter writes to the Numatopology CRD, and the scheduler cache receives updates via the existing informer. There is no additional API call or polling loop introduced.
The gpuDetail payload is small (typically 8–16 entries per node, one per GPU) and only updates when the physical topology changes (e.g., a GPU is added/removed), not on every scheduling cycle.
During hint generation, the gpuMng provider pre-computes available GPUs per NUMA node in a single pass (O(n) where n = number of available GPUs), so the per-container overhead is minimal. The maxNUMANodes=16 cap also guarantees bounded O(2^N) bitmask iteration.
In summary: no measurable scheduling latency impact for typical GPU node configurations (2–8 NUMA nodes, 4–16 GPUs).

Thanks @pmady for your reply. In your PR, you mentioned “allocate GPUs from preferred NUMA nodes first, then spill over.” However, we know that actual GPU allocation is handled by the device plugin. So how can we ensure that the device plugin allocates according to the scheduler’s expectations? With the current device plugin interface, this cannot be achieved—unless DRA is used.

@kingeasternsun Great point, you're right that the current device plugin interface doesn't guarantee the kubelet will honor the scheduler's NUMA preference. This is actually the same fundamental gap that exists for CPU NUMA-aware scheduling today: the scheduler picks NUMA-aligned CPUs, but kubelet's CPU manager does the actual core assignment independently.

For GPUs specifically, there are a few paths forward:

Topology Manager alignment When kubelet's Topology Manager policy is set to restricted or single-numa-node, it will reject pods whose device allocations don't align with the NUMA hint. This gives a safety net: the scheduler picks the best node, and kubelet enforces the constraint.
DRA (Dynamic Resource Allocation) As you noted, KEP-4381 (DRA with structured parameters) is the proper long-term solution. Once DRA is GA, the scheduler can make binding allocation decisions that kubelet must respect.
Annotation-based hints A middle-ground approach where the scheduler passes its allocation preference via pod annotations, and a NUMA-aware device plugin reads and honors those hints.

This PR focuses on path (1): ensuring the scheduler picks nodes where NUMA alignment is possible, and relying on the Topology Manager to enforce it at the kubelet level. The value is in node selection without this, the scheduler might place a GPU pod on a node where cross-NUMA allocation is unavoidable, even when a better node exists.

I'll add a note about this limitation and the Topology Manager dependency in the user guide. Happy to discuss further at the community meeting as well!

- Remove fake v1.11 version references from user guide and design doc - Remove ASCII diagram flagged as AI-generated - Fix allocatable format to use CPUSet (e.g. '0-7') per review - Remove fabricated resource-exporter version, link to actual PR - Add Limitations section about device plugin alignment gap - Clarify minAffinitySize comment ordering in gpu_mng.go - Regenerate CRD YAMLs with gpuDetail field (controller-gen v0.20.0) - Regenerate installer development YAMLs (TAG=latest) - Update staging codegen (applyconfiguration for GPUInfo) Signed-off-by: pmady <pavan4devops@gmail.com>

pmady · 2026-04-04T03:58:01Z

@JesseStutler Done, all review comments addressed in the latest push (627c0c8):

Removed fake v1.11 version references from design doc and user guide
Removed the ASCII diagram, replaced with text data flow description
Fixed allocatable format to CPUSet (e.g., "0-7")
Regenerated CRD YAMLs with gpuDetail field (controller-gen v0.20.0)
Regenerated installer development YAMLs with TAG=latest
Updated staging codegen (applyconfiguration for GPUInfo)
Added Limitations section in user guide about the device plugin alignment gap
Clarified minAffinitySize comment ordering in gpu_mng.go

Code verify CI should pass now.

Automated verification for Volcano PRs: - volcano-sh/volcano#5095 (scheduler GPU NUMA awareness) - volcano-sh/resource-exporter#12 (GPU NUMA topology discovery) - volcano-sh/apis#229 (GPUInfo API types) Two modes: - test-existing-cluster.sh: For existing K8s GPU clusters - gpu-numa-test.sh: Creates GCP GPU VM from scratch Signed-off-by: Pavan Madduri <pavan4devops@gmail.com>

pmady · 2026-04-06T17:40:13Z

E2E Test Scripts for GPU NUMA-Aware Scheduling

I've created automated test scripts to make it easy for anyone with a GPU cluster to verify this feature end-to-end.

Repo: https://github.com/pmady/gpu-numa-test

Quick Start (existing GPU cluster)

git clone https://github.com/pmady/gpu-numa-test.git
cd gpu-numa-test
./test-existing-cluster.sh

What the script does

Pre-flight checks — verifies kubectl, GPU nodes, NUMA topology
Topology probe — deploys a privileged pod to read GPU-to-NUMA mapping via sysfs
Builds images — clones PR branches (volcano-sh/volcano#5095, resource-exporter#12), builds vc-scheduler and resource-exporter Docker images
Deploys Volcano — with numaaware plugin enabled + resource-exporter DaemonSet
Runs test jobs — 2-GPU job (prefer single NUMA) + 4-GPU job (cross-NUMA)
Checks scheduler logs — for NUMA scoring/hint entries
Prints PASS/FAIL summary + screenshot checklist

Prerequisites

K8s cluster with GPU node (2+ NUMA nodes, 4+ GPUs)
NVIDIA device plugin installed
kubelet Topology Manager: best-effort or restricted
kubectl, docker, go 1.23+

Options

./test-existing-cluster.sh                    # Full: build + deploy + test
./test-existing-cluster.sh --skip-build       # Skip image build
./test-existing-cluster.sh --skip-deploy      # Only run tests
./test-existing-cluster.sh --cleanup          # Remove all test resources

Also includes gpu-numa-test.sh for creating a GCP GPU VM from scratch (spot pricing ~$2/hr) if you don't have a cluster.

cc @JesseStutler @hajnalmt

kingeasternsun · 2026-04-08T01:29:25Z

@kingeasternsun Great question on latency.
GPU topology data follows the exact same CRD watch path as CPU topology — the resource-exporter writes to the Numatopology CRD, and the scheduler cache receives updates via the existing informer. There is no additional API call or polling loop introduced.
The gpuDetail payload is small (typically 8–16 entries per node, one per GPU) and only updates when the physical topology changes (e.g., a GPU is added/removed), not on every scheduling cycle.
During hint generation, the gpuMng provider pre-computes available GPUs per NUMA node in a single pass (O(n) where n = number of available GPUs), so the per-container overhead is minimal. The maxNUMANodes=16 cap also guarantees bounded O(2^N) bitmask iteration.
In summary: no measurable scheduling latency impact for typical GPU node configurations (2–8 NUMA nodes, 4–16 GPUs).

Thanks @pmady for your reply. In your PR, you mentioned “allocate GPUs from preferred NUMA nodes first, then spill over.” However, we know that actual GPU allocation is handled by the device plugin. So how can we ensure that the device plugin allocates according to the scheduler’s expectations? With the current device plugin interface, this cannot be achieved—unless DRA is used.

@kingeasternsun Great point, you're right that the current device plugin interface doesn't guarantee the kubelet will honor the scheduler's NUMA preference. This is actually the same fundamental gap that exists for CPU NUMA-aware scheduling today: the scheduler picks NUMA-aligned CPUs, but kubelet's CPU manager does the actual core assignment independently.

For GPUs specifically, there are a few paths forward:
1. **Topology Manager alignment** When kubelet's Topology Manager policy is set to `restricted` or `single-numa-node`, it will reject pods whose device allocations don't align with the NUMA hint. This gives a safety net: the scheduler picks the best node, and kubelet enforces the constraint.

2. **DRA (Dynamic Resource Allocation)** As you noted, KEP-4381 (DRA with structured parameters) is the proper long-term solution. Once DRA is GA, the scheduler can make binding allocation decisions that kubelet must respect.

3. **Annotation-based hints** A middle-ground approach where the scheduler passes its allocation preference via pod annotations, and a NUMA-aware device plugin reads and honors those hints.
This PR focuses on path (1): ensuring the scheduler picks nodes where NUMA alignment is possible, and relying on the Topology Manager to enforce it at the kubelet level. The value is in node selection without this, the scheduler might place a GPU pod on a node where cross-NUMA allocation is unavoidable, even when a better node exists.

I'll add a note about this limitation and the Topology Manager dependency in the user guide. Happy to discuss further at the community meeting as well!

Thanks, I got it!

kingeasternsun · 2026-04-08T01:33:54Z

We truly appreciate your contribution. Would you be willing to present and walk us through your approach at the community weekly meeting?

Thanks @kingeasternsun! I appreciate the invitation. I'd be happy to walk through the design and implementation approach. could you share the meeting schedule (day/time and timezone)? I'm in US Central Time (UTC-5), so depending on the slot I may need to join asynchronously.

If the timing doesn't work out, I could also prepare a written technical walkthrough covering the architecture, data flow, and design decisions, and post it as a discussion or document for the community to review. Happy to go with whatever format works best for the team.

Hi @pmady This is Volcano Community Meeting Notes / Agenda
https://docs.google.com/document/d/1YLbF8zjZBiR9PbXQPB22iuc_L0Oui5A1lddVfRnZrqs/edit?pli=1&tab=t.0#heading=h.u99fvvct3m1z

pmady added 4 commits March 10, 2026 20:07

Copilot AI review requested due to automatic review settings March 11, 2026 01:11

volcano-sh-bot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 11, 2026

volcano-sh-bot requested review from hajnalmt and shinytang6 March 11, 2026 01:11

volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 11, 2026

Copilot started reviewing on behalf of pmady March 11, 2026 01:11 View session

gemini-code-assist bot reviewed Mar 11, 2026

View reviewed changes

Copilot AI reviewed Mar 11, 2026

View reviewed changes

numaaware: fix staticcheck S1009 lint in gpumanager

c21702c

Remove redundant nil check before len() on GPUDetail map. len() for nil maps is defined as zero in Go. Signed-off-by: pmady <pavan4devops@gmail.com>

pmady mentioned this pull request Mar 11, 2026

Enhancement: Add GPU NUMA Topology Awareness to Scheduler #4998

Open

2 tasks

hajnalmt suggested changes Mar 13, 2026

View reviewed changes

volcano-sh-bot requested a review from JesseStutler March 13, 2026 14:35

pmady requested a review from hajnalmt March 18, 2026 00:40

pmady force-pushed the feature/gpu-numa-topology-awareness branch from 2f71592 to a40d1cc Compare March 31, 2026 14:21

volcano-sh-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 31, 2026

JesseStutler reviewed Mar 31, 2026

View reviewed changes

pmady requested a review from JesseStutler March 31, 2026 19:46

JesseStutler reviewed Apr 1, 2026

View reviewed changes

kingeasternsun reviewed Apr 3, 2026

View reviewed changes

docs/user-guide/how_to_use_numa_aware.md Outdated Show resolved Hide resolved

kingeasternsun reviewed Apr 3, 2026

View reviewed changes

pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go Show resolved Hide resolved

volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 4, 2026

	Copyright 2026 The Volcano Authors.
	Copyright 2024 The Volcano Authors.


		## GPU NUMA-Aware Scheduling

		Starting from Volcano v1.11, the numa-aware plugin supports GPU topology-aware scheduling. This enables Volcano to consider GPU-to-NUMA node affinity when placing GPU workloads, reducing cross-NUMA memory traffic and improving performance for AI/ML training and inference.

Conversation

pmady commented Mar 11, 2026

What type of PR is this?

What this PR does / why we need it

Changes

Example scenario (from #4998)

Which issue(s) this PR fixes

Does this PR introduce a user-facing change?

How Has This Been Tested?

Uh oh!

volcano-sh-bot commented Mar 11, 2026

Uh oh!

gemini-code-assist bot commented Mar 11, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

hajnalmt Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JesseStutler commented Mar 11, 2026

Uh oh!

hajnalmt left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hajnalmt Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

pmady commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Acknowledging the Gap

PRs Created

On the Vendor-Agnostic Question

Uh oh!

volcano-sh-bot commented Mar 13, 2026

Acknowledging the Gap

PRs Created

On the Vendor-Agnostic Question

Uh oh!

kingeasternsun commented Mar 31, 2026

Uh oh!

JesseStutler commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

hajnalmt left a comment •

edited

Loading

pmady commented Mar 13, 2026 •

edited

Loading

JesseStutler commented Mar 31, 2026 •

edited

Loading

pmady Apr 4, 2026 •

edited

Loading

pmady Apr 4, 2026 •

edited

Loading

pmady commented Mar 31, 2026 •

edited

Loading

pmady Apr 4, 2026 •

edited

Loading

pmady Apr 4, 2026 •

edited

Loading

pmady Apr 4, 2026 •

edited

Loading

pmady commented Apr 4, 2026 •

edited

Loading

pmady commented Apr 4, 2026 •

edited

Loading