119 changes: 117 additions & 2 deletions docs/design/hierarchical-queue-on-capacity-plugin.md
@@ -36,6 +36,16 @@ For example, consider the following hierarchical queue structure and jobs:

When job A performs reclaim or preempt actions, the system should first consider job B (belonging to the same parent queue a), and then consider job C (belonging to the same parent queue root).

### Story 5

When the capacity plugin runs with hierarchy enabled and `parentBasedReclaimEnabled` set, cross-parent reclaim should respect parent-level deserved resources.

If a child queue is over its own deserved but its parent queue is still below parent deserved, the child's extra usage is treated as intra-parent borrowing and should not be reclaimed by other parent trees.

Reclaim from that child becomes eligible only when both the child queue and the relevant parent queue are over deserved on reclaim-relevant dimensions.

For sibling reclaim under the same direct parent, existing child-level reclaim checks remain in effect.
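The eligibility rule above can be sketched as a small predicate. This is a minimal illustration only: the `queueAttr` type, its fields, and the function names below are hypothetical stand-ins for the capacity plugin's internal queue attribute state.

```go
package main

import "fmt"

// queueAttr is a hypothetical, simplified queue node for illustration; the
// real capacity plugin tracks full api.Resource values, not a scalar map.
type queueAttr struct {
	name      string
	deserved  map[string]float64 // per-resource deserved quota
	allocated map[string]float64 // per-resource current usage
	parent    *queueAttr         // nil for the root queue
}

// overDeserved reports whether the queue exceeds its deserved amount on at
// least one reclaim-relevant resource dimension.
func overDeserved(q *queueAttr, dims []string) bool {
	for _, d := range dims {
		if q.allocated[d] > q.deserved[d] {
			return true
		}
	}
	return false
}

// reclaimableCrossParent sketches the Story 5 rule: a victim queue in another
// parent tree is reclaimable only if both the child queue and its parent are
// over deserved on the reclaim-relevant dimensions.
func reclaimableCrossParent(victim *queueAttr, dims []string) bool {
	if !overDeserved(victim, dims) {
		return false // child within its own deserved: nothing to reclaim
	}
	if victim.parent != nil && !overDeserved(victim.parent, dims) {
		return false // extra usage is intra-parent borrowing, protected
	}
	return true
}

func main() {
	parent := &queueAttr{name: "project2_root",
		deserved: map[string]float64{"a100": 3}, allocated: map[string]float64{"a100": 4}}
	child := &queueAttr{name: "project2_preemptable",
		deserved: map[string]float64{"a100": 0}, allocated: map[string]float64{"a100": 4}, parent: parent}
	fmt.Println(reclaimableCrossParent(child, []string{"a100"})) // true: child and parent both exceed deserved
}
```

Note that sibling reclaim under the same direct parent never reaches the parent-level check, matching the last paragraph above.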

## Design detail

### Webhook
@@ -124,11 +134,116 @@ When job A performs reclaim or preempt actions, the system should first consider

- `VictimTaskOrderFn`: Prioritize tasks that belong to the same parent queue as the preemptor. If necessary, it will consider other tasks from the bottom up in the queue hierarchy.
- `VictimJobOrderFn`: Similar to `VictimTaskOrderFn`, prioritize jobs that belong to the same parent queue as the preemptor. If necessary, it will consider other jobs from the bottom up in the queue hierarchy.

- `updateParentQueue`: Update the resource status of parent queues whenever the `updateShare` function is executed. It will traverse the queue hierarchy from the bottom up and update the resource information of parent queues accordingly.

**Note:** The above modifications are primarily applicable when `EnabledHierarchy` is set to true. If the capacity plugin does not require hierarchical queue management, the existing implementations of these functions will be retained.
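The bottom-up propagation performed by `updateParentQueue` can be sketched as follows. The `queueAttr` type and function names are hypothetical simplifications for illustration; the real plugin updates its internal queue attributes whenever `updateShare` refreshes a leaf queue.

```go
package main

import "fmt"

// queueAttr is a hypothetical, simplified queue node; the capacity plugin's
// real attributes carry full api.Resource values rather than a scalar map.
type queueAttr struct {
	name      string
	allocated map[string]float64
	parent    *queueAttr // nil for the root queue
}

// updateParentQueues propagates a leaf queue's allocation delta to every
// ancestor, walking the hierarchy from the bottom up.
func updateParentQueues(leaf *queueAttr, resource string, delta float64) {
	for q := leaf.parent; q != nil; q = q.parent {
		q.allocated[resource] += delta
	}
}

func main() {
	root := &queueAttr{name: "root", allocated: map[string]float64{}}
	parent := &queueAttr{name: "a", allocated: map[string]float64{}, parent: root}
	leaf := &queueAttr{name: "a1", allocated: map[string]float64{"a100": 1}, parent: parent}

	// Leaf a1 gains one more a100; ancestors a and root are updated in turn.
	leaf.allocated["a100"]++
	updateParentQueues(leaf, "a100", 1)
	fmt.Println(parent.allocated["a100"], root.allocated["a100"]) // 2 would require prior state; here both print 1
}
```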

#### Configuration example

Enable hierarchy and parent-based reclaim in scheduler configuration:

```yaml
# The reclaim action must be listed for reclaim decisions to run.
actions: "enqueue, allocate, reclaim, backfill"
tiers:
- plugins:
  - name: capacity
    enableHierarchy: true
    arguments:
      parentBasedReclaimEnabled: true
  - name: gang
  - name: priority
```

#### Unit test scenarios and behavior map

The table-driven test `Test_capacityPlugin_ParentBasedReclaimScenarios` covers case1 through case5, each with an explicit queue topology and the expected reclaim behavior when `parentBasedReclaimEnabled=true`.

**case1: Can reclaim based on parent deserved when parentBasedReclaimEnabled is true**

```mermaid
graph TD
R[root]
Q1[case1_queue1<br/>deserved: a100=1<br/>capability: unset]
Q11[case1_queue11<br/>deserved: unset<br/>capability: unset]
Q2[case1_queue2<br/>deserved: unset<br/>capability: unset]
R --> Q1
Q1 --> Q11
R --> Q2
```

- Workloads: `p1` running on `n1` in `case1_queue2` requests `a100=4`; `p2` pending in `case1_queue11` requests `a100=1`.
- Setting: `parentBasedReclaimEnabled=true`.
- Expected: `p1` is evicted and `p2` (podgroup `pg2`) is pipelined to `n1`.

**case2: Cross-parent reclaim pipelines project1 pending job when project2 child and parent are both over deserved**

```mermaid
graph TD
R[root]
P1[project1_root<br/>deserved: a100=1<br/>capability: a100=2]
P1NP[project1_non-preemptable<br/>deserved: a100=1<br/>capability: a100=1]
P1P[project1_preemptable<br/>deserved: unset<br/>capability: unset]
P2[project2_root<br/>deserved: a100=3<br/>capability: a100=4]
P2NP[project2_non-preemptable<br/>deserved: a100=3<br/>capability: a100=3]
P2P[project2_preemptable<br/>deserved: unset<br/>capability: unset]
R --> P1
P1 --> P1NP
P1 --> P1P
R --> P2
P2 --> P2NP
P2 --> P2P
```

- Workloads: `p1..p4` running on `n1` in `project2_preemptable` (`a100=1` each), `p5` pending in `project1_preemptable` (`a100=1`).
- Setting: `parentBasedReclaimEnabled=true`.
- Expected: lower-priority `p4` is evicted and `p5` (podgroup `pg5`) is pipelined to `n1`.

**case3: Cross-parent reclaim can evict sibling-queue victim first**

Queue topology is identical to case2.

- Workloads: a continuation of case2's state: `p1..p3` running on `n1` in `project2_preemptable` (`a100=1` each), `p4` pending as reclaimed (`a100=1`), `p5` already running in `project1_preemptable` (`a100=1`), and `p6` pending in `project2_non-preemptable` (`a100=1`).
- Setting: `parentBasedReclaimEnabled=true`.
- Expected: `p3` is evicted and `p6` (podgroup `pg6`) is pipelined to `n1`.

**case4: Cross-parent reclaim is blocked when victim child is not over deserved**

```mermaid
graph TD
R[root]
P1[case4_parent1<br/>deserved: a100=1<br/>capability: unset]
C1[case4_child1<br/>deserved: a100=1<br/>capability: unset]
C1S[case4_child1_sibling<br/>deserved: unset<br/>capability: unset]
P2[case4_parent2<br/>deserved: a100=1<br/>capability: unset]
C2[case4_child2<br/>deserved: a100=1<br/>capability: unset]
R --> P1
P1 --> C1
P1 --> C1S
R --> P2
P2 --> C2
```

- Workloads: `p1` running on `n1` in `case4_child1` (`a100=1`), `p3` running in the sibling queue `case4_child1_sibling` (marked non-preemptable, `a100=1`), `p2` pending in `case4_child2` (`a100=1`).
- Setting: `parentBasedReclaimEnabled=true`.
- Expected: no eviction and no pipeline.

**case5: Parent-based reclaim is blocked when sibling leaves have no relevant deserved signal**

```mermaid
graph TD
R[root]
P[parent<br/>deserved: cpu=4, mem=4Gi<br/>capability: cpu=4, mem=4Gi]
QA[queue-a<br/>deserved: unset<br/>capability: unset]
QB[queue-b<br/>deserved: unset<br/>capability: unset]
R --> P
P --> QA
P --> QB
```

- Workloads: `p-victim` running on `n1` in `queue-b` (`cpu=2, mem=2Gi`), `p-reclaimer` pending in `queue-a` (`cpu=2, mem=2Gi`).
- Setting: `parentBasedReclaimEnabled=true`.
- Expected: no eviction and no pipeline.

### Vcctl

- Design relevant vcctl commands, such as commands to obtain the child queues of a specific queue or commands to retrieve the entire hierarchical queue structure.
@@ -137,4 +252,4 @@ When job A performs reclaim or preempt actions, the system should first consider

- The current design does not support scheduling jobs/podgroups to non-leaf queues. Only leaf queues can directly schedule and allocate resources to jobs/podgroups.
- The current design does not consider the queue migration issues related to hierarchical queue management in order to avoid manual operation and maintenance.
- When introducing a paused scheduling state for queues, optimize the management of the hierarchical queue structure under this state. For example, if a parent queue is paused for scheduling, it will cause its child queues to be paused as well. When resuming the parent queue, provide the capability to automatically resume the child queues in conjunction.
37 changes: 30 additions & 7 deletions pkg/scheduler/framework/session_plugins.go
```diff
@@ -656,8 +656,10 @@ func (ssn *Session) SubJobOrderFn(l, r interface{}) bool {
 	return lv.UID < rv.UID
 }
 
-// JobOrderFn invoke joborder function of the plugins
-func (ssn *Session) JobOrderFn(l, r interface{}) bool {
+// JobOrderCompareFn compares l and r by running enabled JobOrder plugins in
+// order and returning the first non-zero comparison result. It returns 0 if
+// all plugins consider l and r equal.
+func (ssn *Session) JobOrderCompareFn(l, r interface{}) int {
 	for _, tier := range ssn.Tiers {
 		for _, plugin := range tier.Plugins {
 			if !isEnabled(plugin.EnabledJobOrder) {
```
```diff
@@ -668,11 +670,20 @@ func (ssn *Session) JobOrderFn(l, r interface{}) bool {
 				continue
 			}
 			if j := jof(l, r); j != 0 {
-				return j < 0
+				return j
 			}
 		}
 	}
 
+	return 0
+}
+
+// JobOrderFn invoke joborder function of the plugins
+func (ssn *Session) JobOrderFn(l, r interface{}) bool {
+	if res := ssn.JobOrderCompareFn(l, r); res != 0 {
+		return res < 0
+	}
+
 	// If no job order funcs, order job by CreationTimestamp first, then by UID.
 	lv := l.(*api.JobInfo)
 	rv := r.(*api.JobInfo)
```
```diff
@@ -1087,9 +1098,21 @@ func (ssn *Session) HyperNodeGradientForSubJobFn(subJob *api.SubJobInfo, hyperNo
 }
 
 // BuildVictimsPriorityQueue returns a priority queue with victims sorted by:
-// if victims has same job id, sorted by !ssn.TaskOrderFn
-// if victims has different job id, sorted by !ssn.JobOrderFn
+// 1. If victims belong to the same job, use !ssn.TaskOrderFn.
+// 2. If either victim's job is missing, evict orphaned tasks first; if both
+//    are orphaned, use !ssn.TaskOrderFn.
+// 3. If the preemptor job is missing or victims are in the same queue, compare
+//    jobs with JobOrderCompareFn and use !ssn.TaskOrderFn as a tie-break.
+// 4. If victims are in different queues and preemptor job exists, use
+//    ssn.VictimQueueOrderFn.
 func (ssn *Session) BuildVictimsPriorityQueue(victims []*api.TaskInfo, preemptor *api.TaskInfo) *util.PriorityQueue {
+	jobThenTaskOrder := func(lvJob, rvJob *api.JobInfo, l, r interface{}) bool {
+		if cmp := ssn.JobOrderCompareFn(lvJob, rvJob); cmp != 0 {
+			return cmp > 0
+		}
+		return !ssn.TaskOrderFn(l, r)
+	}
+
 	victimsQueue := util.NewPriorityQueue(func(l, r interface{}) bool {
 		lv := l.(*api.TaskInfo)
 		rv := r.(*api.TaskInfo)
```
```diff
@@ -1121,14 +1144,14 @@
 		}
 
 		if !preemptorJobFound {
-			return !ssn.JobOrderFn(lvJob, rvJob)
+			return jobThenTaskOrder(lvJob, rvJob, l, r)
 		}
 
 		if lvJob.Queue != rvJob.Queue {
 			return ssn.VictimQueueOrderFn(ssn.Queues[lvJob.Queue], ssn.Queues[rvJob.Queue], ssn.Queues[preemptorJob.Queue])
 		}
 
-		return !ssn.JobOrderFn(lvJob, rvJob)
+		return jobThenTaskOrder(lvJob, rvJob, l, r)
 	})
 	for _, victim := range victims {
 		victimsQueue.Push(victim)
```
127 changes: 127 additions & 0 deletions pkg/scheduler/framework/session_plugins_victim_order_test.go
@@ -0,0 +1,127 @@

```go
/*
Copyright 2026 The Volcano Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package framework_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	schedulingv1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
	"volcano.sh/volcano/pkg/scheduler/api"
	"volcano.sh/volcano/pkg/scheduler/conf"
	"volcano.sh/volcano/pkg/scheduler/framework"
	"volcano.sh/volcano/pkg/scheduler/plugins/priority"
	"volcano.sh/volcano/pkg/scheduler/uthelper"
	"volcano.sh/volcano/pkg/scheduler/util"
)

func TestBuildVictimsPriorityQueueJobTieBreaksWithTaskOrder(t *testing.T) {
	trueValue := true
	plugins := map[string]framework.PluginBuilder{priority.PluginName: priority.New}
	highPri, lowPri := int32(100), int32(1)
	queueQ1 := util.BuildQueue("q1", 1, nil)
	nodeN1 := util.BuildNode("n1", api.BuildResourceList("10", "10Gi", []api.ScalarResource{{Name: "pods", Value: "20"}}...), nil)
	pgHigh := util.BuildPodGroup("pg-high", "ns1", "q1", 1, nil, schedulingv1.PodGroupRunning)
	pgLow := util.BuildPodGroup("pg-low", "ns1", "q1", 1, nil, schedulingv1.PodGroupRunning)
	pgHigh.CreationTimestamp = metav1.NewTime(time.Unix(20, 0))
	pgLow.CreationTimestamp = metav1.NewTime(time.Unix(10, 0))
	pgPreemptor := util.BuildPodGroup("pg-preemptor", "ns1", "q1", 1, nil, schedulingv1.PodGroupPending)
	pHigh := util.BuildPodWithPriority("ns1", "p-high", "n1", v1.PodRunning, api.BuildResourceList("1", "1Gi"), "pg-high", nil, nil, &highPri)
	pLow := util.BuildPodWithPriority("ns1", "p-low", "n1", v1.PodRunning, api.BuildResourceList("1", "1Gi"), "pg-low", nil, nil, &lowPri)
	pPreemptor := util.BuildPodWithPriority("ns1", "p-preemptor", "", v1.PodPending, api.BuildResourceList("1", "1Gi"), "pg-preemptor", nil, nil, &highPri)

	tiers := []conf.Tier{{
		Plugins: []conf.PluginOption{{
			Name:             priority.PluginName,
			EnabledJobOrder:  &trueValue,
			EnabledTaskOrder: &trueValue,
		}},
	}}

	findTask := func(ssn *framework.Session, podName string) *api.TaskInfo {
		for _, job := range ssn.Jobs {
			for _, task := range job.Tasks {
				if task.Name == podName {
					return task
				}
			}
		}
		return nil
	}

	baseTestStruct := func() uthelper.TestCommonStruct {
		return uthelper.TestCommonStruct{
			Plugins: plugins,
			Queues:  []*schedulingv1.Queue{queueQ1.DeepCopy()},
			Nodes:   []*v1.Node{nodeN1.DeepCopy()},
			PodGroups: []*schedulingv1.PodGroup{
				pgHigh.DeepCopy(),
				pgLow.DeepCopy(),
			},
			Pods: []*v1.Pod{
				pHigh.DeepCopy(),
				pLow.DeepCopy(),
			},
		}
	}

	t.Run("preemptor job found", func(t *testing.T) {
		tc := baseTestStruct()
		tc.PodGroups = append(tc.PodGroups, pgPreemptor.DeepCopy())
		tc.Pods = append(tc.Pods, pPreemptor.DeepCopy())
		ssn := tc.RegisterSession(tiers, nil)
		defer tc.Close()

		high := findTask(ssn, "p-high")
		low := findTask(ssn, "p-low")
		preemptor := findTask(ssn, "p-preemptor")
		assert.NotNil(t, high)
		assert.NotNil(t, low)
		assert.NotNil(t, preemptor)
		highJob := ssn.Jobs[high.Job]
		lowJob := ssn.Jobs[low.Job]
		assert.Equal(t, 0, ssn.JobOrderCompareFn(highJob, lowJob))

		victims := []*api.TaskInfo{high, low}
		pq := ssn.BuildVictimsPriorityQueue(victims, preemptor)
		first := pq.Pop().(*api.TaskInfo)
		assert.Equal(t, "p-low", first.Name)
	})

	t.Run("preemptor job missing", func(t *testing.T) {
		tc := baseTestStruct()
		ssn := tc.RegisterSession(tiers, nil)
		defer tc.Close()

		high := findTask(ssn, "p-high")
		low := findTask(ssn, "p-low")
		assert.NotNil(t, high)
		assert.NotNil(t, low)
		highJob := ssn.Jobs[high.Job]
		lowJob := ssn.Jobs[low.Job]
		assert.Equal(t, 0, ssn.JobOrderCompareFn(highJob, lowJob))

		victims := []*api.TaskInfo{high, low}
		pq := ssn.BuildVictimsPriorityQueue(victims, &api.TaskInfo{Job: api.JobID("missing")})
		first := pq.Pop().(*api.TaskInfo)
		assert.Equal(t, "p-low", first.Name)
	})
}
```