
GREP-375 add scheduler backend framework #372

Open
kangclzjc wants to merge 46 commits into ai-dynamo:main from kangclzjc:grep_scheduler_backend

Conversation

kangclzjc (Contributor) commented Jan 27, 2026

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Add scheduler backend framework to support multiple scheduler backends

Which issue(s) this PR fixes:

Fixes #275
Fixes #375

Special notes for your reviewer:

Does this PR introduce an API change?


Additional documentation e.g., enhancement proposals, usage docs, etc.:


Signed-off-by: kangclzjc <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
@kangclzjc kangclzjc marked this pull request as ready for review January 27, 2026 12:45
@kangclzjc kangclzjc changed the title GREP add scheduler backend framework GREP-375 add scheduler backend framework Jan 28, 2026
Signed-off-by: kangclzjc <[email protected]>
copy-pr-bot (bot) commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kangclzjc and others added 8 commits February 3, 2026 15:43
remove useless words

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
kangclzjc and others added 8 commits February 4, 2026 14:22
Signed-off-by: Kang Zhang <[email protected]>
remove phase1 in limitation

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>

For detailed lifecycle flow, see [PodGang Lifecycle Changes](#podgang-lifecycle-changes).

### Backend Interface Definition
Contributor

The interface currently omits the relationship between ClusterTopology and secondary resources. How do you envision the navigational link from the main topology to other specific Topology CRDs?

Contributor Author

Yes, this is a good point. Per my understanding, each scheduler backend should define the mapping once at backend initialization, and then we have several hooks: a PreparePod hook to modify the topology labels in the pod spec, and a SyncPodGang hook to translate the generic Topology into the scheduler-specific Topology in other CRDs.
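To make the hook flow above concrete, here is a minimal sketch of what such a backend contract could look like (the interface name and signatures are illustrative assumptions, not the GREP's final API):

```go
package schedulerbackend

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// PodGang stands in for Grove's PodGang API type (placeholder for this sketch).
type PodGang struct{ /* ... */ }

// Backend sketches the hooks discussed above. Each backend defines its
// topology mapping once at initialization, then implements:
type Backend interface {
	// PreparePod mutates a pod before creation, e.g. rewriting generic
	// topology labels in the spec into scheduler-specific ones.
	PreparePod(pod *corev1.Pod) error

	// SyncPodGang translates the PodGang's topology constraints into the
	// scheduler's own Topology/PodGroup CRDs.
	SyncPodGang(ctx context.Context, podGang *PodGang) error
}
```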

Contributor

Today we don't have a controller for ClusterTopology; we would add it as part of multi-cluster topology support, so it might need an extension point of its own.

Contributor Author

Yes, I added a future note in the GREP.


#### New Flow (With Framework):
1. **Create PodGang early** with PodGroups having empty PodReferences and `Initialized=False`
2. **Create Pods** (with scheduling gates to block scheduling)
Contributor

Could we do this without using the scheduling gate? At large scale it would be expensive to modify every pod's spec to remove the scheduling gate.

Contributor Author

I agree with you. If we could refine this scheduling gate, that would be a good enhancement. Maybe we should raise this question and discuss it in another GREP?

Contributor

Maybe a different question: what would happen if we did not use the scheduling gate at all (besides what we do today)?

Contributor Author

Pods would be schedulable as soon as they're created, and we couldn't guarantee gang scheduling.

Contributor Author

You're right. At large scale it would be expensive to modify every pod's spec to remove the scheduling gate. I will create another issue to track this.
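For reference, the gate mechanics under discussion look roughly like this (a sketch; the gate name is illustrative, not necessarily Grove's actual gate):

```go
package main

import corev1 "k8s.io/api/core/v1"

// addSchedulingGate blocks a pod from being scheduled until the gate is lifted.
func addSchedulingGate(pod *corev1.Pod) {
	pod.Spec.SchedulingGates = append(pod.Spec.SchedulingGates,
		corev1.PodSchedulingGate{Name: "grove.io/podgang-not-ready"}) // illustrative gate name
}

// liftSchedulingGates clears the gates so the scheduler can place the pod.
// At large scale this is one API update per pod, which is the cost raised above.
func liftSchedulingGates(pod *corev1.Pod) {
	pod.Spec.SchedulingGates = nil
}
```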

kangclzjc and others added 2 commits February 6, 2026 16:41
Move scheduler string to struct

Co-authored-by: Ron Kahn <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
@kangclzjc kangclzjc force-pushed the grep_scheduler_backend branch from b819afc to 4959656 Compare February 13, 2026 01:34
kangclzjc and others added 6 commits February 14, 2026 11:47
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
@kangclzjc kangclzjc force-pushed the grep_scheduler_backend branch from 7012497 to 3ac9845 Compare February 18, 2026 06:10
Signed-off-by: Kang Zhang <[email protected]>
@sanjaychatterjee (Collaborator) left a comment

LGTM. Made a couple of minor suggestions to update the GREP. Thanks!

2. Wait for all pods to have back-references to PodGang
3. Create PodGang with complete PodReferences

#### New Flow (With Framework):
Collaborator

Can you please clarify the flow for when the scheduler backend creates scheduler-specific CRs for the workload, e.g. PodGroup for KAI, or Workload for kube-scheduler?

Contributor Author

Yes, added. After the PodGang is created, the backend will create a scheduler-specific CR as described below.

1. **Create PodGang early** with PodGroups having empty PodReferences and `Initialized=False`.
2. **Backend creates scheduler-specific CRs**: The Backend Controller reconciles the new PodGang and calls `SyncPodGang()` on the resolved backend. The backend creates or updates its scheduler-specific resources (e.g. PodGroup for kai-scheduler, Workload for kube-scheduler when supported). These CRs must exist before pods are allowed to be scheduled so the scheduler can enforce gang/topology semantics.
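A rough sketch of step 2 under assumed types (the GVK and spec field below are illustrative stand-ins, not the actual KAI or Workload APIs):

```go
package schedulerbackend

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// syncPodGang ensures a scheduler-specific CR exists for a PodGang so the
// scheduler can enforce gang/topology semantics before pods are released.
func syncPodGang(ctx context.Context, c client.Client, name, namespace string, minMember int64) error {
	cr := &unstructured.Unstructured{}
	cr.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "scheduling.example.io", Version: "v1alpha1", Kind: "PodGroup", // hypothetical GVK
	})
	cr.SetName(name)
	cr.SetNamespace(namespace)
	if err := unstructured.SetNestedField(cr.Object, minMember, "spec", "minMember"); err != nil {
		return err
	}
	// Tolerate re-reconciles: creating an already-existing CR is not an error.
	if err := c.Create(ctx, cr); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```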

Abstraction layer bridging Grove and specific schedulers:
- **Backend Manager**: Singleton that initializes and provides access to active backend
- **KAI Backend**: Implementation for KAI scheduler (creates PodGroup CRs in future)
- **Kube Backend**: Minimal implementation for default kube-scheduler (no custom CRs)
Collaborator

Would you not be creating the Workload object if GangScheduling is enabled?

Contributor Author

If GangScheduling is enabled, that means the kube-scheduler supports gang scheduling (the Workload API), and Grove will create the Workload object to leverage the kube-scheduler's gang scheduling feature.
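Sketched as a conditional (all names here are hypothetical; the Workload API integration is future work):

```go
package schedulerbackend

import "context"

// kubeBackend sketches the kube-scheduler backend described above.
type kubeBackend struct {
	gangSchedulingEnabled bool // whether the scheduler supports the Workload API
}

// SyncPodGang creates a Workload object only when gang scheduling is enabled;
// the plain kube-scheduler needs no custom CRs.
func (b *kubeBackend) SyncPodGang(ctx context.Context, podGang *PodGang) error {
	if !b.gangSchedulingEnabled {
		return nil
	}
	return b.ensureWorkload(ctx, podGang)
}

// ensureWorkload would translate the PodGang into a Workload object
// (hypothetical helper; body omitted).
func (b *kubeBackend) ensureWorkload(ctx context.Context, podGang *PodGang) error {
	return nil
}
```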

kangclzjc and others added 5 commits February 19, 2026 08:12
Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
@kangclzjc kangclzjc force-pushed the grep_scheduler_backend branch from 16e1c09 to ef9a34e Compare February 19, 2026 01:24

#### Layer 4: Scheduler Layer
Kubernetes schedulers that actually place pods:
- **KAI Scheduler**: Gang scheduling with topology awareness
Collaborator

We should just mention that the backend schedulers in the scheduling layer will be responsible for providing supporting features like gang scheduling, topology-aware packing, gang preemption, etc. What you mention covers only some of the features for KAI and none for the kube-scheduler.


For detailed lifecycle flow, see [PodGang Lifecycle Changes](#podgang-lifecycle-changes).

### Backend Interface Definition
Collaborator

Rename this to Scheduler Backend Interface

PreparePod(pod *corev1.Pod)

// ValidatePodCliqueSet validates a PodCliqueSet for this scheduler backend.
// Called by the PodCliqueSet validation webhook (create and update). Backends can perform
Collaborator

// ValidatePodCliqueSet provides an ability to the scheduler backends to run additional
// validations on the PodCliqueSet resource. For example - if a scheduler does not yet support
// topology aware placements and if the PodCliqueSet has defined required topology pack constraints
// then it can choose to reject the PodCliqueSet by returning an error.

}
```
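As an illustration of the validation described in the suggested wording above (the PodCliqueSet placeholder type and its topology field are assumptions):

```go
package schedulerbackend

import "fmt"

// PodCliqueSet stands in for Grove's PodCliqueSet API type; the topology
// constraint field below is illustrative.
type PodCliqueSet struct {
	RequiredTopologyPack *string
}

// validatePodCliqueSet rejects a PodCliqueSet that requires topology-aware
// packing when the backend does not support it, per the example above.
func validatePodCliqueSet(pcs *PodCliqueSet, supportsTopologyPack bool) error {
	if pcs.RequiredTopologyPack != nil && !supportsTopologyPack {
		return fmt.Errorf("scheduler backend does not support required topology pack constraints")
	}
	return nil
}
```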

**Future note:** Cluster topology (e.g. multi-cluster topology support) may require its own extension point or additional methods on this interface; the interface is expected to evolve as those needs are clarified.
Collaborator

This point lacks context and is therefore quite unclear. If it is not a goal of this GREP, then it should be added as a non-goal.


### Backend Manager

The manager initializes scheduler backends: the kube-scheduler backend is always created and active; additional backends are created from OperatorConfiguration profiles. It provides access by name and a default:
Collaborator

This is not entirely correct. It will initialize the enabled scheduler backends. This component does not assume a default as that is the job of the OperatorConfiguration. The defaulting happens there and not here.
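Reflecting that correction, a minimal sketch of the manager (illustrative names; defaulting stays in the OperatorConfiguration):

```go
package schedulerbackend

import "fmt"

// BackendManager initializes only the scheduler backends enabled in the
// OperatorConfiguration and provides lookup by name. It performs no
// defaulting of its own.
type BackendManager struct {
	backends map[string]Backend
}

func NewBackendManager(enabled []string) (*BackendManager, error) {
	m := &BackendManager{backends: make(map[string]Backend)}
	for _, name := range enabled {
		b, err := newBackend(name) // hypothetical per-backend constructor
		if err != nil {
			return nil, fmt.Errorf("initializing backend %q: %w", name, err)
		}
		m.backends[name] = b
	}
	return m, nil
}

// Get returns the backend for the given name, if it was enabled.
func (m *BackendManager) Get(name string) (Backend, bool) {
	b, ok := m.backends[name]
	return b, ok
}
```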

4. **Update PodGang** with PodReferences once all pods are created, and set `Initialized=True`.
5. **Scheduling gates removed** to allow pods to be scheduled. The scheduler uses the backend-created CRs (PodGroup/Workload) when placing pods.
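Step 4 might look like this with the standard condition helpers (a sketch; the PodGang status conditions field is assumed):

```go
package schedulerbackend

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markInitialized sets the Initialized condition once all pods are created
// and their references are populated in the PodGang.
func markInitialized(conditions *[]metav1.Condition) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Initialized",
		Status:  metav1.ConditionTrue,
		Reason:  "AllPodsCreated",
		Message: "all pods created and pod references populated",
	})
}
```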

#### New PodGang Status Condition
Collaborator

Create a sub-heading under Revised PodGang Creation Flow just after it:

**Revised PodGang Creation Flow**

To understand the new PodGang creation flow, we first introduce the enhancements made to the PodGangStatus.

**PodGang API enhancements**

A new metav1.Condition has been introduced for PodGang.

```go
const (
	// PodGangConditionTypeInitialized indicates that the PodGang has been populated
	// with pod references and pods can lift scheduling gates.
	PodGangConditionTypeInitialized PodGangConditionType = "Initialized"
)
```

A PodGang is considered Initialized when:

- All constituent Pods are created.
- Pods back-reference their PodGang via a grove.io/podgang label.
- PodGang.Spec.PodGroups have PodReferences fully populated.

NOTE: Field PodReferences in PodGang.Spec.PodGroups is subject to change. If it does then this GREP will need to be updated accordingly.

**Creation Flow**

< here you define the creation flow >

| Status | Reason | Description |
| ------- | ---------------- | ----------- |
| `True`  | `AllPodsCreated` | All pods have been created and references populated |
| `False` | `PodsNotCreated` | Waiting for all pods to be created and for all pod references to be filled in the PodGang |
Collaborator

Need to revisit this reason, since the description is overloaded with two different reasons.


Unit tests will be implemented for all framework related components:

**Backend Interface and Registry** (`operator/internal/schedulerBackend/`)
Collaborator

This list of tests will go stale in no time. Do you have a better suggestion?
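One option, sketched below, is a single table-driven conformance test that iterates over every registered backend, so new backends are covered without editing the test list (the registry accessor is hypothetical):

```go
package schedulerbackend

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// TestBackendConformance runs the same checks against every registered
// backend, so newly added backends are covered automatically.
func TestBackendConformance(t *testing.T) {
	for name, backend := range allRegisteredBackends() { // hypothetical accessor
		t.Run(name, func(t *testing.T) {
			pod := &corev1.Pod{}
			if err := backend.PreparePod(pod); err != nil {
				t.Fatalf("PreparePod: %v", err)
			}
		})
	}
}
```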


#### E2E Tests

All existing e2e tests should pass with all supported schedulers.
Collaborator

What you're missing here are the changes needed for E2E, which today always assumes a specific scheduler backend (KAI). Currently there is no way to configure that.

#### Alpha
- Core backend interface defined and implemented
- Backend registry functional
- Basic operator configuration support
Collaborator

Should you add the KAI implementation to Alpha?


Development

Successfully merging this pull request may close these issues:

- GREP: add scheduler backend framework
- Add Native Support for Kubernetes Workload API to Enable Gang Scheduling
