GREP-375 add scheduler backend framework #372
kangclzjc wants to merge 46 commits into ai-dynamo:main
Conversation
> For detailed lifecycle flow, see [PodGang Lifecycle Changes](#podgang-lifecycle-changes).
>
> ### Backend Interface Definition
The interface currently omits the relationship between ClusterTopology and secondary resources. How do you envision the navigational link from the main topology to other specific Topology CRDs?
Yes, this is a good point. Per my understanding, each scheduler backend should define that mapping once during backend initialization, and then we have several hooks: `PreparePod` to modify the topology labels in the pod spec, and a `SyncPodGang` hook to translate the Topology into the scheduler-specific Topology CRDs.
Today we don't have a controller for ClusterTopology; we would add it as part of multi-cluster topology support, so it might need an extension point of its own.
Yes, I added a future note in the GREP.
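For reference, here is a minimal sketch of how these hooks could hang together in Go. The type names and the `PodGang` placeholder are assumptions for illustration, not the interface defined in the GREP:

```go
// Sketch only: illustrative assumptions, not the final interface in the GREP.
package schedulerbackend

import (
	corev1 "k8s.io/api/core/v1"
)

// PodGang is a placeholder standing in for Grove's PodGang API type.
type PodGang struct{}

// SchedulerBackend abstracts a concrete scheduler (e.g. kai-scheduler or the
// default kube-scheduler) behind a small set of hooks.
type SchedulerBackend interface {
	// Name returns the backend name referenced by the OperatorConfiguration.
	Name() string

	// PreparePod mutates a pod before creation, e.g. setting the scheduler
	// name and topology labels in the pod spec.
	PreparePod(pod *corev1.Pod)

	// SyncPodGang creates or updates the scheduler-specific resources
	// (e.g. PodGroup or Workload) that mirror the given PodGang, including
	// any translation of topology constraints into scheduler-specific CRDs.
	SyncPodGang(podGang *PodGang) error
}
```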
> #### New Flow (With Framework):
> 1. **Create PodGang early** with PodGroups having empty PodReferences and `Initialized=False`
> 2. **Create Pods** (with scheduling gates to block scheduling)
Could we do this without using the scheduling gate? At large scale it would be expensive to modify every pod's spec to remove the scheduling gate.
I agree with you. If we could refine this scheduling gate approach, that would be a good enhancement. Maybe we should raise this question and discuss it in another GREP?
A different question: what would happen if we did not use the scheduling gate at all (beyond what we do today)?
Pods would be schedulable as soon as they're created, so gang scheduling couldn't be guaranteed.
You're right. At large scale it would be expensive to modify every pod's spec to remove the scheduling gate. I will create another issue to track this.
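To make the trade-off concrete, here is a minimal sketch of adding and later lifting a scheduling gate. The gate name is an assumption for illustration, not necessarily the one Grove will use; in practice the removal is a per-pod API update, which is what makes it expensive at scale:

```go
// Sketch only: the gate name is an illustrative assumption.
package main

import (
	corev1 "k8s.io/api/core/v1"
)

const podGangGate = "grove.io/podgang-not-initialized"

// addSchedulingGate keeps the pod out of scheduling consideration until the
// gate is removed (the pod stays in the SchedulingGated state).
func addSchedulingGate(pod *corev1.Pod) {
	pod.Spec.SchedulingGates = append(pod.Spec.SchedulingGates,
		corev1.PodSchedulingGate{Name: podGangGate})
}

// removeSchedulingGate lifts the gate once the PodGang is Initialized; each
// pod needs its own update/patch, which is the per-pod cost discussed above.
func removeSchedulingGate(pod *corev1.Pod) {
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != podGangGate {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
}
```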
sanjaychatterjee left a comment
LGTM. Made a couple of minor suggestions to update the GREP. Thanks!
> 2. Wait for all pods to have back-references to PodGang
> 3. Create PodGang with complete PodReferences
>
> #### New Flow (With Framework):
Can you please clarify at which point in the flow the scheduler backend will create the scheduler-specific CRs for the workload, e.g. PodGroup for KAI, or Workload for kube-scheduler?
Yes, added. After the PodGang is created, the backend will create a scheduler-specific CR, as below.
1. **Create PodGang early** with PodGroups having empty PodReferences and `Initialized=False`.
2. **Backend creates scheduler-specific CRs**: The Backend Controller reconciles the new PodGang and calls `SyncPodGang()` on the resolved backend. The backend creates or updates its scheduler-specific resources (e.g. PodGroup for kai-scheduler, Workload for kube-scheduler when supported). These CRs must exist before pods are allowed to be scheduled so the scheduler can enforce gang/topology semantics.

> Abstraction layer bridging Grove and specific schedulers:
> - **Backend Manager**: Singleton that initializes and provides access to active backend
> - **KAI Backend**: Implementation for KAI scheduler (creates PodGroup CRs in future)
> - **Kube Backend**: Minimal implementation for default kube-scheduler (no custom CRs)
Would you not be creating the Workload object if GangScheduling is enabled?
If GangScheduling is enabled, that means the kube-scheduler supports gang scheduling (the Workload API), so Grove will create the Workload object to leverage the kube-scheduler's gang scheduling feature.
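For illustration, a rough sketch of the translation step described above. The `PodGang` and `PodGroup` structs below are placeholders, not the actual Grove or kai-scheduler APIs; the real backend would use the scheduler's client types:

```go
// Sketch only: placeholder types, not the real Grove/kai-scheduler APIs.
package kaibackend

import "fmt"

// PodGang is a placeholder for Grove's PodGang API type.
type PodGang struct {
	Name    string
	MinPods int32
}

// PodGroup is a placeholder for the scheduler-specific gang resource
// (e.g. kai-scheduler's PodGroup, or a Workload for kube-scheduler).
type PodGroup struct {
	Name      string
	MinMember int32
}

// SyncPodGang derives the scheduler-specific CR from the PodGang so that the
// scheduler can enforce gang semantics before the scheduling gates are lifted.
func SyncPodGang(pg *PodGang) (*PodGroup, error) {
	if pg == nil {
		return nil, fmt.Errorf("nil PodGang")
	}
	return &PodGroup{
		Name:      pg.Name,
		MinMember: pg.MinPods,
	}, nil
}
```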
> #### Layer 4: Scheduler Layer
> Kubernetes schedulers that actually place pods:
> - **KAI Scheduler**: Gang scheduling with topology awareness
We should just mention that the backend schedulers in the scheduling layer are responsible for providing supporting features like gang scheduling, topology-aware packing, gang preemption, etc. What you mention here covers only some of the features for KAI and none for the kube-scheduler.
> For detailed lifecycle flow, see [PodGang Lifecycle Changes](#podgang-lifecycle-changes).
>
> ### Backend Interface Definition
Rename this to Scheduler Backend Interface
> PreparePod(pod *corev1.Pod)
>
> // ValidatePodCliqueSet validates a PodCliqueSet for this scheduler backend.
> // Called by the PodCliqueSet validation webhook (create and update). Backends can perform
// ValidatePodCliqueSet provides an ability to the scheduler backends to run additional
// validations on the PodCliqueSet resource. For example - if a scheduler does not yet support
// topology aware placements and if the PodCliqueSet has defined required topology pack constraints
// then it can choose to reject the PodCliqueSet by returning an error.
> }
> ```
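As a concrete illustration of the semantics suggested in the comment above, here is a hedged sketch of a backend-side `ValidatePodCliqueSet`. The `PodCliqueSet` placeholder and its field are assumptions, not the real Grove API:

```go
// Sketch only: placeholder type and field, not Grove's actual PodCliqueSet API.
package schedulerbackend

import "fmt"

// PodCliqueSet is a placeholder for Grove's PodCliqueSet API type.
type PodCliqueSet struct {
	// HasRequiredTopologyPackConstraints is an assumed stand-in for whatever
	// field expresses required topology pack constraints in the real API.
	HasRequiredTopologyPackConstraints bool
}

// ValidatePodCliqueSet rejects a PodCliqueSet that asks for features this
// backend does not support yet, mirroring the example in the comment above.
func ValidatePodCliqueSet(pcs *PodCliqueSet) error {
	if pcs.HasRequiredTopologyPackConstraints {
		return fmt.Errorf("scheduler backend does not support required topology pack constraints yet")
	}
	return nil
}
```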
> **Future note:** Cluster topology (e.g. multi-cluster topology support) may require its own extension point or additional methods on this interface; the interface is expected to evolve as those needs are clarified.
This point lacks context and is therefore quite unclear.
If it is not a goal of this GREP then it should be added as a non-goal
> ### Backend Manager
>
> The manager initializes scheduler backends: the kube-scheduler backend is always created and active; additional backends are created from OperatorConfiguration profiles. It provides access by name and a default:
This is not entirely correct. It will initialize the enabled scheduler backends. This component does not assume a default as that is the job of the OperatorConfiguration. The defaulting happens there and not here.
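Along the lines of that clarification, a minimal sketch of a manager that initializes only the enabled backends and leaves defaulting to the OperatorConfiguration. Names and signatures are assumptions, not the implementation in the GREP:

```go
// Sketch only: illustrative assumptions, not the GREP's Backend Manager.
package schedulerbackend

import "fmt"

// SchedulerBackend is the (assumed) backend interface; see the earlier sketch.
type SchedulerBackend interface {
	Name() string
}

// Manager holds the scheduler backends enabled via the OperatorConfiguration.
// It does not choose a default; defaulting is done by the OperatorConfiguration.
type Manager struct {
	backends map[string]SchedulerBackend
}

// NewManager initializes only the backends that were enabled.
func NewManager(enabled []SchedulerBackend) *Manager {
	m := &Manager{backends: make(map[string]SchedulerBackend)}
	for _, b := range enabled {
		m.backends[b.Name()] = b
	}
	return m
}

// Get returns the backend registered under the given name.
func (m *Manager) Get(name string) (SchedulerBackend, error) {
	b, ok := m.backends[name]
	if !ok {
		return nil, fmt.Errorf("scheduler backend %q is not enabled", name)
	}
	return b, nil
}
```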
> 4. **Update PodGang** with PodReferences once all pods are created, and set `Initialized=True`.
> 5. **Scheduling gates removed** to allow pods to be scheduled. The scheduler uses the backend-created CRs (PodGroup/Workload) when placing pods.
>
> #### New PodGang Status Condition
Create a sub-heading under Revised PodGang Creation Flow just after it:

**Revised PodGang Creation Flow**

To understand the new PodGang creation flow, we first introduce the enhancements made to the PodGangStatus.

**PodGang API enhancements**

A new metav1.Condition has been introduced for PodGang.

```go
const (
	// PodGangConditionTypeInitialized indicates that the PodGang has been populated
	// with pod references and pods can lift scheduling gates.
	PodGangConditionTypeInitialized PodGangConditionType = "Initialized"
)
```

A PodGang is considered as Initialized when:
- All constituent `Pod`s are created.
- Pods back-reference their `PodGang` via a `grove.io/podgang` label.
- `PodGang.Spec.PodGroups` have `PodReferences` fully populated.

NOTE: Field `PodReferences` in `PodGang.Spec.PodGroups` is subject to change. If it does then this GREP will need to be updated accordingly.

**Creation Flow**

< here you define the creation flow >
> | Status | Reason | Description |
> | ------ | ------ | ----------- |
> | `True` | `AllPodsCreated` | All pods have been created and references populated |
> | `False` | `PodsNotCreated` | Waiting for all pods to be created and for all pod references to be filled in the PodGang |
Need to revisit this reason, since the description is overloaded with two different reasons.
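For illustration, a sketch of how a controller could set this condition using the standard apimachinery helper. The status struct is a placeholder, and the reasons simply mirror the quoted table (they would change if the reason is revisited as suggested):

```go
// Sketch only: placeholder status struct; reasons mirror the quoted table.
package podgang

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGangStatus is a placeholder carrying a standard conditions slice.
type PodGangStatus struct {
	Conditions []metav1.Condition
}

// setInitializedCondition records whether all pods are created and referenced.
func setInitializedCondition(status *PodGangStatus, allPodsCreated bool) {
	cond := metav1.Condition{
		Type:    "Initialized",
		Status:  metav1.ConditionFalse,
		Reason:  "PodsNotCreated",
		Message: "waiting for all pods to be created and referenced in the PodGang",
	}
	if allPodsCreated {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "AllPodsCreated"
		cond.Message = "all pods have been created and references populated"
	}
	meta.SetStatusCondition(&status.Conditions, cond)
}
```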
> Unit tests will be implemented for all framework related components:
>
> **Backend Interface and Registry** (`operator/internal/schedulerBackend/`)
This list of tests will go stale in no time. Do you have a better suggestion?
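As a flavor of the intended coverage (rather than a case-by-case list that goes stale), here is a small unit-test sketch against the illustrative Manager from the earlier sketch, not the actual registry under `operator/internal/schedulerBackend/`:

```go
// Sketch only: exercises the illustrative Manager above, not the real registry.
package schedulerbackend

import "testing"

// fakeBackend is a minimal stub satisfying the assumed SchedulerBackend interface.
type fakeBackend struct{ name string }

func (f fakeBackend) Name() string { return f.name }

func TestManagerGet(t *testing.T) {
	m := NewManager([]SchedulerBackend{fakeBackend{name: "kai"}})

	if _, err := m.Get("kai"); err != nil {
		t.Fatalf("expected kai backend to be enabled: %v", err)
	}
	if _, err := m.Get("not-enabled"); err == nil {
		t.Fatalf("expected an error for a backend that was not enabled")
	}
}
```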
> #### E2E Tests
>
> All existing e2e tests should pass against all supported schedulers.
What you miss here are the changes needed to the e2e tests, which today always assume a specific scheduler backend (KAI). Currently there is no way to configure that.
> #### Alpha
> - Core backend interface defined and implemented
> - Backend registry functional
> - Basic operator configuration support
Should you add the KAI implementation to Alpha?
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Add scheduler backend framework to support multiple scheduler backends
Which issue(s) this PR fixes:
Fixes #275
Fixes #375
Special notes for your reviewer:
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: