feat: add Topology Aware Scheduling proposal #69
Conversation
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
**galrevach** left a comment
Added a few comments throughout the file - would appreciate your feedback
> ### REQ 1 Explicit Opt-In
>
> TAS **MUST** be opt-in via a boolean `enabled` field (default `false`). Existing DGDs without topology constraints **MUST** continue to work unchanged after an operator upgrade. This avoids violating Grove's immutability rules on existing PCS resources.
As far as I know, Grove had a default behavior of setting a preferred topology constraint on all its components (PCS, PCSG, and PodClique). Does having this "enabled" flag mean this default behavior in Grove will apply only when it is set to true?
Grove does not apply any default topology constraints. The PCS defaulting webhook has no topology-related logic, and the validating webhook actually rejects any PCS that has topology constraints when TAS is disabled at the operator level. If a PCS is created without TopologyConstraint at any level (PCS/PCSG/PodClique), none are injected, and the workload simply doesn't get topology-aware scheduling.
The enabled flag in this proposal controls whether Dynamo injects topology constraints into the PCS it generates. When enabled: true, Dynamo applies smart defaults (PCS=block, PCSG=rack, etc.) and merges any user overrides before producing the PCS. When enabled: false (the default), Dynamo generates the PCS with no topology constraints at all, exactly as it does today.
So to clarify:
- Grove: Topology constraints are opt-in at the PCS spec level. No defaults are ever applied by Grove.
- Dynamo (this proposal): The enabled flag controls whether Dynamo populates those PCS fields. Smart defaults are a Dynamo-layer concept, not a Grove one.
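A minimal Go sketch of the gating described above (the types and the `resolveDeploymentConstraint` helper are illustrative stand-ins, not the actual operator code; the `"block"` default follows the proposal text and was later dropped in review, so treat it as an example only):

```go
package main

import "fmt"

// Hypothetical, simplified shapes mirroring the proposal's API sketch.
type TopologyConstraint struct {
	PackDomain string
}

type TopologyConstraints struct {
	Enabled    bool
	Deployment *TopologyConstraint
}

// resolveDeploymentConstraint returns the constraint Dynamo would inject
// into the generated PCS: nil when TAS is disabled (today's behavior),
// the user's override when one is set, or a smart default otherwise.
func resolveDeploymentConstraint(tc *TopologyConstraints) *TopologyConstraint {
	if tc == nil || !tc.Enabled {
		return nil // enabled=false: generate the PCS with no constraints at all
	}
	if tc.Deployment != nil {
		return tc.Deployment // user override wins over the default
	}
	return &TopologyConstraint{PackDomain: "block"} // illustrative smart default
}

func main() {
	fmt.Println(resolveDeploymentConstraint(nil))                                 // <nil>
	fmt.Println(resolveDeploymentConstraint(&TopologyConstraints{Enabled: true})) // &{block}
}
```

The key property is that `enabled: false` is indistinguishable from today's behavior: no constraint ever reaches the PCS.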
As far as I understand, Grove doesn't plan to add an option for Preferred constraints, and will have the option to auto-apply preferred constraints at all levels. See the discussion here: ai-dynamo/grove#368
Can we align with Grove on this?
I think Grove will keep topology packing as opt-in. However it is an open question whether that topology packing should by default result in the scheduler trying to pack everything as close as possible, or if it should just obey specified requirements, or if that in itself should be a switch. This is something we should discuss and close out soon @sanjaychatterjee
> ### REQ 3 Smart Defaults
>
> When `enabled: true` and no explicit constraints are provided, the operator **MUST** apply sensible defaults:
Applying these by default means these are hard/required constraints, correct? Aren't we concerned it will fail the workload? I think it should be considered whether Grove's default behavior of adding a "preferred" constraint is enough.
Let me add the following clarifications to this doc:
1. These will be configurable at the Dynamo operator level.
2. As soon as Grove supports "preferred", then yes, we could think of setting these defaults as preferred.

Also, users are the ones choosing whether or not they want defaults applied; if they don't want them, they can just disable the feature.
I have a similar concern; it essentially requires that:
1. Grove TAS is enabled with a label-to-domain mapping defined
2. The `block` and `rack` domains exist in the GroveClusterTopology

Also, block and rack domains vary from cluster to cluster, and optimal workload topology placement will vary from workload to workload. This feels like something difficult to have a good default for.
Enabling cluster-wide re-mapping of the defaults does help circumvent 2) above, but my concern still remains. Would like to hear @sanjaychatterjee @nvrohanv thoughts.
The more I think about it the more I lean towards not setting any defaults for now. For an opt-in feature, defaults feel more like a nice to have than a requirement so I think we should err on the side of making sure our defaults don't mess anything up. Once Grove has support for auto-packing everything I think we could make that the default.
Ok, let me remove it from the proposal.
> ## Validation (Webhook)
>
> - **Enabled flag:** If `enabled=false` and constraints are specified → reject.
> - **Struct:** `PackDomain` **MUST** be set and be one of: region, zone, datacenter, block, rack, host, numa.
How are these values defined? Does it depend on Grove's cluster topology being defined with these exact levels?
Correct, Dynamo lets users set these well-known topology domains, but under the hood a topology with these domains must exist in the cluster.
Are these the values in Grove's cluster topology? Doesn't that mean it is not agnostic to whatever topology is being defined?
Good point, the hardcoded domain list is indeed Grove's vocabulary, which contradicts our framework agnosticism goal.
To address this, the allowed domain list will be configurable at the Dynamo operator level (via Helm values), not hardcoded in the webhook. For Grove, the Helm chart ships with Grove's standard domains as defaults.
For a different framework (e.g., Kueue), the admin overrides allowedDomains to match that framework's supported domain names.
This gives us early validation at the DGD webhook level (catching typos before they hit the framework) while keeping Dynamo decoupled from any specific framework's domain vocabulary. Updated the proposal accordingly.
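To make the early validation concrete, here's a minimal Go sketch of a webhook-style check against an operator-configured allow-list (names like `validatePackDomain` and the way the list is loaded are illustrative assumptions, not the actual webhook implementation):

```go
package main

import "fmt"

// validatePackDomain checks a user-supplied packDomain against the
// operator-configured allow-list (e.g., loaded from Helm values).
func validatePackDomain(domain string, allowed []string) error {
	for _, d := range allowed {
		if d == domain {
			return nil
		}
	}
	return fmt.Errorf("packDomain %q is not one of the allowed domains %v", domain, allowed)
}

func main() {
	// Grove's standard domains, shipped as the chart default per the proposal.
	groveDefaults := []string{"region", "zone", "datacenter", "block", "rack", "host", "numa"}
	fmt.Println(validatePackDomain("rack", groveDefaults))           // <nil>
	fmt.Println(validatePackDomain("cabinet", groveDefaults) != nil) // true
}
```

For a Kueue-style backend, only the slice passed in would change; the check itself stays framework-neutral.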
> # Topology Aware Scheduling Integration for Dynamo Operator
I couldn't find in the design a reference to supporting multiple topologies.
In Run:ai, we support multiple topologies for multiple node pools in the same cluster (e.g. H100 vs. GB200). Therefore, once a Dynamo workload is submitted through Run:ai, we want to provide the required topology definition based on the node pool on which it runs.
The issue for Grove: ai-dynamo/grove#369
Actually, I removed the multi-topology requirement for now; it will be introduced later.
So the plan is to wait for Grove to complete the implementation and then add it here?
> **User example (with named topology):**
>
> ```yaml
> spec:
> ```
Instead of having a top-level serviceOverrides field, did we consider having topologyConstraint at the service level? This is similar to Grove's formatting and I think more consistent with our field conventions:

```yaml
spec:
  topologyConstraints:
    enabled: true
    topologyName: aws-p5   # optional; omit for framework default
    packDomain: zone
  services:
    - name: VllmWorker
      multinode: { nodeCount: 4 }
      replicas: 2
      topologyConstraint:
        packDomain: block
    - name: Frontend
      replicas: 1
      topologyConstraint:
        packDomain: zone
```
> Service-level constraints **MUST** be equal to or narrower than the deployment-level constraint. The operator **MUST** reject a DGD where a service constraint is broader than the deployment constraint.
>
> ### REQ 6 Framework Agnosticism
This might be difficult to achieve. In the current proposal we are tightly coupled to Grove's topologyDomain list.
This is an example of TAS enablement for LWS with Kueue where the required topology is set as labels - there is no intermediate CR to convert from domains to labels. I think the current proposal is fine - but just wanted to point this out.
To add to this: in Run:ai we allow the admin to configure the network topology, which is a list of labels. Then users can request constraints based on the topology labels defined for this cluster.
Good observation. You're right that the domain names (rack, block, etc.) come from Grove's vocabulary today. However, abstract domain names seem to be the right user-facing API: users should say "rack", not topology.kubernetes.io/rack.
If Dynamo later supports a different backend (e.g., Kueue/LWS), the translation from abstract domains to framework-specific values (node labels for Kueue, domain names for Grove) can be done via operator configuration (configmap, another CRD, ...) with domain-to-label mappings per framework. The DGD API would stay the same and only the translation layer would change.
For now, since Dynamo only generates Grove PCS, this works as-is. We can add a note in the open questions/future work section about generalizing the domain vocabulary for non-Grove backends.
WDYT?
Ya, I think this makes sense @julienmancuso; for now we can't really avoid some coupling, but in the future we can do translation at the operator level like you said.
I want to add that for KAI/Run:ai the topology CRD is also without domain names, as in Kueue.
If we use these specific domain names, I can't see how this fits the option for the user to provide the topology name.
The topology name is what defines which topology levels can be selected as constraints. If the list of topology levels is a closed list of domain names (rack, block, etc.), how do we plan this to work?
The translation from a Dynamo domain to framework-specific values can be done via a domain-to-label mapping per framework.
I added that in 453f2af.
I think using the abstract vocabulary makes it specific, because only Grove knows how to translate from these domain names to the relevant node labels; the more generic approach is using the labels. I do understand why we want to abstract the labels for users, but I'm not sure how this solution makes it generic.
Using labels directly would actually make the API less generic: labels are cluster-specific (e.g., cloud.provider.com/topology-rack vs topology.kubernetes.io/rack), so DGDs wouldn't be portable across clusters. It also wouldn't work with Grove, since Grove's PackDomain takes domain names, not labels; we'd need to reverse-lookup labels to domain names, which is fragile.
The abstract domains (rack, block, zone) are Dynamo's own vocabulary; they're generic at the user-facing API level. Users express intent in natural terms without needing to know cluster-specific labels. For Grove, these map 1:1 with no translation. For future backends like Kueue that use labels, the operator can translate internally via a domainToLabelMapping config; the DGD API stays the same. The genericness is at the API layer; the framework-specific translation is an internal operator concern.
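A tiny Go sketch of that translation layer (the `kueueDomainToLabel` mapping and its label values are made up for illustration; the real mapping would come from operator configuration):

```go
package main

import "fmt"

// Hypothetical per-framework mapping, as described above: Dynamo's
// abstract domain names on the left, framework/cluster-specific node
// labels on the right. Values here are illustrative only.
var kueueDomainToLabel = map[string]string{
	"zone": "topology.kubernetes.io/zone",
	"rack": "example.com/topology-rack",
}

// translateDomain resolves an abstract domain for a label-based backend.
// For Grove no translation is needed: PackDomain takes the domain name as-is.
func translateDomain(domain string, mapping map[string]string) (string, bool) {
	label, ok := mapping[domain]
	return label, ok
}

func main() {
	label, ok := translateDomain("rack", kueueDomainToLabel)
	fmt.Println(label, ok) // example.com/topology-rack true
}
```

The DGD spec only ever carries the abstract name; swapping backends swaps the mapping, not the API.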
> ## Validation (Webhook)
>
> - **Enabled flag:** If `enabled=false` and (`topologyName` set, or constraints specified) → reject.
Might want to break this out into CREATE vs. UPDATE. My understanding is that topologyConstraint at the Grove level is immutable and therefore we would also constrain it as immutable.
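A sketch of what that UPDATE-time check could look like in Go (types, the `validateUpdate` name, and the error text are hypothetical; this is just the shape of the immutability rule, not the actual webhook code):

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// Simplified stand-in for the proposal's constraint type.
type TopologyConstraint struct {
	PackDomain string
}

// validateUpdate rejects any change to the DGD-level constraint on UPDATE,
// mirroring Grove's immutability of topologyConstraint on the PCS.
// CREATE-time checks (allowed domains, enabled flag) would live elsewhere.
func validateUpdate(oldTC, newTC *TopologyConstraint) error {
	if !reflect.DeepEqual(oldTC, newTC) {
		return errors.New("spec.topologyConstraint is immutable once set")
	}
	return nil
}

func main() {
	a := &TopologyConstraint{PackDomain: "rack"}
	fmt.Println(validateUpdate(a, &TopologyConstraint{PackDomain: "rack"})) // <nil>
	fmt.Println(validateUpdate(a, &TopologyConstraint{PackDomain: "zone"})) // non-nil error
}
```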
**nvrohanv** left a comment
Overall looks good; I think we just need to settle on the defaults, and on whether we are going to have defaults at all.
> * Allow DGD users to opt in to topology-aware scheduling via a simple, framework-agnostic API.
>
> * Provide smart defaults (PCS=block, PCSG=rack) so users do not have to configure every level.
I'm worried that large deployments will exceed a block, and that we should instead default to something like zone.
Defaults are configurable; users can either change the default or override the constraint in their DGD directly.
> - Multinode services (PCSG): `packDomain: rack`; child cliques inherit from PCSG (no explicit constraint)
> - Single-node services (PodClique): `packDomain: rack`
>
> Both the smart defaults and the list of allowed topology domain names **MUST** be configurable at the Dynamo operator level (e.g., via Helm values or operator configuration), so that cluster administrators can tune them per cluster without requiring users to modify their DGDs. For Grove, the Helm defaults ship with `[region, zone, datacenter, block, rack, host, numa]`; for other frameworks, the admin overrides the list to match the supported domains.
What does this mean exactly? That the Dynamo admin will also need to define the levels of topology (e.g. rack, block...)? I think it can be confusing for admins to define the topology structure (CRD) in Grove, Dynamo, and Run:ai (in cases where it is used).
This is why we suggested allowing the user to add the topology name to the workload spec. This name lets Grove and the scheduler know which topology CRD to use, and the requested constraints will be based on that specific topology CRD.
Co-authored-by: Rohan Varma <rohanv@nvidia.com> Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
> ```go
> Enabled bool `json:"enabled"`
>
> // Deployment-level constraint. Required when enabled=true.
> Deployment *TopologyConstraint `json:"deployment,omitempty"`
> ```
Since this is already in a deployment context, should we rename this field to topologyConstraints for better clarity?
> ```go
> type TopologyConstraints struct {
> 	// Enabled controls whether TAS is applied. Default: false.
> 	Enabled bool `json:"enabled"`
> ```
Is the Enabled flag necessary? Couldn't we just infer that TAS is enabled if the constraints are provided?
> ## Status and Observability
>
> - **Status:** Add `AppliedTopologyConstraints` to DGD status reporting the applied constraints when `enabled=true`.
Currently, we allow changes to the Cluster Topology CR even if it's in use. If a user requests a constraint that no longer exists, enforcement stops. Shouldn't Dynamo propagate this scenario in the status so the user knows why the constraints are no longer being enforced?
Good point, this is a real scenario. However, looks like this is a grove-level responsibility: if the ClusterTopology is mutated in a way that breaks active PCS constraints, Grove should report that on the PCS status (e.g., a degraded condition). Dynamo already propagates PCS status/conditions to DGD status, so any issue Grove surfaces on the PCS will be visible at the DGD level. Having Dynamo independently watch the ClusterTopology CR for drift would tightly couple the controller to Grove internals and duplicate responsibility. If Grove doesn't currently surface this, it might be worth raising as a Grove enhancement.
Dynamo doesn't need to watch ClusterTopology for drift. Grove handles this by reporting a `TopologyLevelsUnavailable` condition on the PCS whenever a required topology level is removed.
Ok, yes, that makes sense; I updated the doc.
> * Allow DGD users to opt in to topology-aware scheduling via a simple, framework-agnostic API.
>
> * Require explicit deployment-level and per-service topology constraints when TAS is enabled (no defaults).
> * Require explicit deployment-level and per-service topology constraints when TAS is used (no defaults). TAS is in effect when `spec.topologyConstraint` is set.
If a user sets constraints only at the service level, does that mean the feature is enabled, or is it always required to set `spec.topologyConstraint` to enable the feature?
The full setup should be there (deployment + all services); otherwise we would reject it.
Could you clarify the rationale behind making this a strict requirement?
Which, I realized, is not correct. I'm going to relax the constraint: the feature will be enabled when at least one of the deployment or any service has a topology constraint set.
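The relaxed rule can be sketched in Go as follows (simplified, hypothetical types; `tasEnabled` is an illustrative helper, not the actual operator code):

```go
package main

import "fmt"

// Hypothetical, simplified shapes for the DGD topology fields.
type TopologyConstraint struct {
	PackDomain string
}

type TopologyConstraints struct {
	Deployment *TopologyConstraint
	Services   map[string]*TopologyConstraint // keyed by service name
}

// tasEnabled implements the relaxed rule described above: TAS is in
// effect as soon as the deployment level or any single service carries
// a constraint; a full deployment-plus-all-services setup is not required.
func tasEnabled(tc *TopologyConstraints) bool {
	if tc == nil {
		return false
	}
	if tc.Deployment != nil {
		return true
	}
	for _, c := range tc.Services {
		if c != nil {
			return true
		}
	}
	return false
}

func main() {
	serviceOnly := &TopologyConstraints{
		Services: map[string]*TopologyConstraint{"VllmWorker": {PackDomain: "block"}},
	}
	fmt.Println(tasEnabled(nil), tasEnabled(serviceOnly)) // false true
}
```

Under this rule, a service-only constraint (the case asked about above) is enough to turn the feature on.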