feat: add Topology Aware Scheduling proposal#69

Open
julienmancuso wants to merge 9 commits into `main` from `jsm/tas`

Conversation

@julienmancuso

No description provided.

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

@galrevach galrevach left a comment


Added a few comments throughout the file - would appreciate your feedback


### REQ 1 Explicit Opt-In

TAS **MUST** be opt-in via a boolean `enabled` field (default `false`). Existing DGDs without topology constraints **MUST** continue to work unchanged after an operator upgrade. This avoids violating Grove's immutability rules on existing PCS resources.


As far as I know, Grove had a default behavior of setting a preferred topology constraint on all its components (PCS, PCSG, and PodClique). Does having this "enabled" flag mean this default behavior in Grove will be applied only when it is set to true?

@julienmancuso (Author):


Grove does not apply any default topology constraints. The PCS defaulting webhook has no topology-related logic, and the validating webhook actually rejects any PCS that has topology constraints when TAS is disabled at the operator level. If a PCS is created without a TopologyConstraint at any level (PCS/PCSG/PodClique), none are injected and the workload simply doesn't get topology-aware scheduling.
The enabled flag in this proposal controls whether Dynamo injects topology constraints into the PCS it generates. When enabled: true, Dynamo applies smart defaults (PCS=block, PCSG=rack, etc.) and merges any user overrides before producing the PCS. When enabled: false (the default), Dynamo generates the PCS with no topology constraints at all, exactly as it does today.

So to clarify:

  • Grove: Topology constraints are opt-in at the PCS spec level. No defaults are ever applied by Grove.
  • Dynamo (this proposal): The enabled flag controls whether Dynamo populates those PCS fields. Smart defaults are a Dynamo-layer concept, not a Grove one.
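To make the two modes concrete, here is a minimal DGD sketch. Field names follow this proposal's discussion; the apiVersion/kind spelling is an assumption for illustration only:

```yaml
# TAS opted in: Dynamo injects constraints (smart defaults merged with any
# user overrides) into the PCS it generates.
apiVersion: nvidia.com/v1alpha1   # hypothetical group/version
kind: DynamoGraphDeployment
spec:
  topologyConstraints:
    enabled: true
    packDomain: block        # deployment-level constraint
---
# Default: enabled is false, so no topology constraints are injected and
# the PCS is generated exactly as it is today.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  topologyConstraints:
    enabled: false
```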


As far as I understand, Grove doesn't plan to add an option for Preferred constraints, and will instead have the option to auto-apply preferred constraints at all levels. See the discussion here: ai-dynamo/grove#368
Can we align with Grove on this?


I think Grove will keep topology packing as opt-in. However, it is an open question whether that topology packing should by default result in the scheduler trying to pack everything as close as possible, or whether it should just obey specified requirements, or whether that in itself should be a switch. This is something we should discuss and close out soon @sanjaychatterjee


Was a decision made here?


### REQ 3 Smart Defaults

When `enabled: true` and no explicit constraints are provided, the operator **MUST** apply sensible defaults:


Applying these by default means these are hard/required constraints, correct? Aren't we concerned this will fail the workload? I think we should consider whether Grove's default behavior of adding a "preferred" constraint is enough.

@julienmancuso (Author):


let me add the following clarifications to this doc:
1. These will be configurable at the Dynamo operator level (let me add this to the doc).
2. As soon as Grove supports "preferred", then yes, we could think of setting these defaults as preferred.

Also, users are the ones choosing whether or not they want defaults applied; if they don't, they can just disable the feature.

@julienmancuso (Author):


updated in 8f68ff0


I have a similar concern, it essentially requires that:

  1. Grove TAS is enabled with label to domain mapping defined
  2. That the block and rack domains exist in the Grove ClusterTopology

Also, block and rack domains vary from cluster to cluster, and optimal workload topology placement will vary from workload to workload. This feels like something difficult to have a good default for.

Enabling cluster wide re-mapping of the defaults does help circumvent 2) above but my concern still remains. Would like to hear @sanjaychatterjee @nvrohanv thoughts.


The more I think about it the more I lean towards not setting any defaults for now. For an opt-in feature, defaults feel more like a nice to have than a requirement so I think we should err on the side of making sure our defaults don't mess anything up. Once Grove has support for auto-packing everything I think we could make that the default.

@julienmancuso (Author):


ok let me remove it from proposal

## Validation (Webhook)

- **Enabled flag:** If `enabled=false` and constraints are specified → reject.
- **Struct:** `PackDomain` **MUST** be set and be one of: region, zone, datacenter, block, rack, host, numa.
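As a rough sketch only (not the actual webhook code), this rule could look like the following in Go; the real operator would source the allowed list from configuration rather than a hardcoded slice, and the function name here is illustrative:

```go
package main

import "fmt"

// allowedPackDomains is the Grove default vocabulary from the proposal.
// In the real operator this list would come from operator configuration.
var allowedPackDomains = []string{"region", "zone", "datacenter", "block", "rack", "host", "numa"}

// validatePackDomain mirrors the webhook rule: PackDomain must be set
// and must be one of the allowed domain names.
func validatePackDomain(domain string) error {
	if domain == "" {
		return fmt.Errorf("packDomain must be set")
	}
	for _, d := range allowedPackDomains {
		if d == domain {
			return nil
		}
	}
	return fmt.Errorf("packDomain %q is not one of %v", domain, allowedPackDomains)
}

func main() {
	fmt.Println(validatePackDomain("rack")) // valid domain: nil error
	fmt.Println(validatePackDomain("row"))  // unknown domain: error
}
```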


how are these values defined? does it depend on Grove's cluster topology to be defined with these exact levels?

@julienmancuso (Author):


correct, Dynamo lets users set these well-known topology domains, but under the hood there must exist a topology in the cluster defined with these domains.


Are these the values in Grove's cluster topology? Doesn't that mean it is not agnostic to whatever topology is being defined?

@julienmancuso (Author):


Good point, the hardcoded domain list is indeed Grove's vocabulary, which contradicts our framework agnosticism goal.
To address this, the allowed domain list will be configurable at the Dynamo operator level (via Helm values), not hardcoded in the webhook. For Grove, the Helm chart ships with Grove's standard domains as defaults.
For a different framework (e.g., Kueue), the admin overrides allowedDomains to match that framework's supported domain names.
This gives us early validation at the DGD webhook level (catching typos before they hit the framework) while keeping Dynamo decoupled from any specific framework's domain vocabulary. Updated the proposal accordingly.

@julienmancuso (Author):


updated in 22335b8

@@ -0,0 +1,256 @@
# Topology Aware Scheduling Integration for Dynamo Operator


I couldn't find a reference in the design to supporting multiple topologies.
In Run:ai, we support multiple topologies for multiple node pools in the same cluster (e.g., H100 vs. GB200). Therefore, once a Dynamo workload is submitted through Run:ai, we want to provide the required topology definition based on the node pool on which it runs.
The issue for Grove: ai-dynamo/grove#369

@julienmancuso (Author):


updated in 8f68ff0

@julienmancuso (Author):


actually I removed the multi-topology req for now; it will be introduced later


so the plan is to wait for Grove to complete the implementation and then add it here?

@julienmancuso (Author):


correct

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
**User example (with named topology):**

```yaml
spec:
```

Instead of having a top-level serviceOverrides field, did we consider having topologyConstraint at the service level? This is similar to Grove's formatting and, I think, more consistent with our field conventions:

spec:
  topologyConstraints:
    enabled: true
    topologyName: aws-p5   # optional; omit for framework default
    packDomain: zone
  services:
    - name: VllmWorker
      multinode: { nodeCount: 4 }
      replicas: 2
      topologyConstraint:
        packDomain: block
    - name: Frontend
      replicas: 1
      topologyConstraint:
        packDomain: zone

@julienmancuso (Author):


updated in 22335b8


Service-level constraints **MUST** be equal to or narrower than the deployment-level constraint. The operator **MUST** reject a DGD where a service constraint is broader than the deployment constraint.
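This narrower-than rule can be sketched as an ordered comparison over the domain hierarchy. The exact ordering below is an assumption for illustration (the proposal does not pin it down), and the function names are hypothetical:

```go
package main

import "fmt"

// domainHierarchy orders the domains from broadest to narrowest.
// This ordering is an illustrative assumption, not part of the proposal.
var domainHierarchy = []string{"region", "zone", "datacenter", "block", "rack", "host", "numa"}

// rank returns the position of a domain in the hierarchy, or -1 if unknown.
func rank(domain string) int {
	for i, d := range domainHierarchy {
		if d == domain {
			return i
		}
	}
	return -1
}

// serviceConstraintValid returns true when the service-level domain is
// equal to or narrower than the deployment-level domain; the webhook
// would reject the DGD otherwise.
func serviceConstraintValid(deploymentDomain, serviceDomain string) bool {
	dr, sr := rank(deploymentDomain), rank(serviceDomain)
	if dr < 0 || sr < 0 {
		return false // unknown domains are rejected elsewhere
	}
	return sr >= dr
}

func main() {
	fmt.Println(serviceConstraintValid("zone", "block")) // block is narrower than zone
	fmt.Println(serviceConstraintValid("rack", "zone"))  // zone is broader than rack
}
```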

### REQ 6 Framework Agnosticism


This might be difficult to achieve. In the current proposal we are tightly coupled to Grove's topologyDomain list.

This is an example of TAS enablement for LWS with Kueue where the required topology is set as labels - there is no intermediate CR to convert from domains to labels. I think the current proposal is fine - but just wanted to point this out.


To add to this, in Run:ai we allow the admin to configure the network topology, which is a list of labels. Then users can request constraints based on the topology labels defined for the cluster.

@julienmancuso (Author):


Good observation. You're right that the domain names (rack, block, etc.) come from Grove's vocabulary today. However, abstract domain names seem to be the right user-facing API: users should say "rack", not topology.kubernetes.io/rack.
If Dynamo later supports a different backend (e.g., Kueue/LWS), the translation from abstract domains to framework-specific values (node labels for Kueue, domain names for Grove) can be done via operator configuration (configmap, another CRD, ...) with domain-to-label mappings per framework. The DGD API would stay the same and only the translation layer would change.
For now, since Dynamo only generates Grove PCS, this works as-is. We can add a note in the open questions/future work section about generalizing the domain vocabulary for non-Grove backends.

wdyt ?

@julienmancuso (Author):


updated in 22335b8


Ya, I think this makes sense @julienmancuso; for now we can't really avoid some coupling, but in the future we can do translation at the operator level like you said.


I want to add that for KAI/Run:ai the topology CRD is also without domain names, as in Kueue.

If we use these specific domain names, I can't see how this fits the option for the user to provide the topology name.
The topology name is what defines which topology levels can be selected as constraints. If the list of topology levels is a closed list of domain names (rack, block, etc.), how do we plan this to work?

@julienmancuso (Author):


The translation from Dynamo domains to framework-specific values can be done via domain-to-label mappings per framework.

I added that in 453f2af
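A sketch of what such a per-framework mapping could look like as operator configuration; the ConfigMap name, keys, and label names here are illustrative assumptions, not part of the proposal:

```yaml
# Hypothetical operator-level mapping from Dynamo's abstract domains to
# framework-specific values. Grove consumes domain names directly; a
# label-based backend (e.g. Kueue) would use node labels instead.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dynamo-topology-domain-mapping   # illustrative name
data:
  grove: |
    zone: zone      # 1:1 — Grove's PackDomain takes domain names
    block: block
    rack: rack
  kueue: |
    zone: topology.kubernetes.io/zone    # label names vary per cluster
    rack: topology.kubernetes.io/rack
    block: example.com/topology-block    # placeholder label
```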


I think using the abstract vocabulary makes it specific, because only Grove knows how to translate these domain names to the relevant node labels. The more generic approach is using the labels. I do understand why we want to abstract the labels away from users, but I'm not sure how this solution makes it generic.

@julienmancuso (Author):


Using labels directly would actually make the API less generic: labels are cluster-specific (e.g., cloud.provider.com/topology-rack vs topology.kubernetes.io/rack), so DGDs wouldn't be portable across clusters. It also wouldn't work with Grove: since Grove's PackDomain takes domain names, not labels, we'd need to reverse-lookup labels to domain names, which is fragile.

The abstract domains (rack, block, zone) are Dynamo's own vocabulary; they're generic at the user-facing API level. Users express intent in natural terms without needing to know cluster-specific labels. For Grove, these map 1:1 with no translation. For future backends like Kueue that use labels, the operator can translate internally via a domainToLabelMapping config; the DGD API stays the same. The genericness is at the API layer; the framework-specific translation is an internal operator concern.


## Validation (Webhook)

- **Enabled flag:** If `enabled=false` and (`topologyName` set, or constraints specified) → reject.


Might want to break this out into CREATE vs. UPDATE. My understanding is that topologyConstraint at the Grove level is immutable and therefore we would also constrain it as immutable.

@julienmancuso (Author):


updated in 22335b8



Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Copy link

@nvrohanv nvrohanv left a comment


Overall looks good I think we just need to settle on the defaults/if we are going to have defaults.


* Allow DGD users to opt in to topology-aware scheduling via a simple, framework-agnostic API.

* Provide smart defaults (PCS=block, PCSG=rack) so users do not have to configure every level.


I'm worried that large deployments will exceed a block; maybe we should instead default to something like zone.

@julienmancuso (Author):


Defaults are configurable; users can either change the default or override the constraint in their DGD directly.







- Multinode services (PCSG): `packDomain: rack`; child cliques inherit from PCSG (no explicit constraint)
- Single-node services (PodClique): `packDomain: rack`

Both the smart defaults and the list of allowed topology domain names **MUST** be configurable at the Dynamo operator level (e.g., via Helm values or operator configuration), so that cluster administrators can tune them per cluster without requiring users to modify their DGDs. For Grove, the Helm defaults ship with `[region, zone, datacenter, block, rack, host, numa]`; for other frameworks, the admin overrides the list to match the supported domains.
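As a sketch, the operator's Helm values could expose both knobs roughly like this; the key names are illustrative assumptions, since the proposal does not fix an exact schema:

```yaml
# values.yaml (illustrative keys, not the final chart schema)
topologyAwareScheduling:
  # Domains the DGD webhook accepts in packDomain; the admin overrides
  # this list for non-Grove frameworks.
  allowedDomains: [region, zone, datacenter, block, rack, host, numa]
  # Smart defaults applied when enabled: true and no explicit constraints.
  defaults:
    deployment: block        # PCS-level
    multinodeService: rack   # PCSG-level; child cliques inherit
    singleNodeService: rack  # PodClique-level
```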


What does this mean exactly? That the Dynamo admin will also need to define the levels of topology (e.g. rack, block, ...)? I think it can be confusing for admins to define the topology structure (CRD) in Grove, Dynamo, and Run:ai (in cases where it is used).
This is why we suggested allowing the user to add the topology name on the workload spec. This name lets Grove and the scheduler know which topology CRD to use, and the requested constraints will be based on that specific topology CRD.



julienmancuso and others added 3 commits February 12, 2026 07:58
Co-authored-by: Rohan Varma <rohanv@nvidia.com>
Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Enabled bool `json:"enabled"`

// Deployment-level constraint. Required when enabled=true.
Deployment *TopologyConstraint `json:"deployment,omitempty"`


Since this is already in a deployment context, should we rename this field to topologyConstraints for better clarity?

@julienmancuso (Author):


fixed

```go
type TopologyConstraints struct {
	// Enabled controls whether TAS is applied. Default: false.
	Enabled bool `json:"enabled"`
	// …
}
```


Is the Enabled flag necessary? Couldn't we just infer that TAS is enabled if the constraints are provided?

@julienmancuso (Author):


fixed


## Status and Observability

- **Status:** Add `AppliedTopologyConstraints` to DGD status reporting the applied constraints when `enabled=true`.


Currently, we allow changes to the Cluster Topology CR even if it's in use. If a user requests a constraint that no longer exists, enforcement stops. Shouldn't Dynamo propagate this scenario in the status so the user knows why the constraints are no longer being enforced?

@julienmancuso (Author):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this is a real scenario. However, looks like this is a grove-level responsibility: if the ClusterTopology is mutated in a way that breaks active PCS constraints, Grove should report that on the PCS status (e.g., a degraded condition). Dynamo already propagates PCS status/conditions to DGD status, so any issue Grove surfaces on the PCS will be visible at the DGD level. Having Dynamo independently watch the ClusterTopology CR for drift would tightly couple the controller to Grove internals and duplicate responsibility. If Grove doesn't currently surface this, it might be worth raising as a Grove enhancement.


Dynamo doesn't need to watch ClusterTopology for drift. Grove handles this by reporting a TopologyLevelsUnavailable condition on the PCS whenever a required topology level is removed
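For illustration, the propagated signal might look roughly like this on the PCS status. The condition type comes from the comment above; the reason and message wording are assumptions, not Grove's actual output:

```yaml
# Hypothetical shape of the condition Grove reports on the PCS when a
# required topology level disappears from the ClusterTopology; Dynamo
# already propagates PCS conditions like this to the DGD status.
status:
  conditions:
    - type: TopologyLevelsUnavailable
      status: "True"
      reason: TopologyLevelRemoved          # illustrative reason
      message: 'required topology level "rack" no longer exists in the ClusterTopology'
```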

@julienmancuso (Author):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, yes that makes sense, I updated the doc

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
* Allow DGD users to opt in to topology-aware scheduling via a simple, framework-agnostic API.

* Require explicit deployment-level and per-service topology constraints when TAS is enabled (no defaults).
* Require explicit deployment-level and per-service topology constraints when TAS is used (no defaults). TAS is in effect when `spec.topologyConstraint` is set.

@Ronkahn21 Ronkahn21 Feb 17, 2026


If a user sets constraints only at the service level, does that mean the feature is enabled,
or is it always required to set spec.topologyConstraint to enable the feature?

@julienmancuso (Author):


the full setup should be there (deployment + all services), otherwise we would reject it


Could you clarify the rationale behind making this a strict requirement?

@julienmancuso (Author):


which I realized is not correct. I'm going to relax the constraint: the feature will be enabled when either the deployment or any service has a topology constraint set.

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>