feat: add Topology Aware Scheduling proposal #69
Conversation
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
**galrevach** left a comment
Added a few comments throughout the file - would appreciate your feedback
> ### REQ 1 Explicit Opt-In
>
> TAS **MUST** be opt-in via a boolean `enabled` field (default `false`). Existing DGDs without topology constraints **MUST** continue to work unchanged after an operator upgrade. This avoids violating Grove's immutability rules on existing PCS resources.
As far as I know, Grove had a default behavior of setting a preferred topology constraint on all its components (PCS, PCSG, and PodClique). Does having this "enabled" flag mean this default behavior in Grove will apply only when it is set to true?
Grove does not apply any default topology constraints. The PCS defaulting webhook has no topology-related logic, and the validating webhook actually rejects any PCS that has topology constraints when TAS is disabled at the operator level. If a PCS is created without TopologyConstraint at any level (PCS/PCSG/PodClique), none are injected, and the workload simply doesn't get topology-aware scheduling.
The enabled flag in this proposal controls whether Dynamo injects topology constraints into the PCS it generates. When enabled: true, Dynamo applies smart defaults (PCS=block, PCSG=rack, etc.) and merges any user overrides before producing the PCS. When enabled: false (the default), Dynamo generates the PCS with no topology constraints at all, exactly as it does today.
So to clarify:
- Grove: Topology constraints are opt-in at the PCS spec level. No defaults are ever applied by Grove.
- Dynamo (this proposal): The enabled flag controls whether Dynamo populates those PCS fields. Smart defaults are a Dynamo-layer concept, not a Grove one.
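A minimal Go sketch of the gating described above (the types and the `resolveDeploymentConstraint` helper are illustrative stand-ins, not the actual operator code; the `"block"` default follows the proposal text and was later dropped in review, so treat it as an example only):

```go
package main

import "fmt"

// Hypothetical, simplified shapes mirroring the proposal's API sketch.
type TopologyConstraint struct {
	PackDomain string
}

type TopologyConstraints struct {
	Enabled    bool
	Deployment *TopologyConstraint
}

// resolveDeploymentConstraint returns the constraint Dynamo would inject
// into the generated PCS: nil when TAS is disabled (today's behavior),
// the user's override when one is set, or a smart default otherwise.
func resolveDeploymentConstraint(tc *TopologyConstraints) *TopologyConstraint {
	if tc == nil || !tc.Enabled {
		return nil // enabled=false: generate the PCS with no constraints at all
	}
	if tc.Deployment != nil {
		return tc.Deployment // user override wins over the default
	}
	return &TopologyConstraint{PackDomain: "block"} // illustrative smart default
}

func main() {
	fmt.Println(resolveDeploymentConstraint(nil))                                 // <nil>
	fmt.Println(resolveDeploymentConstraint(&TopologyConstraints{Enabled: true})) // &{block}
}
```

The key property is that `enabled: false` is indistinguishable from today's behavior: no constraint ever reaches the PCS.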
As far as I understand, Grove doesn't plan to add an option for Preferred constraints, and will have the option to auto-apply preferred constraints at all levels. See the discussion here: ai-dynamo/grove#368
Can we align with Grove on this?
I think Grove will keep topology packing as opt-in. However it is an open question whether that topology packing should by default result in the scheduler trying to pack everything as close as possible, or if it should just obey specified requirements, or if that in itself should be a switch. This is something we should discuss and close out soon @sanjaychatterjee
> ### REQ 3 Smart Defaults
>
> When `enabled: true` and no explicit constraints are provided, the operator **MUST** apply sensible defaults:
Applying these by default means these are hard/required constraints, correct? Aren't we concerned it will fail the workload? I think it should be considered whether Grove's default behavior of adding a "preferred" constraint is enough.
Let me add the following clarifications to this doc:
1. These will be configurable at the Dynamo operator level.
2. As soon as Grove supports "preferred", then yes, we could think of setting these defaults as preferred.

Also, users are the ones choosing whether or not they want defaults applied; if they don't want them, they can just disable the feature.
I have a similar concern; it essentially requires that:
1. Grove TAS is enabled with a label-to-domain mapping defined
2. The `block` and `rack` domains exist in the GroveClusterTopology

Also, block and rack domains vary from cluster to cluster, and optimal workload topology placement will vary from workload to workload. This feels like something difficult to have a good default for.
Enabling cluster-wide re-mapping of the defaults does help circumvent 2) above, but my concern still remains. Would like to hear @sanjaychatterjee @nvrohanv thoughts.
The more I think about it the more I lean towards not setting any defaults for now. For an opt-in feature, defaults feel more like a nice to have than a requirement so I think we should err on the side of making sure our defaults don't mess anything up. Once Grove has support for auto-packing everything I think we could make that the default.
Ok, let me remove it from the proposal.
> ## Validation (Webhook)
>
> - **Enabled flag:** If `enabled=false` and constraints are specified → reject.
> - **Struct:** `PackDomain` **MUST** be set and be one of: region, zone, datacenter, block, rack, host, numa.
How are these values defined? Does it depend on Grove's cluster topology being defined with these exact levels?
Correct, Dynamo lets users set these well-known topology domains, but under the hood a topology with these domains must exist in the cluster.
Are these the values in Grove's cluster topology? Doesn't that mean it is not agnostic to whatever topology is being defined?
Good point, the hardcoded domain list is indeed Grove's vocabulary, which contradicts our framework agnosticism goal.
To address this, the allowed domain list will be configurable at the Dynamo operator level (via Helm values), not hardcoded in the webhook. For Grove, the Helm chart ships with Grove's standard domains as defaults.
For a different framework (e.g., Kueue), the admin overrides allowedDomains to match that framework's supported domain names.
This gives us early validation at the DGD webhook level (catching typos before they hit the framework) while keeping Dynamo decoupled from any specific framework's domain vocabulary. Updated the proposal accordingly.
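To make the early validation concrete, here's a minimal Go sketch of a webhook-style check against an operator-configured allow-list (names like `validatePackDomain` and the way the list is loaded are illustrative assumptions, not the actual webhook implementation):

```go
package main

import "fmt"

// validatePackDomain checks a user-supplied packDomain against the
// operator-configured allow-list (e.g., loaded from Helm values).
func validatePackDomain(domain string, allowed []string) error {
	for _, d := range allowed {
		if d == domain {
			return nil
		}
	}
	return fmt.Errorf("packDomain %q is not one of the allowed domains %v", domain, allowed)
}

func main() {
	// Grove's standard domains, shipped as the chart default per the proposal.
	groveDefaults := []string{"region", "zone", "datacenter", "block", "rack", "host", "numa"}
	fmt.Println(validatePackDomain("rack", groveDefaults))           // <nil>
	fmt.Println(validatePackDomain("cabinet", groveDefaults) != nil) // true
}
```

For a Kueue-style backend, only the slice passed in would change; the check itself stays framework-neutral.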
> # Topology Aware Scheduling Integration for Dynamo Operator
I couldn't find in the design a reference to supporting multiple topologies.
In Run:ai, we support multiple topologies for multiple node pools in the same cluster (e.g. H100 vs. GB200). Therefore, once a Dynamo workload is submitted through Run:ai, we want to provide the required topology definition based on the node pool on which it runs.
The issue for Grove: ai-dynamo/grove#369
Actually, I removed the multi-topology requirement for now; it will be introduced later.
So the plan is to wait for Grove to complete the implementation and then add it here?
> **User example (with named topology):**
>
> ```yaml
> spec:
> ```
Instead of having a top-level serviceOverrides field, did we consider having topologyConstraint at the service level? This is similar to Grove's formatting and I think more consistent with our field conventions:

```yaml
spec:
  topologyConstraints:
    enabled: true
    topologyName: aws-p5   # optional; omit for framework default
    packDomain: zone
  services:
    - name: VllmWorker
      multinode: { nodeCount: 4 }
      replicas: 2
      topologyConstraint:
        packDomain: block
    - name: Frontend
      replicas: 1
      topologyConstraint:
        packDomain: zone
```
> Service-level constraints **MUST** be equal to or narrower than the deployment-level constraint. The operator **MUST** reject a DGD where a service constraint is broader than the deployment constraint.
>
> ### REQ 6 Framework Agnosticism
This might be difficult to achieve. In the current proposal we are tightly coupled to Grove's topologyDomain list.
This is an example of TAS enablement for LWS with Kueue where the required topology is set as labels - there is no intermediate CR to convert from domains to labels. I think the current proposal is fine - but just wanted to point this out.
To add to this: in Run:ai we allow the admin to configure the network topology, which is a list of labels. Then users can request constraints based on the topology labels defined for this cluster.
Good observation. You're right that the domain names (rack, block, etc.) come from Grove's vocabulary today. However, abstract domain names seem to be the right user-facing API: users should say "rack", not topology.kubernetes.io/rack.
If Dynamo later supports a different backend (e.g., Kueue/LWS), the translation from abstract domains to framework-specific values (node labels for Kueue, domain names for Grove) can be done via operator configuration (configmap, another CRD, ...) with domain-to-label mappings per framework. The DGD API would stay the same and only the translation layer would change.
For now, since Dynamo only generates Grove PCS, this works as-is. We can add a note in the open questions/future work section about generalizing the domain vocabulary for non-Grove backends.
WDYT?
Ya, I think this makes sense @julienmancuso; for now we can't really avoid some coupling, but in the future we can do translation at the operator level like you said.
I want to add that for KAI/Run:ai the topology CRD is also without domain names, as in Kueue.
If we use these specific domain names, I can't see how this fits the option for the user to provide the topology name.
The topology name is what defines which topology levels can be selected as constraints. If the list of topology levels is a closed list of domain names (rack, block, etc.), how do we plan this to work?
The translation from a Dynamo domain to framework-specific values can be done via a domain-to-label mapping per framework.
I added that in 453f2af.
I think using the abstract vocabulary makes it specific, because only Grove knows how to translate from these domain names to the relevant node labels; the more generic approach is using the labels. I do understand why we want to abstract the labels for users, but I'm not sure how this solution makes it generic.
Using labels directly would actually make the API less generic: labels are cluster-specific (e.g., cloud.provider.com/topology-rack vs topology.kubernetes.io/rack), so DGDs wouldn't be portable across clusters. It also wouldn't work with Grove, since Grove's PackDomain takes domain names, not labels; we'd need to reverse-lookup labels to domain names, which is fragile.
The abstract domains (rack, block, zone) are Dynamo's own vocabulary; they're generic at the user-facing API level. Users express intent in natural terms without needing to know cluster-specific labels. For Grove, these map 1:1 with no translation. For future backends like Kueue that use labels, the operator can translate internally via a domainToLabelMapping config; the DGD API stays the same. The genericness is at the API layer; the framework-specific translation is an internal operator concern.
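A tiny Go sketch of that translation layer (the `kueueDomainToLabel` mapping and its label values are made up for illustration; the real mapping would come from operator configuration):

```go
package main

import "fmt"

// Hypothetical per-framework mapping, as described above: Dynamo's
// abstract domain names on the left, framework/cluster-specific node
// labels on the right. Values here are illustrative only.
var kueueDomainToLabel = map[string]string{
	"zone": "topology.kubernetes.io/zone",
	"rack": "example.com/topology-rack",
}

// translateDomain resolves an abstract domain for a label-based backend.
// For Grove no translation is needed: PackDomain takes the domain name as-is.
func translateDomain(domain string, mapping map[string]string) (string, bool) {
	label, ok := mapping[domain]
	return label, ok
}

func main() {
	label, ok := translateDomain("rack", kueueDomainToLabel)
	fmt.Println(label, ok) // example.com/topology-rack true
}
```

The DGD spec only ever carries the abstract name; swapping backends swaps the mapping, not the API.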
> ## Validation (Webhook)
>
> - **Enabled flag:** If `enabled=false` and (`topologyName` set, or constraints specified) → reject.
Might want to break this out into CREATE vs. UPDATE. My understanding is that topologyConstraint at the Grove level is immutable and therefore we would also constrain it as immutable.
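A sketch of what that UPDATE-time check could look like in Go (types, the `validateUpdate` name, and the error text are hypothetical; this is just the shape of the immutability rule, not the actual webhook code):

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// Simplified stand-in for the proposal's constraint type.
type TopologyConstraint struct {
	PackDomain string
}

// validateUpdate rejects any change to the DGD-level constraint on UPDATE,
// mirroring Grove's immutability of topologyConstraint on the PCS.
// CREATE-time checks (allowed domains, enabled flag) would live elsewhere.
func validateUpdate(oldTC, newTC *TopologyConstraint) error {
	if !reflect.DeepEqual(oldTC, newTC) {
		return errors.New("spec.topologyConstraint is immutable once set")
	}
	return nil
}

func main() {
	a := &TopologyConstraint{PackDomain: "rack"}
	fmt.Println(validateUpdate(a, &TopologyConstraint{PackDomain: "rack"})) // <nil>
	fmt.Println(validateUpdate(a, &TopologyConstraint{PackDomain: "zone"})) // non-nil error
}
```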
**nvrohanv** left a comment
Overall looks good; I think we just need to settle on the defaults, and on whether we are going to have defaults at all.
> * Allow DGD users to opt in to topology-aware scheduling via a simple, framework-agnostic API.
>
> * Provide smart defaults (PCS=block, PCSG=rack) so users do not have to configure every level.
I'm worried that large deployments will exceed a block, and that we should instead default to something like zone.
Defaults are configurable; users can either change the default or override the constraint in their DGD directly.
> - Multinode services (PCSG): `packDomain: rack`; child cliques inherit from PCSG (no explicit constraint)
> - Single-node services (PodClique): `packDomain: rack`
>
> Both the smart defaults and the list of allowed topology domain names **MUST** be configurable at the Dynamo operator level (e.g., via Helm values or operator configuration), so that cluster administrators can tune them per cluster without requiring users to modify their DGDs. For Grove, the Helm defaults ship with `[region, zone, datacenter, block, rack, host, numa]`; for other frameworks, the admin overrides the list to match the supported domains.
What does this mean exactly? That the Dynamo admin will also need to define the levels of topology (e.g. rack, block...)? I think it can be confusing for admins to define the topology structure (CRD) in Grove, Dynamo, and Run:ai (in cases where it is used).
This is why we suggested allowing the user to add the topology name to the workload spec. This name lets Grove and the scheduler know which topology CRD to use, and the requested constraints will be based on that specific topology CRD.
Co-authored-by: Rohan Varma <rohanv@nvidia.com> Signed-off-by: Julien Mancuso <161955438+julienmancuso@users.noreply.github.com>
> ```go
> Enabled bool `json:"enabled"`
>
> // Deployment-level constraint. Required when enabled=true.
> Deployment *TopologyConstraint `json:"deployment,omitempty"`
> ```
Since this is already in a deployment context, should we rename this field to topologyConstraints for better clarity?
> ```go
> type TopologyConstraints struct {
> 	// Enabled controls whether TAS is applied. Default: false.
> 	Enabled bool `json:"enabled"`
> ```
Is the Enabled flag necessary? Couldn't we just infer that TAS is enabled if the constraints are provided?
> ## Status and Observability
>
> - **Status:** Add `AppliedTopologyConstraints` to DGD status reporting the applied constraints when `enabled=true`.
Currently, we allow changes to the Cluster Topology CR even if it's in use. If a user requests a constraint that no longer exists, enforcement stops. Shouldn't Dynamo propagate this scenario in the status so the user knows why the constraints are no longer being enforced?
Good point, this is a real scenario. However, looks like this is a grove-level responsibility: if the ClusterTopology is mutated in a way that breaks active PCS constraints, Grove should report that on the PCS status (e.g., a degraded condition). Dynamo already propagates PCS status/conditions to DGD status, so any issue Grove surfaces on the PCS will be visible at the DGD level. Having Dynamo independently watch the ClusterTopology CR for drift would tightly couple the controller to Grove internals and duplicate responsibility. If Grove doesn't currently surface this, it might be worth raising as a Grove enhancement.
Dynamo doesn't need to watch ClusterTopology for drift. Grove handles this by reporting a `TopologyLevelsUnavailable` condition on the PCS whenever a required topology level is removed.
Ok, yes, that makes sense; I updated the doc.
> * Allow DGD users to opt in to topology-aware scheduling via a simple, framework-agnostic API.
>
> * Require explicit deployment-level and per-service topology constraints when TAS is enabled (no defaults).
> * Require explicit deployment-level and per-service topology constraints when TAS is used (no defaults). TAS is in effect when `spec.topologyConstraint` is set.
If a user sets constraints only at the service level, does that mean the feature is enabled, or is it always required to set `spec.topologyConstraint` to enable the feature?
The full setup should be there (deployment + all services); otherwise we would reject it.
Could you clarify the rationale behind making this a strict requirement?
Which, I realized, is not correct. I'm going to relax the constraint: the feature will be enabled when at least one of the deployment or any service has a topology constraint set.
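The relaxed rule can be sketched in Go as follows (simplified, hypothetical types; `tasEnabled` is an illustrative helper, not the actual operator code):

```go
package main

import "fmt"

// Hypothetical, simplified shapes for the DGD topology fields.
type TopologyConstraint struct {
	PackDomain string
}

type TopologyConstraints struct {
	Deployment *TopologyConstraint
	Services   map[string]*TopologyConstraint // keyed by service name
}

// tasEnabled implements the relaxed rule described above: TAS is in
// effect as soon as the deployment level or any single service carries
// a constraint; a full deployment-plus-all-services setup is not required.
func tasEnabled(tc *TopologyConstraints) bool {
	if tc == nil {
		return false
	}
	if tc.Deployment != nil {
		return true
	}
	for _, c := range tc.Services {
		if c != nil {
			return true
		}
	}
	return false
}

func main() {
	serviceOnly := &TopologyConstraints{
		Services: map[string]*TopologyConstraint{"VllmWorker": {PackDomain: "block"}},
	}
	fmt.Println(tasEnabled(nil), tasEnabled(serviceOnly)) // false true
}
```

Under this rule, a service-only constraint (the case asked about above) is enough to turn the feature on.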