doc: add design proposal for topology-aware cluster selection #5986

Open

WMP wants to merge 1 commit into ceph:devel from WMP:proposal/topology-aware-cluster-selection

Conversation


@WMP WMP commented Jan 28, 2026

Describe what this PR does

Design proposal for topology-aware multi-cluster volume provisioning.
This enables the CSI driver to dynamically select the appropriate Ceph
cluster at CreateVolume time based on the node's topology zone.

The proposal introduces two new configuration mechanisms:

  • topologyDomainLabels field in config.json cluster entries — associates
    each cluster with Kubernetes topology labels
  • clusterIDs StorageClass parameter — a comma-separated list of candidate
    cluster IDs for topology-based selection

This is a design-only PR. Implementation will follow in a separate PR
once the design is approved.
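
For illustration, a cluster entry in the ceph-csi config.json carrying the proposed
topologyDomainLabels field could look roughly like this (cluster IDs, monitors, and
the exact shape of the new field are illustrative; the precise schema is what this
design is meant to settle):

```json
[
  {
    "clusterID": "ceph-zone-a",
    "monitors": ["10.0.1.1:6789", "10.0.1.2:6789"],
    "topologyDomainLabels": {
      "topology.kubernetes.io/zone": "zone-a"
    }
  },
  {
    "clusterID": "ceph-zone-b",
    "monitors": ["10.0.2.1:6789", "10.0.2.2:6789"],
    "topologyDomainLabels": {
      "topology.kubernetes.io/zone": "zone-b"
    }
  }
]
```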

Is there anything that requires special attention

  • The design is fully backward compatible. Existing configs and
    StorageClasses with a single clusterID work unchanged.
  • volumeBindingMode: WaitForFirstConsumer is required for topology-based
    selection (Kubernetes must provide AccessibilityRequirements).
  • The existing clusterID parameter takes priority when present —
    topology selection is only used as a fallback via the new clusterIDs
    parameter. A StorageClass sketch combining these pieces is shown below.
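
The following sketch is illustrative only: the provisioner and secret parameters are
the usual ceph-csi RBD ones, pool and secret names are placeholders, and whether
clusterID may be omitted entirely is listed under future concerns below.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd-topology-aware
provisioner: rbd.csi.ceph.com
parameters:
  # Proposed: candidate clusters for topology-based selection
  clusterIDs: "ceph-zone-a,ceph-zone-b"
  pool: replicapool
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
# Required so CreateVolume receives the scheduled node's topology
# (AccessibilityRequirements) and can pick the matching cluster
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```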

Related issues

Ref: #5177

Future concerns

  • Making clusterID fully optional when clusterIDs is provided
  • Combining topology-based cluster selection with topologyConstrainedPools
    for selecting both cluster and pool
  • E2E tests with multi-cluster topology setup

Checklist:

  • Commit Message Formatting: follows developer guide
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@mergify mergify bot added the ci/skip/e2e (skip running e2e CI jobs), ci/skip/multi-arch-build (skip building on multiple architectures), and component/docs (Issues and PRs related to documentation) labels on Jan 28, 2026
Add a design document describing topology-aware multi-cluster volume
provisioning. This enables the CSI driver to dynamically select the
appropriate Ceph cluster at CreateVolume time based on the node's
topology zone.

The proposal introduces two configuration mechanisms:
- topologyDomainLabels field in config.json cluster entries
- clusterIDs StorageClass parameter (comma-separated list)

Ref: ceph#5177

Signed-off-by: Marcin Janowski <[email protected]>
@WMP WMP force-pushed the proposal/topology-aware-cluster-selection branch from 76965d0 to 6c1c9d7 on January 28, 2026 09:02
@Rakshith-R
Contributor

Hey @WMP ,

Thanks for the contributions!

> Please note that all of this code is written by Claude, and at this point, it has not been built or tested. This is only a proposal to add this functionality. At the moment, it is really just vibe coding. However, if you agree that the idea of implementing this functionality is appropriate, I will continue my work and perform tests on live ceph and k8s clusters.

We would love to review and accept contributions.
Reviewing designs and code consumes a lot of effort.

Before you propose a design for a new feature, have you tested and understood how topology-based provisioning currently works in k8s, CSI, and cephcsi today?

This would help you understand the feature in depth and review the design of the proposed improvements yourself.

It would certainly boost our confidence for reviewing this design document.

@nixpanic
Member

Sounds like a nice feature to me. There are Rook users that deploy/maintain different Ceph clusters at different locations, but still use a single large Kubernetes cluster.

@travisn might be interested in this feature too? Any input/feedback is appreciated.

@travisn
Member

travisn commented Feb 26, 2026

There is already topology provisioning for a single ceph cluster. For reference, see this example for topology-based provisioning of an external cluster. The approach is similar for an internal ceph cluster as well, though we don't have that example in the docs currently.

Trying to support multiple ceph clusters from a single storage class sounds to me like a big change for the csi driver to support, although I'll let others decide on that. But let's be clear about the needed scenario. Is the need for scale, multi-tenancy, or other?

I would be surprised if Ceph didn't already scale enough to support all the storage needs for a single K8s cluster. And the existing topology-based provisioning based on pools in a single ceph cluster can already handle multi-tenancy. Separate pools can already be defined using device classes.
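
For readers less familiar with that existing mechanism: single-cluster topology-based
provisioning in ceph-csi is driven by the topologyConstrainedPools StorageClass
parameter, which maps pools to topology segments within one cluster. Roughly
(paraphrased from the ceph-csi examples; pool names and zone values are placeholders,
consult the linked docs for the exact schema):

```yaml
parameters:
  clusterID: <single-ceph-cluster-id>
  topologyConstrainedPools: |
    [
      {"poolName": "zone-a-pool",
       "domainSegments": [{"domainLabel": "zone", "value": "zone-a"}]},
      {"poolName": "zone-b-pool",
       "domainSegments": [{"domainLabel": "zone", "value": "zone-b"}]}
    ]
volumeBindingMode: WaitForFirstConsumer
```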

@moskalev

Please allow me to describe our use case for supporting multiple ceph clusters. I hope this provides motivation for why this feature could be useful to the community.

We run our infrastructure in several independent deployments that we call "zones". Each zone has all components (hardware/software/config) necessary to be able to stand on its own. While currently zones are connected via high-bandwidth low-latency links, we make no strict assumptions of that being true going into the future (e.g. zone migration to a different data center may raise latency/decrease cross-zone throughput). Kubernetes clusters span multiple zones for redundancy reasons. We require the apps (inside k8s) to remain functional even when a whole zone is offline. Zones' configuration management is structured in a way such that worst case scenario errors would only be affecting a single zone at a time.

If we were to run a single ceph cluster spanning multiple zones, we would encounter the following trade-offs (please correct me if I misunderstand the topologyConstrainedPools use case):

  • A simple single ceph cluster across multiple zones would require us to have high-bandwidth links at all times and, potentially, a separate set of those since we would need to decouple ceph's and applications' traffic. That gets costly quickly for geographically separated zones and makes zone migration much less feasible.

  • A ceph cluster with topology-constrained pools could be a potential option here, but that would still break the "zone standing on its own" assumption (if I understand correctly, we would have to configure ceph in stretch mode to regain the ability to lose a zone; this greatly increases the configuration complexity), and it would open the possibility for a single misconfiguration to critically affect applications in all zones simultaneously.

To keep the zones independent and CRUSH maps simple, we went with the multi-cluster approach. This also helps us in rolling out upgrades to our ceph clusters with little stress, at the expense of a few more repetitions.

@travisn
Member

travisn commented Mar 17, 2026

There are two statements that seem conflicting:

  1. "Each zone has all components (hardware/software/config) necessary to be able to stand on its own."
  2. "Kubernetes clusters span multiple zones for redundancy reasons. We require the apps (inside k8s) to remain functional even when a whole zone is offline."

How do you expect to have each zone be independent, and also apps remain functional even when a zone is offline? I would expect that if the apps are to remain online, their data must also be online, which would require the data to be replicated across multiple zones.

@moskalev

I apologize for confusion here.

The independence of a zone is not at the application level. "software" here means hypervisor layer, not the end-user's application.

You can think of a zone as a rack of machines: it spans multiple physical machines "vertically". A zone can function on its own as a hosting platform for VMs/k8s-nodes. A zone does not care about the end-user apps. A zone's goal is to ensure that VMs are running with as little downtime as possible and with as much performance as possible. We consider a zone to be a failure domain for the apps (the end-user apps know about zones, but zones are not aware about apps and do not have a goal of singlehandedly supporting an end-user app).

k8s clusters span zones "horizontally" - e.g. multiple physical machines from different racks (in different zones) host VMs that are part of a k8s cluster. k8s clusters are highly available - control plane/ingress/load balancing are distributed across all zones and are functional even when one zone is offline.

"We require the apps (inside k8s) to remain functional even when a whole zone is offline." means that it's the end-user app's responsibility to ensure that its state is properly synchronized between the zones, not the storage layer's task. For other storage backends, like local NVMes - it is very easy to expose them as a storage class in such cluster even when VMs are in different zones. We would like ceph to be that another backend - local to each zone, but exposed as a single storage class within k8s cluster for the apps to use.

Also, some of the apps do not even require state synchronization (you can think of them as jobs), but they need some storage (e.g. scratch space) to function. An end-user app with many worker pods just needs enough capacity (roughly equal across all zones) to remain functional and may not even care what zone to launch a new pod in as long as there is capacity. We would want to make it an easy problem by utilizing the same storage class across all zones.

@travisn
Member

travisn commented Mar 17, 2026

Do the applications have their own redundancy, thus do not require the storage to be replicated across zones?

In that case, you can create a single storage class with topology-awareness using device classes and Ceph pools for each device class on OSDs in separate zones. Did you read the topology-awareness example I linked in the previous comment?

@moskalev

Yes, end-user applications are responsible for their own data replication/redundancy.

This would require a single ceph cluster, which is exactly what we are trying to avoid. I do not see where the example covers the case of a single storage class composed from multiple ceph clusters.

@travisn
Member

travisn commented Mar 17, 2026

Yes, that example uses a single ceph cluster, but the Ceph pools are only on top of OSDs in a specific zone based on the device classes. It uses a single storage class with zone-aware provisioning, and avoids cross-zone traffic for the data, which seems to match your goals.

@moskalev

Unfortunately, a single ceph cluster means that:

  • zones are no longer independent. Let me clarify what I mean by this:
    -- ceph uses monitors to orchestrate access to OSDs by ceph clients. This implies that we need to run multiple monitors (in different zones) and those monitors must form quorum and agree before a client can access OSDs. Then, if a zone gets severed from the other zones, it is no longer able to work with ceph-backed storage volumes (for VMs or k8s pods). I do not think that a local pool in the zone solves that problem, since coordination layer (ceph monitors) is not local to the zone.
    -- even when everything is functioning normally, geographically distant zones will introduce latency. I do not have much experience with running ceph clusters over large distances, but I would imagine that extra latency and potential network jitter may negatively affect the cluster's health (e.g. mon_clock_drift_allowed has a default of 50ms, meaning that east-to-west-coast latency will likely be "unhealthy" for such setups). Since ceph benefits so much from a fast (both in throughput and latency) network, then why not give it the best conditions by keeping the ceph cluster entirely within a single zone?

  • a single ceph cluster becomes a single point of failure. Why:
    -- any error or misconfiguration of ceph cluster will likely affect ceph clients in all zones.
    -- any upgrade gone wrong will affect ceph clients in all zones.
    -- any potential data loss would mean that even an end-user application implementing proper cross-zone redundancy could still lose data.

For example, with separate ceph clusters (one per zone), we can do staggered upgrade rollouts. This minimizes the blast radius if something goes wrong. While unlikely, in some rare cases upgrades can be dangerous (see issue 5772 for ceph-csi).

I hope that illustrates why we would not like to setup ceph clusters spanning multiple zones.

@travisn
Member

travisn commented Mar 19, 2026

I agree it would be better to have separate Ceph clusters in geographically separate regions. At the same time, do you not have the same concerns for K8s with etcd?
I understand @dimm0 has successfully spanned K8s and Ceph across high latency zones, perhaps he can add some insight with that experience.

In any case, I'm not against this feature request, was just trying to understand the scenario more deeply and if it could be accomplished without changes.

@dimm0

dimm0 commented Mar 20, 2026

Yes we're running a global k8s cluster with multiple ceph clusters under rook that are more local. Also I agree that more automation on picking the right storage class (if I read the ticket correctly) would be helpful.

@moskalev

moskalev commented Apr 2, 2026

Etcd can indeed be problematic when latency rises with geographically distributed setups. Yet, this is less of a problem for us since k8s clusters can be built fairly "thin". That is, there is no significant resource requirement to run a k8s cluster, which makes it possible to run many clusters based on the project/app/team. This greatly reduces the load on the k8s control plane, and potential etcd issues are less pronounced and have limited scope when they appear.

Ceph, on the other hand, is resource heavy in the sense that it benefits from increasing the number of OSDs and hosts in a cluster. That makes it impractical to run multiple ceph clusters that each span multiple zones (i.e. ceph clusters would each follow a respective k8s cluster in all zones where that k8s cluster is present, which would reduce the load on an individual ceph cluster's coordination layer, as in the paragraph above). The better approach, from our point of view, is to run separate ceph clusters in different zones (a zone acting as a failure/connectivity domain with low latency and high bandwidth) and integrate them all into each k8s cluster we run.

I understand that introducing new features and code changes is not always desired, but I hope that our use case illustrates why there is demand for a feature like this. Maybe other ceph-csi users can share their examples and potential workarounds. We would also appreciate it if maintainers could review the related PR: #6116

@WMP
Author

WMP commented Apr 3, 2026

I want to share the current implementation status and also clarify the exact use case we are solving.

My target is RBD + CephFS, but CephFS is the first implementation. I am posting the current branch now so the community can see the direction of the implementation before I finish the RBD part and rerun the full validation on the exact current branch head.

Why we need this

Today, our production-like setup consists of three nearby zones (up to roughly 10 km apart) served by one Ceph cluster.

Our target architecture goes further: we plan to add another zone located roughly 2000 km away, with much higher inter-zone latency. This is one of the key reasons why we want Ceph-CSI to support a single topology-aware StorageClass backed by multiple independent Ceph clusters.

The main reasons are:

  • we want zones to stay operationally independent
  • we prefer several smaller, identical Ceph clusters over one very large cluster
  • upgrades are much easier and safer when they can be rolled out cluster-by-cluster
  • the blast radius is smaller if one Ceph cluster has a problem
  • we want a single StorageClass that works naturally for applications deployed via StatefulSet

This is not about Ceph lacking scale. It is about:

  • failure-domain isolation
  • upgrade safety
  • operational simplicity
  • keeping storage local to the zone where the pod runs

Application model

Yes, in our case the applications themselves are responsible for redundancy / replication across zones.

Examples are:

  • Elasticsearch
  • MinIO
  • PostgreSQL

Some workloads also just need zone-local persistent or scratch space and do not require storage-level replication across zones.

So the storage layer does not need to provide cross-zone HA here. What we need from Ceph-CSI is:

  • pick the correct local Ceph cluster for the zone
  • keep later operations on that same cluster
  • prevent pods from later landing on nodes that cannot mount that volume

Why this goes beyond #6116

I also looked at #6116. It is useful work, but it is much narrower than the topology problem we are trying to solve.

As I understand it, #6116:

  • is RBD-only
  • introduces a separate clusterTopologyConfigMap
  • resolves provisioning-time cluster selection
  • stores a selected secret name for that cluster

My branch is broader in scope and addresses additional topology problems that showed up in real validation:

  1. Hard accessibility, not only initial cluster selection
    Returning the correct clusterID at CreateVolume time is not enough.
    The driver also needs to return AccessibleTopology, so the external-provisioner can create PV nodeAffinity (see the sketch after this list).
    Otherwise a pod may later be scheduled onto a node in another zone / another Ceph cluster and fail at mount time.

  2. CephFS support
    The current work is CephFS-first, while the final goal is RBD + CephFS.

  3. Per-cluster lifecycle secrets
    In a real multi-cluster CephFS deployment, the selected cluster may also require different:

    • provisioner secret
    • node-stage secret
    • controller-expand secret
    • fsName
    • pool
  4. Correct topology matching even when the same clusterID appears multiple times
    In our current environment, one Ceph cluster serves three nearby zones (distance up to roughly 10 km), so the same clusterID appears multiple times with different topology labels.
    Matching only by clusterID was not sufficient in that setup.

    At the same time, our target architecture goes further: we plan to add a fourth zone located roughly 2000 km away, with much higher latency. That future expansion is one of the main reasons why we do not want to rely on a single Ceph cluster spanning all zones.
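
To make point 1 concrete: when CreateVolume returns AccessibleTopology for the
selected cluster, the external-provisioner records it as node affinity on the PV, so
the scheduler can no longer place a consuming pod in a zone whose Ceph cluster cannot
serve the volume. A sketch of the resulting PV fragment (the zone label and value are
placeholders for whatever domain labels the config defines):

```yaml
apiVersion: v1
kind: PersistentVolume
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - zone-a
```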

So from my perspective, #6116 is not the full solution for this use case. It solves a narrower part of the problem.

What has been validated

The full scenario set has already been validated on a live cluster using a tested diff prepared by my teammate.

That validation covered all of the following:

  • correct cluster selection by topology
  • correct handling of repeated clusterID across multiple zones
  • AccessibleTopology / PV nodeAffinity
  • pod deletion and remount on the same node
  • node drain and remount on another node in the same zone
  • StatefulSet deletion and recreation with PVC reuse
  • CephFS per-cluster secret handling
  • backward compatibility on a cluster with only a single Ceph cluster, to verify that existing users can upgrade Ceph-CSI without changing their configuration

Current branch status

The current implementation state is available here: https://github.com/WMP/ceph-csi/tree/feature/topology-aware-volume-accessibility

Important note: the exact current branch head has not yet been rebuilt and rerun through the full live-cluster validation.

The current branch is functionally the same implementation merged/adapted from the already tested diff, but this exact branch state is being shared now mainly so the community can review the implementation direction and progress.

I plan to run the full validation again on the exact current branch state in the next few days, after I finish the RBD support. Functionally, I expect that code path to be close to final.

Summary

So the request here is not simply “multi-cluster because one cluster does not scale”.

It is specifically about:

  • multiple independent Ceph clusters
  • zones separated by large distance / high latency
  • one StorageClass that is easy to use for StatefulSet-based applications
  • topology-aware provisioning that also remains correct for later lifecycle operations
  • backward compatibility for existing single-cluster users

If useful, I can also post the implementation branch/commit references here so maintainers can review the current code alongside this design discussion.

@Rakshith-R
Contributor

Hey,

I think we have established the need to have a single storageclass support topology based provisioning from different ceph clusters.

Let's focus on a unified design and implementation that works not just for RBD and CephFS but also for NVMe-oF, NFS, etc.
This should also consider Snapshots and GroupSnapshots.

Additionally, all the operations supported today (mounting, cloning, expansion, etc.) need to be extended as well.

IMO,

  • an existing storageclass parameter like clusterID should not be overridden to hold comma-separated values.
  • make use of the existing configmap, adding optional parameters like topologyDomain or topologyID?

Maybe add topologyID in the storageclass, filter by topologyID in the configmap, match topologyDomainLabel, and use the clusterID from that entry?

After creation, the PVC will be tied to that well-specified clusterID, making other ops on it consistent.
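
For example, something roughly like the following (names and values purely
illustrative, not an agreed schema): the StorageClass carries a single
topologyID: zonal-rbd parameter, the existing config.json entries gain the two
optional fields, and the driver filters entries by topologyID, matches the node's
topology label, and uses the clusterID from the matching entry:

```json
[
  {
    "clusterID": "ceph-zone-a",
    "monitors": ["10.0.1.1:6789"],
    "topologyID": "zonal-rbd",
    "topologyDomainLabel": {"topology.kubernetes.io/zone": "zone-a"}
  },
  {
    "clusterID": "ceph-zone-b",
    "monitors": ["10.0.2.1:6789"],
    "topologyID": "zonal-rbd",
    "topologyDomainLabel": {"topology.kubernetes.io/zone": "zone-b"}
  }
]
```

The resolved clusterID would then be encoded in the volume handle as it is today,
keeping all later operations (mount, clone, expand, snapshot) pinned to that cluster.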
