doc: add design proposal for topology-aware cluster selection #5986
WMP wants to merge 1 commit into ceph:devel from
Conversation
Add a design document describing topology-aware multi-cluster volume provisioning. This enables the CSI driver to dynamically select the appropriate Ceph cluster at CreateVolume time based on the node's topology zone. The proposal introduces two configuration mechanisms:

- `topologyDomainLabels` field in config.json cluster entries
- `clusterIDs` StorageClass parameter (comma-separated list)

Ref: ceph#5177

Signed-off-by: Marcin Janowski <[email protected]>
Force-pushed from 76965d0 to 6c1c9d7.
Hey @WMP, thanks for the contribution!

We would love to review and accept contributions. Before you propose a design for a new feature, have you tested and understood how topology-based provisioning works in Kubernetes, CSI, and Ceph-CSI today? That would help you review the proposed improvements in depth yourself, and it would certainly boost our confidence when reviewing this design document.
Sounds like a nice feature to me. There are Rook users that deploy/maintain different Ceph clusters at different locations, but still use a single large Kubernetes cluster. @travisn might be interested in this feature too? Any input/feedback is appreciated.
There is already topology provisioning for a single Ceph cluster. For reference, see this example for topology-based provisioning of an external cluster. The approach is similar for an internal Ceph cluster as well, though we don't currently have that example in the docs. Trying to support multiple Ceph clusters from a single storage class sounds to me like a big change for the CSI driver to support, although I'll let others decide on that. But let's be clear about the needed scenario: is the need scale, multi-tenancy, or something else? I would be surprised if Ceph didn't already scale enough to support all the storage needs of a single K8s cluster. And the existing topology-based provisioning based on pools in a single Ceph cluster can already handle multi-tenancy; separate pools can already be defined using device classes.
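For readers unfamiliar with it, the existing single-cluster mechanism referenced above is configured through the `topologyConstrainedPools` StorageClass parameter, which maps pools to topology segments. A rough sketch, with illustrative pool names and label values (see the upstream ceph-csi docs for the authoritative format):

```yaml
parameters:
  clusterID: <cluster-id>
  # JSON string mapping each pool to the topology segments it serves.
  topologyConstrainedPools: |
    [
      {
        "poolName": "zone-a-pool",
        "domainSegments": [
          {"domainLabel": "zone", "value": "zone-a"}
        ]
      },
      {
        "poolName": "zone-b-pool",
        "domainSegments": [
          {"domainLabel": "zone", "value": "zone-b"}
        ]
      }
    ]
```

Note that all pools here still belong to one Ceph cluster identified by the single `clusterID`, which is exactly the constraint the rest of this thread debates.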
Please allow me to describe our use case for multiple Ceph cluster support. I hope this explains why the feature could be useful to the community.

We run our infrastructure in several independent deployments that we call "zones". Each zone has all the components (hardware/software/config) necessary to stand on its own. While the zones are currently connected via high-bandwidth, low-latency links, we make no strict assumption that this will remain true in the future (e.g. migrating a zone to a different data center may raise latency or decrease cross-zone throughput). Kubernetes clusters span multiple zones for redundancy. We require the apps (inside k8s) to remain functional even when a whole zone is offline. Zone configuration management is structured so that a worst-case error would only affect a single zone at a time. If we ran a single Ceph cluster spanning multiple zones, we would encounter the following trade-offs (please correct me if I misunderstand the topologyConstrainedPools use case):

To keep the zones independent and the CRUSH maps simple, we went with the multi-cluster approach. This also helps us roll out upgrades to our Ceph clusters with little stress, at the expense of a few more repetitions.
There are two statements that seem conflicting:
How do you expect to have each zone be independent, and also have apps remain functional even when a zone is offline? I would expect that if the apps are to remain online, their data must also be online, which would require the data to be replicated across multiple zones.
I apologize for the confusion here. The independence of a zone is not at the application level. "Software" here means the hypervisor layer, not the end-user's application.

You can think of a zone as a rack of machines: it spans multiple physical machines "vertically". A zone can function on its own as a hosting platform for VMs/k8s nodes. A zone does not care about the end-user apps. A zone's goal is to ensure that VMs are running with as little downtime and as much performance as possible. We consider a zone to be a failure domain for the apps (the end-user apps know about zones, but zones are not aware of apps and have no goal of single-handedly supporting an end-user app).

k8s clusters span zones "horizontally": multiple physical machines from different racks (in different zones) host VMs that are part of a k8s cluster. k8s clusters are highly available: control plane/ingress/load balancing are distributed across all zones and remain functional even when one zone is offline. "We require the apps (inside k8s) to remain functional even when a whole zone is offline" means that it is the end-user app's responsibility to ensure that its state is properly synchronized between zones, not the storage layer's task.

For other storage backends, like local NVMes, it is very easy to expose them as a storage class in such a cluster even when VMs are in different zones. We would like Ceph to be another such backend: local to each zone, but exposed as a single storage class within the k8s cluster for the apps to use. Also, some of the apps do not even require state synchronization (you can think of them as jobs), but they need some storage (e.g. scratch space) to function. An end-user app with many worker pods just needs enough capacity (roughly equal across all zones) to remain functional, and may not even care which zone a new pod launches in as long as there is capacity.

We would want to make it an easy problem by utilizing the same storage class across all zones.
Do the applications have their own redundancy, and thus do not require the storage to be replicated across zones? In that case, you can create a single storage class with topology awareness using device classes and Ceph pools for each device class on OSDs in separate zones. Did you read the topology-awareness example I linked in the previous comment?
Yes, end-user applications are responsible for their own data replication/redundancy. That approach would require a single Ceph cluster, which is exactly what we are trying to avoid. I do not see where the example covers the case of a single storage class composed from multiple Ceph clusters.
Yes, that example uses a single Ceph cluster, but the Ceph pools are only on top of OSDs in a specific zone based on the device classes. It uses a single storage class with zone-aware provisioning, and avoids cross-zone traffic for the data, which seems to match your goals.
Unfortunately, a single Ceph cluster means that:

For example, with separate Ceph clusters (one per zone), we can do staggered upgrade rollouts. This minimizes the blast radius if something goes wrong. While unlikely, upgrades can be dangerous in some rare cases (see issue 5772 for ceph-csi). I hope that illustrates why we would not like to set up Ceph clusters spanning multiple zones.
I agree it would be better to have separate Ceph clusters in geographically separate regions. At the same time, do you not have the same concerns for K8s with etcd? In any case, I'm not against this feature request; I was just trying to understand the scenario more deeply and whether it could be accomplished without changes.
Yes, we're running a global k8s cluster with multiple Ceph clusters under Rook that are more local. I also agree that more automation for picking the right storage class (if I read the ticket correctly) would be helpful.
Etcd can indeed be problematic when latency rises in geographically distributed setups. Yet this is less of a problem for us, since k8s clusters can be built fairly "thin". That is, there is no significant resource requirement to run a k8s cluster, which makes it possible to run many clusters per project/app/team. This greatly reduces the load on the k8s control plane, and potential etcd issues are less pronounced and have limited scope when they appear.

Ceph, on the other hand, is resource-heavy in the sense that it benefits from increasing the number of OSDs and hosts in a cluster. That makes it impractical to run multiple Ceph clusters that each span multiple zones (i.e. Ceph clusters would each follow a respective k8s cluster in all zones where that k8s cluster is present, which would reduce the load on an individual Ceph cluster's coordination layer as in the paragraph above). The better approach, from our point of view, is to run separate Ceph clusters in different zones (a zone acting as a failure/connectivity domain with low latency and high bandwidth) and integrate them all into each k8s cluster we run.

I understand that introducing new features and code changes is not always desired, but I hope that our use case illustrates why there is demand for a feature like this. Maybe other ceph-csi users can share their examples and potential workarounds. We would also appreciate it if maintainers could review the related PR: 6116
I want to share the current implementation status and also clarify the exact use case we are solving. My target is RBD + CephFS, but CephFS is the first implementation. I am posting the current branch now so the community can see the direction of the implementation before I finish the RBD part and rerun the full validation on the exact current branch head.

Why we need this

Today, our current production-like setup consists of three nearby zones (up to roughly 10 km apart) served by one Ceph cluster. Our target architecture goes further: we plan to add another zone located roughly 2000 km away, with much higher inter-zone latency. This is one of the key reasons why we want Ceph-CSI to support a single topology-aware StorageClass backed by multiple independent Ceph clusters. The main reasons are:

This is not about Ceph lacking scale. It is about:

Application model

Yes, in our case the applications themselves are responsible for redundancy/replication across zones. Examples are:

Some workloads also just need zone-local persistent or scratch space and do not require storage-level replication across zones. So the storage layer does not need to provide cross-zone HA here. What we need from Ceph-CSI is:

Why this goes beyond #6116

I also looked at #6116. As I understand it,

My branch is broader in scope and addresses additional topology problems that showed up in real validation:

So from my perspective,

What has been validated

The full scenario set has already been validated on a live cluster using a tested diff prepared by my teammate. That validation covered all of the following:

Current branch status

The current implementation state is available here: https://github.com/WMP/ceph-csi/tree/feature/topology-aware-volume-accessibility

Important note: the exact current branch head has not yet been rebuilt and rerun through the full live-cluster validation. The current branch is functionally the same implementation merged/adapted from the already tested diff, but this exact branch state is being shared now mainly so the community can review the implementation direction and progress. I plan to run the full validation again on the exact current branch state in the next few days, after I finish the RBD support. Functionally, I expect that code path to be close to final.

Summary

So the request here is not simply "multi-cluster because one cluster does not scale". It is specifically about:

If useful, I can also post the implementation branch/commit references here so maintainers can review the current code alongside this design discussion.
Hey, I think we have established the need for a single StorageClass to support topology-based provisioning from different Ceph clusters. Let's focus on a unified design and implementation that works not just for RBD and CephFS but for NVMe-oF, NFS, etc. Additionally, mounting, cloning, expansion, and all the other ops supported today need to be extended as well. IMO,

Maybe add a topologyID in the StorageClass, filter by topologyID in the ConfigMap, match the topologyDomainLabels, and use the clusterID from that entry? After creation, the PVC will be tied to that well-specified clusterID, making other ops on it consistent.
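The selection flow discussed in this thread (match the node's topology segments against per-cluster labels from the config, then use that entry's clusterID for all later ops) could be sketched roughly as below. The type and field names are illustrative placeholders, not the final schema:

```go
package main

import "fmt"

// clusterEntry mirrors a hypothetical config.json entry that carries
// topology labels alongside the cluster ID (names are illustrative).
type clusterEntry struct {
	ClusterID            string
	TopologyDomainLabels map[string]string
}

// selectCluster returns the ID of the first candidate cluster whose
// topology labels all match the accessibility-requirement segments
// reported for the node at CreateVolume time.
func selectCluster(clusters []clusterEntry, segments map[string]string) (string, error) {
	for _, c := range clusters {
		if len(c.TopologyDomainLabels) == 0 {
			continue // entries without labels never match by topology
		}
		matched := true
		for label, want := range c.TopologyDomainLabels {
			if segments[label] != want {
				matched = false
				break
			}
		}
		if matched {
			return c.ClusterID, nil
		}
	}
	return "", fmt.Errorf("no cluster matches topology %v", segments)
}

func main() {
	clusters := []clusterEntry{
		{ClusterID: "ceph-zone-a", TopologyDomainLabels: map[string]string{"topology.kubernetes.io/zone": "zone-a"}},
		{ClusterID: "ceph-zone-b", TopologyDomainLabels: map[string]string{"topology.kubernetes.io/zone": "zone-b"}},
	}
	id, _ := selectCluster(clusters, map[string]string{"topology.kubernetes.io/zone": "zone-b"})
	fmt.Println(id) // ceph-zone-b
}
```

Once a cluster is selected, its ID would be encoded into the volume handle as today, so mount, clone, expand, and delete all resolve to the same cluster deterministically.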
Describe what this PR does
Design proposal for topology-aware multi-cluster volume provisioning.
This enables the CSI driver to dynamically select the appropriate Ceph
cluster at CreateVolume time based on the node's topology zone.
The proposal introduces two new configuration mechanisms:
- `topologyDomainLabels` field in config.json cluster entries: associates each cluster with Kubernetes topology labels
- `clusterIDs` StorageClass parameter: a comma-separated list of candidate cluster IDs for topology-based selection
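As a rough illustration of the first mechanism, a config.json where each cluster entry carries its topology labels might look like this (the `topologyDomainLabels` field is what this proposal introduces; monitor addresses and IDs are example values):

```json
[
  {
    "clusterID": "ceph-zone-a",
    "monitors": ["10.0.1.1:6789"],
    "topologyDomainLabels": {
      "topology.kubernetes.io/zone": "zone-a"
    }
  },
  {
    "clusterID": "ceph-zone-b",
    "monitors": ["10.0.2.1:6789"],
    "topologyDomainLabels": {
      "topology.kubernetes.io/zone": "zone-b"
    }
  }
]
```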
This is a design-only PR. Implementation will follow in a separate PR
once the design is approved.
Is there anything that requires special attention
- StorageClasses with a single `clusterID` work unchanged.
- `volumeBindingMode: WaitForFirstConsumer` is required for topology-based selection (Kubernetes must provide AccessibilityRequirements).
- The existing `clusterID` parameter takes priority when present; topology selection is only used as a fallback via the new `clusterIDs` parameter.
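For illustration, a CephFS StorageClass using the proposed fallback might look like the sketch below. The `clusterIDs` parameter is the new mechanism proposed here; all other values are placeholder examples:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-multi-cluster
provisioner: cephfs.csi.ceph.com
parameters:
  # Proposed: candidate clusters for topology-based selection
  # (no single clusterID is pinned here).
  clusterIDs: "ceph-zone-a,ceph-zone-b"
  fsName: cephfs
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph
# Required so Kubernetes passes AccessibilityRequirements at CreateVolume.
volumeBindingMode: WaitForFirstConsumer
```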
Related issues
Ref: #5177
Future concerns
- `clusterID` fully optional when `clusterIDs` is provided
- `topologyConstrainedPools` for selecting both cluster and pool