Adding multi-cluster support via array of cluster configs based on topology #6116
dridge93 wants to merge 1 commit into ceph:devel from
Conversation
Force-pushed from ef3f0c6 to 035176f
Has anyone had a chance to review this PR?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
This pull request now has conflicts with the target branch. Could you please resolve conflicts and force push the corrected changes? 🙏
Adding multi-cluster support via array of cluster configs based on topology
Signed-off-by: Devin Ridge <dridge@globalnoc.iu.edu>
Force-pushed from 035176f to 933bbff
I found what looks like a functional bug in the original multi-AZ / multi-cluster topology patch.

The new helper already returns the topology requirements from the request:

```go
func GetClusterTopologiesFromRequest(
	req *csi.CreateVolumeRequest,
) (*[]ClusterTopology, *csi.TopologyRequirement, error)
```

and inside it:

```go
accessibilityRequirements := req.GetAccessibilityRequirements()
...
return &clusterTopologies, accessibilityRequirements, nil
```

But in `parseVolCreateRequest()` that return value is discarded:

```go
rbdVol.TopologyPools, rbdVol.TopologyRequirement, err = util.GetTopologyFromRequest(req)
...
rbdVol.ClusterTopologies, _, err = util.GetClusterTopologiesFromRequest(req)
```

So when the StorageClass uses only:

```yaml
parameters:
  clusterTopologyConfigMap: ceph-csi-rbd-cluster-topology
```

and does not use `topologyConstrainedPools`, then:

- `rbdVol.ClusterTopologies` is populated
- `rbdVol.TopologyRequirement` stays nil

Later both `CreateVolume()` and `updateTopologyConstraints()` do:

```go
selectedCluster, _, err = util.FindClusterAndTopology(
	rbdVol.ClusterTopologies,
	rbdVol.TopologyRequirement,
)
```

or:

```go
cluster, topology, err := util.FindClusterAndTopology(
	rbdVol.ClusterTopologies,
	rbdVol.TopologyRequirement,
)
```

but `FindClusterAndTopology()` starts with:

```go
if clusterTopologies == nil || accessibilityRequirements == nil {
	return ClusterTopology{}, nil, nil
}
```

So the cluster selection silently returns empty, and CreateVolume fails with:

```
no matching cluster found for provided topology requirements
```

even though the incoming request already contains valid topology like:

```json
"accessibility_requirements": {
  "preferred": [
    {"segments":{"topology.rbd.csi.ceph.com/zone":"az01"}},
    {"segments":{"topology.rbd.csi.ceph.com/zone":"az02"}}
  ]
}
```

and the live clusterTopologyConfigMap contains matching zones:

```json
{
  "clusterTopology": [
    {
      "clusterID": "...",
      "zones": ["az01"],
      ...
    },
    {
      "clusterID": "...",
      "zones": ["az02"],
      ...
    }
  ]
}
```

The minimal fix seems to be to preserve the topology requirement returned by `GetClusterTopologiesFromRequest()` when `rbdVol.TopologyRequirement` was not already set:

```diff
diff --git a/internal/rbd/controllerserver.go b/internal/rbd/controllerserver.go
index 6ef562373..dd4a3a148 100644
--- a/internal/rbd/controllerserver.go
+++ b/internal/rbd/controllerserver.go
@@ -237,10 +237,14 @@ func (cs *ControllerServer) parseVolCreateRequest(
 	}
 
 	// store cluster topology information from the request if present
-	rbdVol.ClusterTopologies, _, err = util.GetClusterTopologiesFromRequest(req)
+	var clusterTopologyRequirement *csi.TopologyRequirement
+	rbdVol.ClusterTopologies, clusterTopologyRequirement, err = util.GetClusterTopologiesFromRequest(req)
 	if err != nil {
 		return nil, status.Error(codes.InvalidArgument, err.Error())
 	}
+	if rbdVol.TopologyRequirement == nil {
+		rbdVol.TopologyRequirement = clusterTopologyRequirement
+	}
 
 	// parse QOS parameters from mutable parameters
 	err = rbdVol.SetQOS(ctx, req.GetMutableParameters())
```
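For context, here is a minimal sketch of the zone matching that `FindClusterAndTopology()` presumably performs; the `ClusterTopology` field names and the helper name are assumptions based on the snippets above, not the PR's exact code. It shows why a nil `TopologyRequirement` short-circuits selection before any zone is ever compared:

```go
// Hypothetical sketch only; names are assumed from the snippets above and
// may not match the PR's actual implementation.
package util

import "github.com/container-storage-interface/spec/lib/go/csi"

const zoneLabel = "topology.rbd.csi.ceph.com/zone"

// ClusterTopology mirrors one entry of the clusterTopologyConfigMap shown above.
type ClusterTopology struct {
	ClusterID  string
	Zones      []string
	SecretName string
}

// findClusterAndTopologySketch walks the preferred (then requisite) topology
// segments and returns the first cluster whose zone list contains the
// requested zone. With accessibilityRequirements == nil it returns an empty
// ClusterTopology, which is the silent failure described above.
func findClusterAndTopologySketch(
	clusterTopologies *[]ClusterTopology,
	accessibilityRequirements *csi.TopologyRequirement,
) (ClusterTopology, map[string]string, error) {
	if clusterTopologies == nil || accessibilityRequirements == nil {
		return ClusterTopology{}, nil, nil
	}

	candidates := make([]*csi.Topology, 0,
		len(accessibilityRequirements.GetPreferred())+len(accessibilityRequirements.GetRequisite()))
	candidates = append(candidates, accessibilityRequirements.GetPreferred()...)
	candidates = append(candidates, accessibilityRequirements.GetRequisite()...)

	for _, topo := range candidates {
		zone := topo.GetSegments()[zoneLabel]
		for _, cluster := range *clusterTopologies {
			for _, z := range cluster.Zones {
				if z == zone {
					return cluster, topo.GetSegments(), nil
				}
			}
		}
	}

	return ClusterTopology{}, nil, nil
}
```

With the controllerserver.go diff above applied, the second argument is non-nil in the ConfigMap-only case, so the matching loop actually runs.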
@dridge93 In the controller-side flow the selected cluster's secret is already resolved:

```go
selectedCluster, _, err = util.FindClusterAndTopology(...)
...
rbdVol.ClusterSecretName = selectedCluster.SecretName
...
if len(secrets) == 0 && selectedCluster.SecretName != "" {
	secrets, err = k8s.GetSecret(selectedCluster.SecretName, namespace)
}
```

and then `clusterSecretName` is persisted into the VolumeContext:

```go
if rbdVol.ClusterSecretName != "" {
	vol.VolumeContext["clusterSecretName"] = rbdVol.ClusterSecretName
}
```

So the controller side already knows which secret belongs to the selected cluster.

However, on the node path, `NodeStageVolume()` still requires standard CSI node-stage secrets to be present in the request:

```go
if req.GetSecrets() == nil || len(req.GetSecrets()) == 0 {
	return status.Error(codes.InvalidArgument, "stage secrets cannot be nil or empty")
}
```

and then immediately does:

```go
cr, err := util.NewUserCredentialsWithMigration(req.GetSecrets())
```

This means that a StorageClass using only:

```yaml
parameters:
  clusterTopologyConfigMap: ceph-csi-rbd-cluster-topology
```

still fails at node stage unless `csi.storage.k8s.io/node-stage-secret-*` is also set statically. That is problematic for the multi-cluster mode, because the right secret differs per selected cluster and is only known after topology-based selection, so it cannot be configured statically in the StorageClass.

The minimal fix is to let the node path fall back to the `clusterSecretName` stored in the VolumeContext when no stage secrets are passed in the request. Example fix:

```diff
diff --git a/internal/util/validate.go b/internal/util/validate.go
@@
-	if req.GetSecrets() == nil || len(req.GetSecrets()) == 0 {
-		return status.Error(codes.InvalidArgument, "stage secrets cannot be nil or empty")
-	}
+	if req.GetSecrets() == nil || len(req.GetSecrets()) == 0 {
+		if req.GetVolumeContext()["clusterSecretName"] == "" {
+			return status.Error(codes.InvalidArgument, "stage secrets cannot be nil or empty")
+		}
+	}
diff --git a/internal/rbd/nodeserver.go b/internal/rbd/nodeserver.go
@@
-	cr, err := util.NewUserCredentialsWithMigration(req.GetSecrets())
+	secrets := req.GetSecrets()
+	if len(secrets) == 0 {
+		clusterSecretName := req.GetVolumeContext()["clusterSecretName"]
+		if clusterSecretName != "" {
+			namespace, nsErr := util.GetPodNamespace()
+			if nsErr != nil {
+				return nil, status.Error(codes.InvalidArgument, nsErr.Error())
+			}
+			secrets, err = k8s.GetSecret(clusterSecretName, namespace)
+			if err != nil {
+				return nil, status.Error(codes.InvalidArgument, err.Error())
+			}
+		}
+	}
+
+	cr, err := util.NewUserCredentialsWithMigration(secrets)
```

Without this, the multi-cluster topology flow is only partially implemented: provisioning can pick the per-cluster secret, but node staging still depends on a single statically configured secret.
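As a side note on why this fallback is workable at all: per the CSI spec, the `volume_context` returned from `CreateVolume` is handed back verbatim by the CO in `NodeStageVolumeRequest`, so anything the controller records there is visible to the node plugin. A tiny illustration (the volume ID and secret name below are made-up values, not from this PR):

```go
// Illustration only; the volume ID and secret name are hypothetical values.
package example

import "github.com/container-storage-interface/spec/lib/go/csi"

func volumeContextRoundTrip() map[string]string {
	// Controller side: CreateVolume records the selected cluster's secret name.
	created := &csi.Volume{
		VolumeId: "0001-0024-example", // hypothetical
		VolumeContext: map[string]string{
			"clusterSecretName": "ceph-csi-az01-secret", // hypothetical
		},
	}

	// Node side: the CO copies volume_context into NodeStageVolumeRequest
	// unchanged, so the fallback in the diff above can read clusterSecretName
	// even when no node-stage secrets are configured.
	stageReq := &csi.NodeStageVolumeRequest{
		VolumeId:      created.GetVolumeId(),
		VolumeContext: created.GetVolumeContext(),
		Secrets:       nil,
	}

	return stageReq.GetVolumeContext()
}
```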
Adding multi-cluster support via array of cluster configs based on topology
Describe what this PR does
Adds multi-cluster support to Ceph CSI RBD (#5177)
Related issues
Fixes: #5177
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>: retest the <job-name> after an unrelated failure (please report the failure too!)