fix: prevent resource definition deletion when PVC still exists #429
Open
kvaps wants to merge 6 commits into piraeusdatastore:master from
Conversation
When restoring a snapshot, the previous logic always selected the first available node from the snapshot's node list. This caused all clones of the same source volume to restore on the same node, concentrating storage load and potentially exhausting the thin pool on that single node. Randomize the node selection among available candidates to distribute restore operations evenly across snapshot nodes. Preferred (topology-matching) nodes are still prioritized, but a random one is chosen when multiple candidates exist.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
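The selection logic this commit describes can be sketched as a small helper. The function name `selectRestoreNode` and its inputs are illustrative, not the driver's actual identifiers; the sketch only shows the stated policy: intersect available nodes with preferred (topology-matching) ones, fall back to all available nodes, then pick a random member.

```go
package main

import (
	"fmt"
	"math/rand"
)

// selectRestoreNode picks a node for a snapshot restore. Preferred
// (topology-matching) nodes are prioritized; within the chosen candidate
// set a random member is returned so that repeated clones of the same
// source spread across the snapshot's nodes instead of piling onto one.
func selectRestoreNode(available, preferred []string) (string, bool) {
	prefSet := make(map[string]bool, len(preferred))
	for _, n := range preferred {
		prefSet[n] = true
	}
	// Keep only available nodes that also match the topology preference.
	var candidates []string
	for _, n := range available {
		if prefSet[n] {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		candidates = available // no preferred match: any available node
	}
	if len(candidates) == 0 {
		return "", false // nothing can serve the restore
	}
	return candidates[rand.Intn(len(candidates))], true
}

func main() {
	node, ok := selectRestoreNode(
		[]string{"node-a", "node-b", "node-c"},
		[]string{"node-b"},
	)
	fmt.Println(node, ok) // node-b true (only one preferred candidate)
}
```

With several candidates the random pick is what distributes load: each clone of the same golden image lands on a uniformly chosen snapshot node rather than always the first one in the list.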
…parameters

Add two new StorageClass parameters that trigger asynchronous relocation of replicas to optimal nodes after clone or snapshot restore operations. When cloning volumes or restoring from snapshots, LINSTOR places replicas on the same nodes as the source. For golden image use cases this concentrates all clones on source nodes. The new parameters use LINSTOR's query-size-info API to determine optimal placement and migrate-disk API to relocate replicas asynchronously. Both parameters default to true. Relocation is best-effort: failures are logged as warnings and do not block volume creation. An aux property on the resource definition ensures idempotency across CSI controller retries.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Move the snapshot restore relocation parameter from StorageClass to VolumeSnapshotClass where it belongs. This prevents unwanted relocation when Velero creates temporary PVCs during data mover backup (snapshotMoveData), since those PVCs use StorageClass parameters but not VolumeSnapshotClass parameters.

Changes:
- Remove relocateAfterSnapshotRestore from StorageClass parameters
- Add snap.linstor.csi.linbit.com/relocate-after-restore to VolumeSnapshotClass parameters
- Change relocateAfterClone default to false

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Rename snap.linstor.csi.linbit.com/relocate-after-restore to snap.linstor.csi.linbit.com/relocateAfterRestore for consistency with StorageClass parameter naming convention.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
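After this rename, a snapshot class that opts into relocation on restore might look like the following. The manifest is an illustrative sketch: the class name and deletionPolicy are made up here, and only the parameter key comes from the commits above.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: linstor-snapshots        # illustrative name
driver: linstor.csi.linbit.com
deletionPolicy: Delete
parameters:
  # Final parameter name after the rename commit above; per the
  # earlier commit, relocation is best-effort and asynchronous.
  snap.linstor.csi.linbit.com/relocateAfterRestore: "true"
```

Keeping the parameter on the VolumeSnapshotClass rather than the StorageClass is what shields Velero's temporary data-mover PVCs, which only consult StorageClass parameters.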
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Before deleting a resource definition, check if the corresponding Kubernetes PV is still Bound to a PVC that is not being deleted. This prevents accidental RD deletion when resources are stuck in DELETING state (e.g. due to DRBD bitmap mismatch), which could cause data loss for active volumes.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Contributor
Thanks for looking into this. There are commits in this PR that are unrelated to the description. Please keep the PR focused on one topic.
Summary
Before deleting a resource definition, check if the corresponding Kubernetes PV
is still Bound to a PVC that is not being deleted. This prevents accidental RD
deletion when resources are stuck in DELETING state.
Problem
When DRBD resources get stuck in DELETING state (e.g. due to bitmap mismatch
after toggle-disk), deleteResourceDefinitionAndGroupIfUnused() checks only
whether all resources have the FlagDelete flag. If they do, it proceeds to
delete the entire resource definition -- even if the PVC is still active and
Bound.
This can lead to data loss: the PVC remains Bound, the VM continues running
with cached data, but the LINSTOR resource definition is deleted.
Fix
Add a safety check at the beginning of deleteResourceDefinitionAndGroupIfUnused():
query the Kubernetes API for the PV (by name, which matches the LINSTOR resource
name). If the PV exists, is not being deleted, and is Bound to a PVC that is also
not being deleted -- abort the RD deletion.

Uses the existing dynamic.Interface client pattern already used elsewhere in the
codebase.
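The decision the safety check makes can be sketched as a pure function. The types below are simplified stand-ins for the corev1 PersistentVolume/PersistentVolumeClaim objects (the actual PR resolves them via a dynamic.Interface client); `safeToDeleteRD` and the field names are illustrative, not the PR's identifiers.

```go
package main

import "fmt"

// pvcInfo is a simplified stand-in for a PersistentVolumeClaim.
type pvcInfo struct {
	deleting bool // DeletionTimestamp is set
}

// pvInfo is a simplified stand-in for a PersistentVolume whose name
// matches the LINSTOR resource name.
type pvInfo struct {
	phase    string   // PV status phase, e.g. "Bound"
	deleting bool     // DeletionTimestamp is set
	claim    *pvcInfo // bound PVC resolved from Spec.ClaimRef, if any
}

// safeToDeleteRD reports whether the resource definition may be deleted.
// It mirrors the check described above: abort only when the PV exists,
// is not being deleted, and is Bound to a PVC that is also not being
// deleted -- i.e. when the volume is still actively in use.
func safeToDeleteRD(pv *pvInfo) bool {
	if pv == nil {
		return true // no PV with this name: nothing to protect
	}
	if pv.deleting {
		return true // the PV itself is already going away
	}
	if pv.phase == "Bound" && pv.claim != nil && !pv.claim.deleting {
		return false // active volume: keep the resource definition
	}
	return true
}

func main() {
	active := &pvInfo{phase: "Bound", claim: &pvcInfo{}}
	fmt.Println(safeToDeleteRD(active)) // false: PVC still in use
	fmt.Println(safeToDeleteRD(nil))    // true: PV already gone
}
```

Note the default is permissive: the check only blocks deletion in the one state the PR identifies as dangerous (Bound PV with a live PVC), so legitimate cleanup of released or deleting volumes proceeds as before.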