fix(toggle-disk): allow retry and cancellation of failed operations #475
base: master
Conversation
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use the actual device path in the res file during toggle-disk; fix the LUKS data offset
  - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res-file regeneration when the child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates the plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```

## Summary by CodeRabbit

* **New Features**
  * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention.
  * Introduced configurable container image references via Helm values for streamlined deployment.
* **Bug Fixes**
  * Enhanced disk toggle operations with retry and cancellation support for better error handling.
  * Improved metadata creation during disk state transitions.
  * Added device accessibility checks to prevent errors when underlying storage devices are unavailable.
  * Fixed LUKS encryption header sizing for consistent deployment across nodes.
When a toggle-disk operation fails (e.g., the satellite disconnected during the operation), the DISK_ADD_REQUESTED or DISK_REMOVE_REQUESTED flag remains set. Previously, attempting to retry or cancel the operation would throw an "already requested" error.

This change allows:

1. Retry of the same operation by simply re-triggering the updateAndAdjustDisk flow with existing layer data intact (the satellite handles it idempotently)
2. Cancellation by requesting the opposite operation (e.g., toggle-disk with removeDisk when DISK_ADD_REQUESTED is set will revert to diskless)
3. Cleanup of orphaned storage layers when a resource is marked diskless but still has LUKS or non-diskless storage layers from a failed operation

The retry logic no longer removes and recreates layer data, as removeLayerData() only deletes from the controller DB without calling drbdadm down on the satellite, which would leave orphaned DRBD devices in the kernel.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
The previous retry logic in toggle-disk removed layer data from the controller DB and recreated it. However, removeLayerData() only deletes from the database without calling drbdadm down on the satellite, leaving orphaned DRBD devices in the kernel that occupy ports and block new operations.

This fix changes retry to simply repeat the operation with existing layer data, allowing the satellite to handle it idempotently.

Upstream: LINBIT/linstor-server#475

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
…1823)

## Summary

Fix a bug in the toggle-disk retry logic that left orphaned DRBD devices in the kernel.

## Problem

When a toggle-disk retry was triggered (e.g., the user retries after a failed operation), the code called `removeLayerData()` to clean up and recreate the layer stack. However, `removeLayerData()` only removes data from the controller's database — it does NOT call `drbdadm down` on the satellite. This caused DRBD devices to remain in the kernel (visible in `drbdsetup` but not managed by LINSTOR), occupying ports and blocking subsequent operations.

## Solution

Changed the retry logic to simply repeat the operation with existing layer data intact. The satellite handles this idempotently without creating orphaned resources.

## Upstream

- LINBIT/linstor-server#475 (updated)

```release-note
[linstor] Fix orphaned DRBD devices during toggle-disk retry
```

## Summary by CodeRabbit

* **New Features**
  * Added cancel and retry capabilities for disk addition operations
  * Added cancel and retry capabilities for disk removal operations
  * Improved cleanup handling for diskless resources with orphaned storage layers
The previous retry logic in toggle-disk removed layer data from the controller DB and recreated it. However, removeLayerData() only deletes from the database without calling drbdadm down on the satellite, leaving orphaned DRBD devices in the kernel that occupy ports and block new operations.

This fix changes retry to simply repeat the operation with existing layer data, allowing the satellite to handle it idempotently.

Upstream: LINBIT/linstor-server#475

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
(cherry picked from commit 8151e1e)
Hey kvaps! I do like the idea, but I tried your PR and it did not work. Maybe I was just testing differently than you did. I tried to look into the code and realized that the method …

Do I assume correctly that you are testing by simply adding something like …
Hey @ghernadi, thank you for your work on this. My code is AI-generated and might not fully fit LINSTOR's coding standards and design decisions.

Regarding the LUKS options — I didn't know that LINSTOR has an option to override them, so I used the defaults. I encountered an issue where, after upgrading Talos Linux to a newer version, my LUKS volumes couldn't be created because of a different offset (see #472). Another issue is that once this happens, I cannot do anything — neither delete nor recreate these volumes. Management is completely blocked. My PRs are aimed at recovering from such situations.

There's also a race condition issue: sometimes drbdadm adjust for LUKS volumes fails because the LUKS device hasn't appeared in device mapper (and thus as a block device) before drbdadm adjust runs.
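The race described above can be worked around by polling for the block device node before invoking drbdadm adjust. A minimal sketch of that idea, assuming nothing about LINSTOR internals (the `DeviceWait` class, method name, and timeout values are illustrative, not LINSTOR code):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper, not LINSTOR code: poll until a block device node
// (e.g. /dev/mapper/<luks-volume>) appears, or the timeout elapses.
final class DeviceWait {
    static boolean waitForDevice(Path dev, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!Files.exists(dev)) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // device never appeared; caller should fail or retry
            }
            Thread.sleep(50); // back off briefly before re-checking
        }
        return true;
    }
}
```

A caller would invoke `waitForDevice` on the dm-crypt device path right before running drbdadm adjust, failing the operation cleanly instead of racing device mapper.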
Summary
Allow retry and cancellation of failed toggle-disk operations, and cleanup of orphaned storage layers.
Problem
If a toggle-disk operation fails (e.g., due to storage issues or a satellite disconnect), the resource is stuck with the `DISK_ADD_REQUESTED` or `DISK_REMOVE_REQUESTED` flag set. Any further toggle-disk attempts fail with an "already requested" error, leaving the user unable to recover.

This is especially problematic for LUKS-encrypted resources, where partial failures can leave dm-crypt devices and backing storage orphaned.
Solution
**Retry** - if toggle-disk is called with the same direction while the requested flag is already set, treat it as a retry. The operation is simply re-triggered with existing layer data intact, allowing the satellite to handle it idempotently.

**Cancellation** - if toggle-disk is called with the opposite direction, cancel the pending operation. For add-disk cancellation, use the existing disk removal flow (`DISK_REMOVING`) to properly clean up storage on the satellite.

**Orphaned storage cleanup** - if a resource is marked diskless but has orphaned storage layers (LUKS or non-diskless STORAGE), trigger cleanup via the disk removal flow.
Note: The retry logic intentionally keeps existing layer data rather than removing and recreating it, because `removeLayerData()` only deletes from the controller DB without calling `drbdadm down` on the satellite, which would leave orphaned DRBD devices in the kernel.

Test plan
- `toggle-disk --diskless` properly cleans up storage
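The retry, cancellation, and cleanup behaviors boil down to a small decision on the pending flag versus the requested direction. A hedged sketch of that decision table, using hypothetical names rather than LINSTOR's actual classes:

```java
// Illustrative model of the toggle-disk decision described above;
// enum and method names are hypothetical, not LINSTOR's actual API.
enum Pending { NONE, DISK_ADD_REQUESTED, DISK_REMOVE_REQUESTED }
enum Request { ADD_DISK, REMOVE_DISK }
enum Action { START, RETRY, CANCEL }

final class ToggleDiskDecision {
    static Action decide(Pending pending, Request req) {
        if (pending == Pending.NONE) {
            return Action.START; // no failed operation outstanding
        }
        boolean sameDirection =
            (pending == Pending.DISK_ADD_REQUESTED && req == Request.ADD_DISK)
                || (pending == Pending.DISK_REMOVE_REQUESTED && req == Request.REMOVE_DISK);
        // Same direction: re-trigger with existing layer data intact (idempotent retry).
        // Opposite direction: cancel the pending operation via the disk removal flow.
        return sameDirection ? Action.RETRY : Action.CANCEL;
    }
}
```

Cancelling a pending add this way reverts the resource to diskless, matching the behavior the Solution section describes.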