fix(toggle-disk): allow retry and cancellation of failed operations #475
base: master
Conversation
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use the actual device path in the res file during toggle-disk; fix the LUKS data offset
  - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res-file regeneration when the child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates the plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```

## Summary by CodeRabbit

* **New Features**
  * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention.
  * Introduced configurable container image references via Helm values for streamlined deployment.
* **Bug Fixes**
  * Enhanced disk toggle operations with retry and cancellation support for better error handling.
  * Improved metadata creation during disk state transitions.
  * Added device accessibility checks to prevent errors when underlying storage devices are unavailable.
  * Fixed LUKS encryption header sizing for consistent deployment across nodes.
When a toggle-disk operation fails (e.g., the satellite disconnected during the operation), the DISK_ADD_REQUESTED or DISK_REMOVE_REQUESTED flag remains set. Previously, attempting to retry or cancel the operation would throw an "already requested" error.

This change allows:

1. Retry of the same operation by simply re-triggering the updateAndAdjustDisk flow with existing layer data intact (the satellite handles it idempotently)
2. Cancellation by requesting the opposite operation (e.g., toggle-disk with removeDisk when DISK_ADD_REQUESTED is set will revert to diskless)
3. Cleanup of orphaned storage layers when a resource is marked diskless but still has LUKS or non-diskless storage layers from a failed operation

The retry logic no longer removes and recreates layer data, as removeLayerData() only deletes from the controller DB without calling drbdadm down on the satellite, which would leave orphaned DRBD devices in the kernel.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
The previous retry logic in toggle-disk removed layer data from the controller DB and recreated it. However, removeLayerData() only deletes from the database without calling drbdadm down on the satellite, leaving orphaned DRBD devices in the kernel that occupy ports and block new operations.

This fix changes retry to simply repeat the operation with existing layer data, allowing the satellite to handle it idempotently.

Upstream: LINBIT/linstor-server#475

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
…1823)

## Summary

Fix a bug in the toggle-disk retry logic that left orphaned DRBD devices in the kernel.

## Problem

When a toggle-disk retry was triggered (e.g., the user retries after a failed operation), the code called `removeLayerData()` to clean up and recreate the layer stack. However, `removeLayerData()` only removes data from the controller's database — it does NOT call `drbdadm down` on the satellite. This caused DRBD devices to remain in the kernel (visible in `drbdsetup` but not managed by LINSTOR), occupying ports and blocking subsequent operations.

## Solution

Changed the retry logic to simply repeat the operation with existing layer data intact. The satellite handles this idempotently without creating orphaned resources.

## Upstream

- LINBIT/linstor-server#475 (updated)

```release-note
[linstor] Fix orphaned DRBD devices during toggle-disk retry
```

## Summary by CodeRabbit

* **New Features**
  * Added cancel and retry capabilities for disk addition operations
  * Added cancel and retry capabilities for disk removal operations
  * Improved cleanup handling for diskless resources with orphaned storage layers
The previous retry logic in toggle-disk removed layer data from the controller DB and recreated it. However, removeLayerData() only deletes from the database without calling drbdadm down on the satellite, leaving orphaned DRBD devices in the kernel that occupy ports and block new operations.

This fix changes retry to simply repeat the operation with existing layer data, allowing the satellite to handle it idempotently.

Upstream: LINBIT/linstor-server#475

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
(cherry picked from commit 8151e1e)
Hey kvaps! I do like the idea, but I tried your PR and it did not work. Maybe I was just testing differently than you did. I tried to look into the code and realized that the method …

Do I assume correctly that you are testing by simply adding something like …
Hey @ghernadi, thank you for your work on this. My code is AI-generated and might not fully fit LINSTOR's coding standards and design decisions.

Regarding the LUKS options — I didn't know that LINSTOR has an option to override them, so I used the defaults. I encountered an issue where, after upgrading Talos Linux to a newer version, my LUKS volumes couldn't be created because of a different offset (see #472). Another issue is that once this happens, I cannot do anything — neither delete nor recreate these volumes. Management is completely blocked. My PRs are aimed at recovering from such situations.

There's also a race condition issue: sometimes drbdadm adjust for LUKS volumes fails because the LUKS device hasn't appeared in device mapper (and thus as a block device) before drbdadm adjust runs.
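The race described above can be worked around by polling for the block device node before invoking drbdadm adjust. A minimal sketch of that idea, assuming nothing about LINSTOR internals (the `DeviceWait` class, method name, and timeout values are illustrative, not LINSTOR code):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper, not LINSTOR code: poll until a block device node
// (e.g. /dev/mapper/<luks-volume>) appears, or the timeout elapses.
final class DeviceWait {
    static boolean waitForDevice(Path dev, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!Files.exists(dev)) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // device never appeared; caller should fail or retry
            }
            Thread.sleep(50); // back off briefly before re-checking
        }
        return true;
    }
}
```

A caller would invoke `waitForDevice` on the dm-crypt device path right before running drbdadm adjust, failing the operation cleanly instead of racing device mapper.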
Summary
Allow retry and cancellation of failed toggle-disk operations, and cleanup of orphaned storage layers.
Problem
If a toggle-disk operation fails (e.g., due to storage issues or a satellite disconnect), the resource is stuck with the `DISK_ADD_REQUESTED` or `DISK_REMOVE_REQUESTED` flag set. Any further toggle-disk attempts fail with an "already requested" error, leaving the user unable to recover.

This is especially problematic for LUKS-encrypted resources, where partial failures can leave dm-crypt devices and backing storage orphaned.
Solution
**Retry** - if toggle-disk is called with the same direction while the requested flag is already set, treat it as a retry. The operation is simply re-triggered with existing layer data intact, allowing the satellite to handle it idempotently.

**Cancellation** - if toggle-disk is called with the opposite direction, cancel the pending operation. For add-disk cancellation, use the existing disk removal flow (`DISK_REMOVING`) to properly clean up storage on the satellite.

**Orphaned storage cleanup** - if a resource is marked diskless but has orphaned storage layers (LUKS or non-diskless STORAGE), trigger cleanup via the disk removal flow.
Note: The retry logic intentionally keeps existing layer data rather than removing and recreating it, because `removeLayerData()` only deletes from the controller DB without calling `drbdadm down` on the satellite, which would leave orphaned DRBD devices in the kernel.

Test plan
- `toggle-disk --diskless` properly cleans up storage
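The retry, cancellation, and cleanup behaviors boil down to a small decision on the pending flag versus the requested direction. A hedged sketch of that decision table, using hypothetical names rather than LINSTOR's actual classes:

```java
// Illustrative model of the toggle-disk decision described above;
// enum and method names are hypothetical, not LINSTOR's actual API.
enum Pending { NONE, DISK_ADD_REQUESTED, DISK_REMOVE_REQUESTED }
enum Request { ADD_DISK, REMOVE_DISK }
enum Action { START, RETRY, CANCEL }

final class ToggleDiskDecision {
    static Action decide(Pending pending, Request req) {
        if (pending == Pending.NONE) {
            return Action.START; // no failed operation outstanding
        }
        boolean sameDirection =
            (pending == Pending.DISK_ADD_REQUESTED && req == Request.ADD_DISK)
                || (pending == Pending.DISK_REMOVE_REQUESTED && req == Request.REMOVE_DISK);
        // Same direction: re-trigger with existing layer data intact (idempotent retry).
        // Opposite direction: cancel the pending operation via the disk removal flow.
        return sameDirection ? Action.RETRY : Action.CANCEL;
    }
}
```

Cancelling a pending add this way reverts the resource to diskless, matching the behavior the Solution section describes.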