Conversation

@kvaps commented Jan 6, 2026

Summary

Allow retry and cancellation of failed toggle-disk operations, and cleanup of orphaned storage layers.

Problem

If a toggle-disk operation fails (e.g., due to storage issues or a satellite disconnect), the resource is left stuck with the DISK_ADD_REQUESTED or DISK_REMOVE_REQUESTED flag set. Any further toggle-disk attempt then fails with an "already requested" error, leaving the user unable to recover.

This is especially problematic for LUKS-encrypted resources, where a partial failure can leave dm-crypt devices and their backing storage orphaned.

Solution

  1. Retry - if toggle-disk is called with the same direction while the requested flag is already set, treat it as a retry. The operation is simply re-triggered with the existing layer data intact, allowing the satellite to handle it idempotently.

  2. Cancellation - if toggle-disk is called with the opposite direction, cancel the pending operation. For add-disk cancellation, the existing disk-removal flow (DISK_REMOVING) is used to properly clean up storage on the satellite.

  3. Orphaned storage cleanup - if the resource is marked diskless but still has orphaned storage layers (LUKS or non-diskless STORAGE), trigger cleanup via the disk-removal flow.

Note: the retry logic intentionally keeps the existing layer data rather than removing and recreating it, because removeLayerData() only deletes from the controller DB without calling drbdadm down on the satellite, which would leave orphaned DRBD devices in the kernel.
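
For illustration, a minimal CLI sketch of these semantics (node, resource, and pool names are placeholders; exact client flags may differ between versions):

```shell
# A failed add-disk leaves DISK_ADD_REQUESTED set. Re-issuing the same
# command in the same direction is now treated as a retry:
linstor resource toggle-disk node1 my-res --storage-pool my-pool

# Issuing the opposite direction instead cancels the pending add via
# the regular disk-removal flow (DISK_REMOVING on the satellite):
linstor resource toggle-disk node1 my-res --diskless
```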

Test plan

  • Start toggle-disk to add a disk, then simulate a failure mid-operation
  • Verify that a retry with the same command succeeds
  • Verify that cancellation with toggle-disk --diskless properly cleans up storage
  • Verify that orphaned LUKS/storage layers are cleaned up
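
One way to drive this plan from the CLI (a sketch: node, resource, and pool names are placeholders, and stopping the satellite service is just one way to simulate the failure; the unit name depends on the installation):

```shell
# Start the add-disk, then interrupt the satellite mid-operation:
linstor resource toggle-disk node1 my-res --storage-pool my-pool &
systemctl stop linstor-satellite      # run on node1 to simulate the failure

# Bring the satellite back and verify the retry (or cancellation) succeeds:
systemctl start linstor-satellite
linstor resource toggle-disk node1 my-res --storage-pool my-pool   # retry
linstor resource list                 # resource should reach its target state
drbdsetup status                      # no orphaned DRBD devices should remain
```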

@kvaps marked this pull request as ready for review January 6, 2026 09:16
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 6, 2026
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset
  - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```


## Summary by CodeRabbit

* **New Features**
  * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention.
  * Introduced configurable container image references via Helm values for streamlined deployment.

* **Bug Fixes**
  * Enhanced disk toggle operations with retry and cancellation support for better error handling.
  * Improved metadata creation during disk state transitions.
  * Added device accessibility checks to prevent errors when underlying storage devices are unavailable.
  * Fixed LUKS encryption header sizing for consistent deployment across nodes.

When a toggle-disk operation fails (e.g., the satellite disconnected during the operation), the DISK_ADD_REQUESTED or DISK_REMOVE_REQUESTED flag remains set. Previously, attempting to retry or cancel the operation would throw an "already requested" error.

This change allows:
1. Retry of the same operation by simply re-triggering the updateAndAdjustDisk flow
   with existing layer data intact (satellite handles idempotently)
2. Cancellation by requesting the opposite operation (e.g., toggle-disk with removeDisk
   when DISK_ADD_REQUESTED is set will revert to diskless)
3. Cleanup of orphaned storage layers when the resource is marked diskless but still
   has LUKS or non-diskless storage layers from a failed operation

The retry logic no longer removes and recreates layer data, as removeLayerData()
only deletes from controller DB without calling drbdadm down on satellite,
which would leave orphaned DRBD devices in the kernel.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
@kvaps force-pushed the fix/toggle-disk-cancel-retry branch from 810c5ee to 39c9bba on January 7, 2026 10:54
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 7, 2026
The previous retry logic in toggle-disk removed the layer data from the controller DB
and recreated it. However, removeLayerData() only deletes from the database
without calling drbdadm down on the satellite, leaving orphaned DRBD devices
in the kernel that occupy ports and block new operations.

This fix changes retry to simply repeat the operation with existing layer data,
allowing the satellite to handle it idempotently.

Upstream: LINBIT/linstor-server#475

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 7, 2026
…1823)

## Summary

Fix a bug in the toggle-disk retry logic that left orphaned DRBD devices in the kernel.

## Problem

When toggle-disk retry was triggered (e.g., user retries after a failed
operation), the code called `removeLayerData()` to clean up and recreate
the layer stack. However, `removeLayerData()` only removes data from the
controller's database — it does NOT call `drbdadm down` on the
satellite.

This caused DRBD devices to remain in the kernel (visible in `drbdsetup`
but not managed by LINSTOR), occupying ports and blocking subsequent
operations.

## Solution

Changed retry logic to simply repeat the operation with existing layer
data intact. The satellite handles this idempotently without creating
orphaned resources.

## Upstream

- LINBIT/linstor-server#475 (updated)

```release-note
[linstor] Fix orphaned DRBD devices during toggle-disk retry
```


## Summary by CodeRabbit

* **New Features**
  * Added cancel and retry capabilities for disk addition operations
  * Added cancel and retry capabilities for disk removal operations
  * Improved cleanup handling for diskless resources with orphaned storage layers

@ghernadi (Contributor) commented
Hey kvaps!

I do like the idea, but I tried your PR and it did not work; maybe I was just testing differently than you did. I looked into the code and realized that the method toggleDiskInTransaction was already too convoluted before your patch, and your patch did not exactly help there.
So I ended up refactoring the method and also implementing an enum for RETRY, ABORT, NORMAL (the current behavior), and NOOP (i.e., when the resource is already diskless and the user wants to toggle-disk into diskless). My tests are fine with that.

Do I assume correctly that you are testing by simply adding something like
`linstor n sp $node StorDriver/LuksFormatOptions '--luks2-metadata-size 32M'` on one of the nodes and then trying to abort the toggle-disk afterwards?
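
Spelled out, that reproduction might look like this (a sketch with placeholder node, resource, and pool names; the failing format options come from the command above):

```shell
# Force LUKS format options that make the toggle-disk fail partway:
linstor n sp node1 StorDriver/LuksFormatOptions '--luks2-metadata-size 32M'
linstor resource toggle-disk node1 my-res --storage-pool my-pool   # fails
linstor resource toggle-disk node1 my-res --diskless               # abort
```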

@kvaps (Author) commented Jan 21, 2026

Hey @ghernadi, thank you for your work on this.
It's good to know that this functionality has been reviewed and tested by maintainers.

My code is AI-generated and might not fully fit into LINSTOR's coding standards and design decisions.
Feel free to close my PRs if the same functionality is implemented by yourselves.

Regarding the LUKS options — I didn't know that LINSTOR has an option to override them, so I used the defaults.

I encountered an issue where, after upgrading Talos Linux to a newer version, my LUKS volumes couldn't be created because of a different offset (see #472).

Another issue is that once this happens, I cannot do anything — neither delete nor recreate these volumes. Management is completely blocked. My PRs are aimed at recovering from such situations.

There's also a race-condition issue: sometimes drbdadm adjust for LUKS volumes fails because the LUKS device hasn't yet appeared in the device mapper (and thus as a block device) by the time drbdadm adjust is executed with that device referenced in the res file (probably fixed in #477).
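
As an illustration of that race, a manual workaround sketch (not LINSTOR's fix; the /dev/mapper name is a placeholder):

```shell
# Wait for the opened LUKS device to appear as a block device,
# then re-run the adjust for the affected resource:
udevadm settle
until [ -b /dev/mapper/my-res-luks ]; do sleep 0.2; done
drbdadm adjust my-res
```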
