Description
LDK's current persistence model requires the ChannelManager and each ChannelMonitor to be persisted independently, at different times, by different callers. This creates a fundamental consistency problem: on restart, the ChannelManager and ChannelMonitor states may not agree, which is the root cause of unnecessary force closes after crashes or unclean shutdowns. To mitigate this, a significant amount of reconciliation logic runs on startup to detect and resolve inconsistencies, adding complexity and still not covering all edge cases.
The lack of atomic persistence also necessitates channel freezing via ChannelMonitorUpdateStatus::InProgress: channels are paused while persistence catches up. This machinery is difficult to reason about and still has edge cases.
Proposed approach
Instead of persisting the ChannelManager and ChannelMonitors as independent operations, persist them together in a single atomic batch through a queuing KV store layer.
1. ChannelManager persists itself at the right moments
Rather than relying on the background processor to periodically call an external persist function, the ChannelManager holds a reference to the KV store and writes its own state at exactly two chokepoints: before returning events from process_pending_events, and before returning messages from get_and_clear_pending_msg_events. This guarantees that the persisted state always matches the events the caller is about to handle. If the caller crashes mid-handling, the events simply replay on restart.
The key insight is that state changes only need to be persisted before they become externally observable. As long as no events or messages have been handed to the caller, the system can safely restart from the last persisted state and re-derive the same changes. This is why persisting at just these two chokepoints is sufficient.
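As a minimal sketch of the chokepoint idea (the `KVStore` trait, `MemStore`, and `Manager` types below are illustrative stand-ins, not LDK's actual API): the manager writes its serialized state first, and only then hands the pending events to the caller.

```rust
use std::collections::HashMap;

// Hypothetical minimal KV store trait; LDK's real KVStore trait differs.
trait KVStore {
    fn write(&mut self, key: &str, value: Vec<u8>);
}

struct MemStore {
    map: HashMap<String, Vec<u8>>,
}

impl KVStore for MemStore {
    fn write(&mut self, key: &str, value: Vec<u8>) {
        self.map.insert(key.to_string(), value);
    }
}

// Sketch of a manager that persists its state *before* exposing events,
// so a crash during event handling replays the same events on restart.
struct Manager<S: KVStore> {
    store: S,
    pending_events: Vec<String>,
    state: Vec<u8>,
}

impl<S: KVStore> Manager<S> {
    fn process_pending_events(&mut self) -> Vec<String> {
        // Chokepoint: persist current state first...
        self.store.write("manager", self.state.clone());
        // ...then hand the events to the caller.
        std::mem::take(&mut self.pending_events)
    }
}
```

Because persistence happens strictly before the events become externally observable, a crash at any point either replays the events (persisted, not yet handled) or re-derives them from the last snapshot (not yet persisted).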
The current persistence flag mechanism triggers full re-serialization even when nothing recoverable has changed. This approach eliminates those redundant writes entirely.
2. Per-channel keys for granular ChannelManager updates
The ChannelManager currently serializes all of its channel state into a single blob. Instead, each channel's data (as stored within the ChannelManager, not the ChannelMonitor) is written to its own KV store key. Combined with change detection (comparing serialized state against the bytes written at the last persist), updating one channel out of thousands writes only that channel's key plus the small manager metadata, not a re-serialization of the entire ChannelManager.
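The change-detection step could look like the following sketch, where serialized channel state is compared against what was last written and only dirty channels produce writes (the `ChannelPersister` name and key layout are illustrative assumptions):

```rust
use std::collections::HashMap;

// Sketch of change detection for per-channel keys. Each channel's serialized
// state is compared against the bytes written at the last persist; only
// channels whose bytes differ are rewritten.
struct ChannelPersister {
    last_written: HashMap<String, Vec<u8>>,
}

impl ChannelPersister {
    // Returns the (key, bytes) pairs that actually need a write this round.
    fn dirty_writes(
        &mut self,
        channels: &HashMap<String, Vec<u8>>,
    ) -> Vec<(String, Vec<u8>)> {
        let mut writes = Vec::new();
        for (chan_id, bytes) in channels {
            if self.last_written.get(chan_id) != Some(bytes) {
                // Hypothetical per-channel key namespace.
                writes.push((format!("manager/channels/{chan_id}"), bytes.clone()));
                self.last_written.insert(chan_id.clone(), bytes.clone());
            }
        }
        writes
    }
}
```

A node with thousands of channels where one channel changes thus produces one channel write plus the small metadata write, rather than a full manager re-serialization.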
3. Batched atomic commits
A QueuedKVStoreSync wrapper buffers all writes (monitor updates, manager updates) in memory. On commit(), it serializes all queued changes into a single value and writes it to the underlying KV store under a unique sequenced key. Because any KV store can write a single key atomically (e.g. FilesystemStore uses write-to-temp + rename), this guarantees that either all changes from a commit are persisted or none are. There is never a window where one is persisted without the other, eliminating the force close problem and the need for startup reconciliation.
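A rough sketch of the queue-and-commit mechanics, assuming an in-memory map stands in for the real backing store and a simple length-prefixed encoding for the delta blob (the encoding and key layout here are illustrative, not the proposal's wire format):

```rust
use std::collections::BTreeMap;

// Sketch of a queued KV store wrapper: writes are buffered in memory and
// flushed as a single atomic value under a sequenced delta key on commit().
struct QueuedKvStore {
    queue: BTreeMap<String, Vec<u8>>,
    inner: BTreeMap<String, Vec<u8>>, // stands in for the real backing store
    next_seq: u64,
}

impl QueuedKvStore {
    fn write(&mut self, key: &str, value: Vec<u8>) {
        self.queue.insert(key.to_string(), value);
    }

    fn commit(&mut self) {
        if self.queue.is_empty() {
            return;
        }
        // Serialize every queued (key, value) pair into one blob. A single
        // key write is atomic in any reasonable KV store, so either the
        // whole batch lands or none of it does.
        let mut blob = Vec::new();
        for (k, v) in &self.queue {
            blob.extend_from_slice(&(k.len() as u32).to_be_bytes());
            blob.extend_from_slice(k.as_bytes());
            blob.extend_from_slice(&(v.len() as u32).to_be_bytes());
            blob.extend_from_slice(v);
        }
        // Zero-padded sequence number keeps deltas lexicographically ordered.
        let delta_key = format!("delta/{:020}", self.next_seq);
        self.next_seq += 1;
        self.inner.insert(delta_key, blob);
        self.queue.clear();
    }
}
```

Monitor updates, manager channel keys, and manager metadata written between two chokepoints all end up in the same delta, which is what makes the commit atomic across components.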
Multiple monitor updates that occur between two chokepoints are naturally batched into a single write operation instead of hitting disk individually. This benefits both individual forwards (which touch two channels) and busy nodes where many unrelated channel updates accumulate between event processing cycles. This means fewer fsyncs, which is often the dominant cost in persistence-heavy workloads.
On startup, QueuedKVStoreSync reads the base snapshot plus any unconsolidated delta keys (ordered by sequence number) and replays them to reconstruct the current state. Reads during normal operation check the in-memory queue first and fall back to the inner store, so callers always see the latest buffered state.
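The startup replay can be sketched as a simple fold: start from the base snapshot and apply each decoded delta in sequence order, with later writes overwriting earlier ones (the data shapes here are assumptions for illustration):

```rust
use std::collections::BTreeMap;

// Sketch of startup replay: begin with the base snapshot, then apply every
// unconsolidated delta in ascending sequence order to reconstruct the
// current state.
fn replay(
    base: BTreeMap<String, Vec<u8>>,
    deltas: &BTreeMap<u64, Vec<(String, Vec<u8>)>>, // seq -> decoded writes
) -> BTreeMap<String, Vec<u8>> {
    let mut state = base;
    // BTreeMap iterates keys in ascending order, so later deltas win.
    for writes in deltas.values() {
        for (key, value) in writes {
            state.insert(key.clone(), value.clone());
        }
    }
    state
}
```

Since replay is a pure function of snapshot plus ordered deltas, the reconstructed state is identical regardless of how many deltas have accumulated.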
4. Async monitor persistence can be removed
The InProgress variant of ChannelMonitorUpdateStatus was originally added for performance on high-latency storage backends, where blocking on each individual monitor write would be too slow. With batched writes, all monitor updates are queued in memory (no I/O) and flushed as a single write on commit. Since there is only one write operation, the performance concern that motivated per-monitor async persistence no longer applies. The single commit can still happen asynchronously as long as we hold off on sending messages and handling events until it completes. This eliminates the channel freezing machinery and the associated edge cases around fund loss.
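The "hold off until the commit completes" rule can be reduced to a small gate, sketched here with a boolean standing in for awaiting the real storage future (the `EventGate` type is purely illustrative):

```rust
// Sketch of commit gating: the batched commit may run asynchronously, but
// events and outbound messages are withheld until it has completed durably.
struct EventGate {
    commit_done: bool,
    pending_events: Vec<String>,
}

impl EventGate {
    // Called when the asynchronous batched write has been confirmed durable.
    fn on_commit_complete(&mut self) {
        self.commit_done = true;
    }

    // Events are only handed out once the commit covering them is durable;
    // until then the caller observes nothing and the system can safely
    // restart from the last persisted state.
    fn take_events(&mut self) -> Vec<String> {
        if !self.commit_done {
            return Vec::new();
        }
        std::mem::take(&mut self.pending_events)
    }
}
```

This replaces per-monitor InProgress tracking with a single gate on the one batched write, which is what makes the freezing machinery removable.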
5. Background consolidation of deltas
Each commit writes a small delta containing only the changes since the last commit. Over time these deltas accumulate. A background thread can read all outstanding deltas, integrate them into a new full state snapshot, and then remove the consumed deltas. This takes over the role of the current consolidation mechanism in MonitorUpdatingPersister (which replays individual monitor updates into full monitor snapshots), and extends it to also consolidate partial ChannelManager saves where only changed channels were persisted. Consolidation is purely an optimization that does not affect correctness; the system works fine with any number of unconsolidated deltas, it just means startup takes longer as more keys need to be read and replayed. The consolidation thread has no interaction with the hot path and can run at whatever pace the storage backend allows.
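Consolidation is the inverse of startup replay: fold all outstanding deltas into the snapshot, then delete the consumed delta keys. A sketch under the same illustrative data shapes as above:

```rust
use std::collections::BTreeMap;

// Illustrative store layout: one full snapshot plus sequenced deltas.
struct Store {
    snapshot: BTreeMap<String, Vec<u8>>,
    deltas: BTreeMap<u64, Vec<(String, Vec<u8>)>>,
}

// Sketch of background consolidation: integrate every outstanding delta into
// the snapshot, then remove the consumed delta keys. Purely an optimization;
// correctness never depends on when (or whether) this runs.
fn consolidate(store: &mut Store) {
    // Snapshot the sequence numbers present at the start; deltas written
    // concurrently by the hot path are simply picked up on the next run.
    let consumed: Vec<u64> = store.deltas.keys().cloned().collect();
    for seq in &consumed {
        if let Some(writes) = store.deltas.remove(seq) {
            for (k, v) in writes {
                store.snapshot.insert(k, v);
            }
        }
    }
}
```

Because the hot path only ever appends new delta keys, the consolidation pass never contends with it and can run as slowly as the storage backend requires.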
6. Extensible to application state
Higher-level application state (e.g. a payment store in ldk-node) can piggyback on the same atomic commit, keeping application data consistent with LDK's internal state without additional coordination.
Trade-offs
Serializing all updates into a single system delta means that channels which could in theory operate completely independently (e.g. two unrelated forwards touching four different channels) are funneled through one write. At very large scale, this could become a bottleneck compared to a model where each channel or channel pair persists independently. In practice, we are likely far from the point where this matters, and there are other scaling paths such as running multiple nodes (though that is less capital efficient). The simplicity gained by having a single, consistent persistence model is worth a lot: it eliminates entire classes of bugs and makes the system much easier to reason about.
Proof of concept
An early proof of concept is available at https://github.com/joostjager/rust-lightning/pull/new/one-shot-persist. It is incomplete and not production-ready, but demonstrates the core ideas described above.