
[NEW] A new SAVE path for Valkey: DollySave - up to 35× faster, no COW, OOM-safe #3609


TL;DR: BGSAVE takes ~18 minutes and can OOM on a 300 GB dataset; DollySave completes in ~30 seconds and never OOMs.

BGSAVE has served Valkey well, but on multi-hundred-GB instances it has several issues: it takes a long time, its memory overhead grows with the write rate (copy-on-write), and in the worst case it runs the host out of memory. The duration of SAVE is critical for operational workflows such as scale-up, increasing the number of replicas, or recovering after a connection loss. In these scenarios, replicas must become active quickly; otherwise, read traffic is redirected to other nodes and can overwhelm them, leading to cascading performance degradation.

We prototyped DollySave, a drop-in alternative SAVE path. On a 300 GB dataset it is up to 35× faster, uses a constant, tiny amount of extra memory regardless of write pressure (instead of scaling with the write rate the way BGSAVE does), shortens the process freeze by up to ~6.6 s, recovers p50/p99 in ~3–4 s instead of ~19–49 s, and most importantly never OOMs, even under heavy write traffic that kills BGSAVE today.

The Valkey-side change required to support this is tiny: one new admin command, no changes to the existing BGSAVE path. PR link: #3608

The rest of this issue walks through the benchmarks scenario by scenario, building up from a quiet server to a write-heavy one, so you can see exactly where BGSAVE falls over and where DollySave holds up.


Setup

  • Instance: r7g.16xlarge (Graviton3, 64 vCPU, 512 GB RAM)
  • Dataset: 4,500,000 keys × 512 bytes ≈ 300 GB

Scenario 1 - idle server: how much does SAVE cost when nothing is happening?

Before we pile on traffic, let's ask the simplest question: if the server is doing nothing at all, how expensive is it just to take a snapshot?

| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 1102.31 s | 31.24 s (35.3× faster) |
| Extra memory during SAVE | 2.28 GB (COW) | 37 MB |
| Process freeze time | 4207 ms | 233 ms (−3974 ms) |

On an idle 300 GB server, BGSAVE takes over 18 minutes. DollySave finishes in 31 seconds. The process-freeze interval, the window during which Valkey cannot serve any command, drops from 4.2 s to 233 ms.

No traffic, no writes, no contention. This is the best case for BGSAVE and it's still 35× slower than DollySave.


Scenario 2 - add 400K read TPS: what happens under realistic reads?

Now let's put the server under a realistic read load. We keep writes light (1.5K SET/s) and push reads up to a steady 400,000 GET/s.

| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 1084 s | 32 s (33.9× faster) |
| Extra memory during SAVE | 24.0 GB (COW) | 97 MB |
| Process freeze time | 4173 ms | 1216 ms (−2957 ms) |

Duration is essentially unchanged from Scenario 1 even though we've added a large read load: only the 1,500 SET/s actually dirty pages, so BGSAVE's COW cost stays relatively manageable.

Look at the extra-memory row, though: BGSAVE has already jumped from 2.28 GB (Scenario 1) to 24 GB, and we haven't even turned writes up yet. DollySave went from 37 MB to 97 MB.

The more interesting result here is how long it takes p50 / p99 to recover after SAVE is called.

Latency recovery after SAVE is called

Both SAVE paths cause a latency disturbance to user read and write traffic; what we care about is how long each one takes to come back to steady state.

  • BGSAVE forks the process, and every subsequent write from the parent forces the kernel to copy a page out of shared memory. That copy-on-write traffic stays elevated throughout the SAVE.

  • DollySave asks the kernel to write-protect memory so it can track which pages get dirtied. When the application writes to a tracked page, the kernel just marks the page as dirty: no copies, no userspace fault handler.

So both series below start at t = 0 (when SAVE was called), with process-freeze samples excluded, and we measure how long p50 and p99 take to recover to steady state.

*Four series on a shared x-axis (seconds since SAVE was called); y-axis is log-scaled.*

Headlines from the data (with t = 0 = SAVE called, process-freeze samples excluded):

  • BGSAVE p99 enters the post-freeze window at ~5 ms and takes ~19 s to return to steady state.
  • BGSAVE p50 peaks at 4.53 ms, ~18 s to return.
  • DollySave p99 peaks at 2.66 ms, back to steady state in ~3 s.
  • DollySave p50 peaks at 2.17 ms, same ~3 s recovery.

For any service with an SLO on p99, the recovery window matters as much as the SAVE duration itself. A ~19-second window during which p99 stays elevated is a real, user-visible event.


Scenario 3 - add 100K write TPS: where BGSAVE breaks

Now we add the workload that breaks BGSAVE: 400K GET/s + 100K SET/s on the same 300 GB dataset.

This is the scenario where BGSAVE's fundamental design (fork() and let the kernel copy-on-write) stops being a tradeoff and starts being a liability.

Remember the setup: we're on a 512 GB machine using only 300 GB for Valkey data, which leaves ~200 GB of headroom. Under this workload, BGSAVE's COW cost alone consumed that entire ~200 GB of headroom before the snapshot could finish, at which point the host ran out of memory. The arithmetic checks out in the worst case: at 100K SET/s, each write can dirty a not-yet-copied 4 KiB page, so COW can duplicate up to ~0.4 GB/s, which is roughly 200 GB over a ~500 s run.

| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 527 s, then OOM crash | 32.99 s |
| Extra memory during SAVE | ~200 GB (COW), host OOM | 144 MB |
| Process freeze time | 8346 ms | 1736 ms (−6610 ms) |

DollySave used 144 MB here, in the same ballpark as its idle-server number (37 MB) and its read-heavy number (97 MB). The extra-memory cost does not scale with the write rate. It completed the snapshot in 33 seconds and never came close to OOM.

Latency recovery after SAVE is called (same story, bigger gap)

Same four-series picture as Scenario 2: both series start at t = 0 (when SAVE was called), process-freeze samples excluded, and we measure how long p50 and p99 take to recover to steady state.

*Same four series as Scenario 2, under heavier write load. DollySave (blue) drops back to steady state after ~4 s; BGSAVE (red) takes ~49 s. Y-axis is log-scaled.*

Recovery time (with t = 0 = SAVE called, process-freeze samples excluded):

  • BGSAVE p99 enters the post-freeze window at ~6.6 ms and takes ~49 s to return to steady state.
  • BGSAVE p50 takes ~48 s to recover.
  • DollySave p99 is back to steady state in ~4 s.
  • DollySave p50 is back to steady state in ~4 s.

Scenario 4 - single thread, small box: is DollySave just "parallel"?

A fair skeptic at this point says: "You're beating BGSAVE because you've thrown more threads at it. Hold parallelism constant and the advantage disappears."

So we ran Scenario 4 on a smaller r7g.xlarge with a single sender thread and 220 million small keys (50 B each) - only 23 GB, and a deliberately different data shape.

| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 149.68 s | 23.16 s (6.46× faster) |
| Extra memory during SAVE | 149 MB | 9 MB |
| Process freeze time | 329 ms | 85 ms (−244 ms) |

Even single-threaded, with none of the parallel-sender machinery engaged, DollySave is 6.5× faster than BGSAVE. The advantage isn't coming only from threads; it also comes from skipping serialization entirely: DollySave streams raw memory, which removes a major CPU and latency component inherent in RDB generation. We have not tested complex objects, which may widen the duration gap even further.


Summary across all four scenarios

| Scenario | Duration (BGSAVE → Dolly) | Extra RSS (BGSAVE → Dolly) | Freeze (BGSAVE → Dolly) |
| --- | --- | --- | --- |
| 1. Idle | 1102 s → 31 s (35×) | 2.28 GB → 37 MB | 4207 ms → 233 ms |
| 2. 400K GET + 1.5K SET | 1084 s → 32 s (34×) | 24.0 GB → 97 MB | 4173 ms → 1216 ms |
| 3. 400K GET + 100K SET | 527 s + OOM → 33 s | ~200 GB + OOM → 144 MB | 8346 ms → 1736 ms |
| 4. Single-thread (small instance) | 150 s → 23 s (6.5×) | 149 MB → 9 MB | 329 ms → 85 ms |

What this costs Valkey to support

Almost nothing. One new admin command, CLEAN_STATE_FOR_DOLLY_SAVE (~240 LoC), run once on the restored target to clear source-host identity (runid, cluster node-id, peers, epochs) while preserving replication state so the target can partial-resync (PSYNC) against its new primary. The BGSAVE/fork() path is completely untouched.

Full PR (draft): #3608

Scope disclaimer. DollySave has only been tested in a standalone Valkey deployment so far; the focus of the work to date has been on the CRIU-side dump/restore logic, not on Valkey-side cluster behaviour. The PR linked above is a draft intended to scope the Valkey-side change and start the conversation; once the community aligns on direction, we'll expand testing to cover the cluster scenarios (failover, slot migration, multi-shard PSYNC, etc.) and add corresponding tests.


How it works

DollySave does not fork() the process. Instead, it treats a snapshot as a live process migration: the process keeps running and serving traffic while its memory is streamed out; only a brief final pass runs while the process is frozen.

Three moves, at a high level:

  1. Track writes without copying. Ask the kernel to write-protect the process's memory (via UFFD_FEATURE_WP_ASYNC) so we can later ask which pages got dirtied. When the process writes to a tracked page, the kernel just marks it as dirty: no copies, no userspace fault handler, no COW page duplication. Dirty bits are read back in bulk via PAGEMAP_SCAN (see the sketch after this list).
  2. Stream memory while the process runs. Parallel workers copy memory out, compressed. Pages that get re-dirtied are simply re-sent and overwrite the older copy on the receiving side. The process keeps serving reads and writes the whole time.
  3. Brief final freeze. Once the dirty set has converged, freeze the process just long enough to capture the last-moment dirty pages and process-tree metadata, then unfreeze. This is the only part of SAVE during which Valkey cannot serve commands, and it's measured in hundreds of milliseconds to ~1.7 s in our tests, instead of the multi-second freezes BGSAVE produces.
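To make move 1 concrete, here is a minimal sketch of the two kernel interfaces involved, assuming Linux >= 6.7 (where UFFD_FEATURE_WP_ASYNC and PAGEMAP_SCAN landed). This illustrates the APIs only; it is not the CRIU code itself, and error handling plus the actual page transfer are omitted:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/fs.h>           /* PAGEMAP_SCAN, struct pm_scan_arg */
#include <linux/userfaultfd.h>

/* Move 1: arm async write-protect tracking on [addr, addr+len). */
static int track_writes(void *addr, unsigned long len)
{
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    /* WP_ASYNC: the kernel resolves each write fault itself and just
     * records a dirty bit -- no copies, no userspace round-trip. */
    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_WP_ASYNC };
    ioctl(uffd, UFFDIO_API, &api);

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    return uffd;
}

/* Read dirty ranges back in bulk. pagemap_fd is the process's
 * /proc/<pid>/pagemap, opened O_RDONLY. */
static void scan_dirty(int pagemap_fd, void *addr, unsigned long len)
{
    struct page_region regions[256];
    struct pm_scan_arg scan = {
        .size          = sizeof(scan),
        .flags         = PM_SCAN_WP_MATCHING,  /* re-protect as we scan */
        .start         = (unsigned long)addr,
        .end           = (unsigned long)addr + len,
        .vec           = (unsigned long)regions,
        .vec_len       = 256,
        .category_mask = PAGE_IS_WRITTEN,      /* only dirtied pages */
        .return_mask   = PAGE_IS_WRITTEN,
    };
    long n = ioctl(pagemap_fd, PAGEMAP_SCAN, &scan);
    for (long i = 0; i < n; i++)    /* these are the pages to (re)send */
        printf("dirty: %llx-%llx\n",
               (unsigned long long)regions[i].start,
               (unsigned long long)regions[i].end);
}
```

Because PM_SCAN_WP_MATCHING re-arms write protection on the pages it reports, each scan naturally returns only the pages dirtied since the previous scan, which is exactly the convergence loop in moves 2–3.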

A second, freeze-free variant exists. We also implemented a synchronous WP mode (UFFD_FEATURE_WP, not WP_ASYNC) in which the kernel delivers a userfaultfd event on every write and the dump path ships each page as it is written. That variant has no final freeze at all - but the per-write userspace round-trip imposes a real p50/p99 tax on the application. We chose the async/final-freeze design for the results shown above because it gives better application latency during SAVE. The sync variant is still in the tree and may be a better fit for workloads that cannot tolerate any freeze, at the cost of elevated tail latency throughout SAVE.
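For contrast, here is a minimal sketch of what the synchronous variant's fault loop looks like. ship_page() is a hypothetical hook standing in for the dump path's page transfer, and uffd is assumed to be registered with UFFDIO_REGISTER_MODE_WP (without WP_ASYNC):

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical hook: stream one page to the receiver. */
extern void ship_page(unsigned long addr, size_t len);

static void sync_wp_loop(int uffd, size_t page_size)
{
    struct uffd_msg msg;

    /* Every write to a protected page parks the writer until we act. */
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            continue;

        unsigned long page = msg.arg.pagefault.address & ~(page_size - 1);
        ship_page(page, page_size);   /* send the page as it is written */

        /* Un-protect the page so the faulting thread can resume; this
         * per-write round-trip is the p50/p99 tax mentioned above. */
        struct uffdio_writeprotect wp = {
            .range = { .start = page, .len = page_size },
            .mode  = 0,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }
}
```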

The heavy lifting (write-protect tracking, parallel transfer, restore coordination) lives in an experimental fork of CRIU. From Valkey's perspective, the entire dump mechanism is external, which is why the Valkey-side change is just the one admin command above.

Full design docs:


Happy to share the full benchmark harness and raw traces in follow-up comments if there's interest.

FAQ

Q: Doesn’t DollySave transfer more data than BGSAVE?

A: It depends on the workload.

  • When the write rate is near zero, this can be true. DollySave transfers the full memory image, including structures like copy-on-write buffers, replication backlog, and metadata. This can be optimized further (e.g., by excluding specific memory addresses from transfer).
  • Under realistic or high write rates, the situation reverses. BGSAVE runs for much longer, so it ends up copying a large number of stale pages that are no longer relevant. Because DollySave completes much faster, it typically transfers less total data in practice.

Q: What about compression efficiency?

A: DollySave achieves better compression ratios.

  • DollySave compresses large contiguous memory chunks (e.g., ~1 MB).
  • BGSAVE compresses individual serialized values, which limits compression efficiency.

This results in higher compression ratios and better throughput for DollySave.
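To see why granularity matters, here's a toy experiment. LZ4 is an assumed stand-in (the issue doesn't name the actual codec): we compress the same 1 MB of mildly redundant data once as a whole chunk and once as independent 512 B values.

```c
#include <stdio.h>
#include <stdlib.h>
#include <lz4.h>        /* assumed codec; any block compressor shows the effect */

#define TOTAL (1 << 20) /* ~1 MB contiguous chunk, as in the DollySave path */
#define VAL   512       /* per-value size, as in the benchmark dataset */

int main(void)
{
    char *src = malloc(TOTAL);
    char *dst = malloc(LZ4_compressBound(TOTAL));

    /* Mildly redundant payload, standing in for real key/value data. */
    for (int i = 0; i < TOTAL; i++)
        src[i] = "user:profile:"[i % 13];

    /* One big chunk: the compressor sees a large shared context. */
    int chunk = LZ4_compress_default(src, dst, TOTAL,
                                     LZ4_compressBound(TOTAL));

    /* Value by value: each 512 B piece is compressed with no context. */
    int per_value = 0;
    for (int off = 0; off < TOTAL; off += VAL)
        per_value += LZ4_compress_default(src + off, dst, VAL,
                                          LZ4_compressBound(TOTAL));

    printf("1 MB chunk: %d B compressed; 512 B values: %d B total\n",
           chunk, per_value);
    free(src);
    free(dst);
    return 0;
}
```

On redundant data the whole-chunk pass compresses far better, because matches can reference a much larger window than a single 512 B value.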


Q: What about threads, mutexes, and file descriptors?

A: These are handled by **CRIU**.

CRIU captures and restores the entire process state, including:

  • threads
  • mutexes
  • file descriptors
  • sockets

We validated this with a multi-threaded Valkey setup (16 IO threads) under ~500K TPS, and it behaved correctly.


Q: Can DollySave be used for version upgrades?

A: Not directly.

DollySave transfers a live process image, so it requires identical binaries and is not suitable for upgrading versions.

A possible workflow is a two-phase approach:

  1. Use DollySave to migrate to a temporary machine (fast, low latency impact).
  2. Run BGSAVE from that machine to a new instance running the upgraded version.

This approach:

  • improves user experience (shorter impact window),
  • allows using a larger temporary machine to accelerate the upgrade.

However, it introduces control-plane complexity, so it may not fit all environments.


Q: What about munmap / memory unmapping events?

A: These are handled transparently by CRIU.

CRIU receives notifications from the kernel and correctly tracks memory unmap/remap events during the process.


Q: Can DollySave be used for slot migration?

A: Potentially yes, but it requires additional Valkey changes.

One possible approach:

  • Organize keyspaces so that different slot ranges map to separate VMAs.
  • Use DollySave to transfer only the relevant VMA(s).

This could enable very fast slot migration, but:

  • requires tighter integration with Valkey memory layout,
  • adds complexity,
  • and needs further design discussion.

Q: Can DollySave be used for snapshots?

A: Yes, but the format is not RDB.

DollySave produces a process-level snapshot rather than a serialized RDB file.

  • Restore is typically much faster, since it involves loading a process image instead of parsing and rebuilding data structures.
  • In our tests, recovery of a 300 GB dataset took ~20 seconds (assuming sufficiently fast storage).

This makes it well-suited for fast recovery scenarios, though it differs from traditional RDB-based workflows.


Q: Why not just improve BGSAVE?

A: Because the core limitations are fundamental to its design.

BGSAVE relies on fork() and kernel copy-on-write:

  • Memory overhead scales with the write rate (COW amplification).
  • Snapshot duration scales with dataset size and serialization cost.
  • High write pressure can lead to unbounded memory growth and OOM.

These are not easy to fix incrementally—they stem from:

  • duplicating memory via COW,
  • and serializing objects into RDB format.

DollySave takes a different approach entirely:

  • no fork()
  • no object serialization
  • no COW amplification

Instead, it streams memory directly while tracking dirtied pages, which is why it can:

  • complete much faster
  • use predictable memory
  • and remain stable under heavy write workloads

Acknowledgment

This issue is based on work and discussions with @avifenesh and @mo-amzn .
Thanks for the help with the valuable POCs and experiments that improved the solution significantly.
