
[NEW] A new SAVE path for Valkey: DollySave - up to 35× faster, no COW, OOM-safe #3609


TL;DR: BGSAVE takes ~18 minutes and can OOM on a 300 GB dataset; DollySave completes in ~30 seconds and never OOMs.

BGSAVE has served Valkey well, but on multi-hundred-GB instances it has several issues: it takes a long time, its memory overhead grows with the write rate (copy-on-write), and in the worst case it runs the host out of memory. The duration of SAVE is critical for operational workflows such as scale-up, increasing the number of replicas, or recovering after a connection loss. In these scenarios, replicas must become active quickly; otherwise, read traffic is redirected to other nodes and can overwhelm them, leading to cascading performance degradation.

We prototyped DollySave, a drop-in alternative SAVE path. On a 300 GB dataset it is up to 35× faster, uses a constant, tiny amount of extra memory regardless of write pressure (instead of scaling with the write rate the way BGSAVE does), shortens the process freeze by up to ~6.6 s, recovers p50/p99 in ~3–4 s instead of ~19–49 s, and most importantly never OOMs, even under heavy write traffic that kills BGSAVE today.

The Valkey-side change required to support this is tiny: one new admin command, no changes to the existing BGSAVE path. PR link: #3608

The rest of this issue walks through the benchmarks scenario by scenario, building up from a quiet server to a write-heavy one, so you can see exactly where BGSAVE falls over and where DollySave holds up.


Setup

  • Instance: r7g.16xlarge (Graviton3, 64 vCPU, 512 GB RAM)
  • Dataset: 4,500,000 keys × 512 bytes ≈ 300 GB

Scenario 1 - idle server: how much does SAVE cost when nothing is happening?

Before we pile on traffic, let's ask the simplest question: if the server is doing nothing at all, how expensive is it just to take a snapshot?

| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 1102.31 s | 31.24 s (35.3× faster) |
| Extra memory during SAVE | 2.28 GB (COW) | 37 MB |
| Process freeze time | 4207 ms | 233 ms (−3974 ms) |

On an idle 300 GB server, BGSAVE takes over 18 minutes. DollySave finishes in 31 seconds. The process-freeze interval, the window during which Valkey cannot serve any command, drops from 4.2 s to 233 ms.

No traffic, no writes, no contention. This is the best case for BGSAVE and it's still 35× slower than DollySave.


Scenario 2 - add 400K read TPS: what happens under realistic reads?

Now let's put the server under a realistic read load. We keep writes light (1.5K SET/s) and push reads up to a steady 400,000 GET/s.

| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 1084 s | 32 s (33.9× faster) |
| Extra memory during SAVE | 24.0 GB (COW) | 97 MB |
| Process freeze time | 4173 ms | 1216 ms (−2957 ms) |

Duration is essentially unchanged from Scenario 1 even though we've added a large read load: only the 1,500 SET/s actually dirty pages, so BGSAVE's COW cost stays relatively manageable.

Look at the extra-memory row, though: BGSAVE has already jumped from 2.28 GB (Scenario 1) to 24 GB, and we haven't even turned writes up yet. DollySave went from 37 MB to 97 MB.

The more interesting result here is how long it takes p50 / p99 to recover after SAVE is called.

Latency recovery after SAVE is called

Both SAVE paths cause a latency disturbance to user read and write traffic; what we care about is how long each one takes to come back to steady state.

  • BGSAVE forks the process, and every subsequent write from the parent forces the kernel to copy a page out of shared memory. That copy-on-write traffic stays elevated throughout the SAVE.

  • DollySave asks the kernel to write-protect memory so it can track which pages get dirtied. When the application writes to a tracked page, the kernel just marks the page as dirty: no copies, no userspace fault handler.

So both series below start at t = 0 (when SAVE was called), with process-freeze samples excluded, and we measure how long p50 and p99 take to recover to steady state.

*Four series on a shared x-axis (seconds since SAVE was called); y-axis is log-scaled.*

Headlines from the data (with t = 0 = SAVE called, process-freeze samples excluded):

  • BGSAVE p99 enters the post-freeze window at ~5 ms and takes ~19 s to return to steady state.
  • BGSAVE p50 peaks at 4.53 ms, ~18 s to return.
  • DollySave p99 peaks at 2.66 ms, back to steady state in ~3 s.
  • DollySave p50 peaks at 2.17 ms, same ~3 s recovery.

For any service with an SLO on p99, the recovery window matters as much as the SAVE duration itself. A ~19-second window during which p99 stays elevated is a real, user-visible event.


Scenario 3 - add 100K write TPS: where BGSAVE breaks

Now we add the workload that breaks BGSAVE: 400K GET/s + 100K SET/s on the same 300 GB dataset.

This is the scenario where BGSAVE's fundamental design (fork() and let the kernel copy-on-write) stops being a tradeoff and starts being a liability.

Remember the setup: we're on a 512 GB machine using only 300 GB for Valkey data, which leaves ~200 GB of headroom. Under this workload, BGSAVE's COW cost alone consumed that entire ~200 GB of headroom before the snapshot could finish, at which point the host ran out of memory. The arithmetic checks out in the worst case: at 100K SET/s, each write can dirty a not-yet-copied 4 KiB page, so COW can duplicate up to ~0.4 GB/s, which is roughly 200 GB over a ~500 s run.

| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 527 s, then OOM crash | 32.99 s |
| Extra memory during SAVE | ~200 GB (COW), host OOM | 144 MB |
| Process freeze time | 8346 ms | 1736 ms (−6610 ms) |

DollySave used 144 MB here, in the same ballpark as its idle-server number (37 MB) and its read-heavy number (97 MB). The extra-memory cost does not scale with the write rate. It completed the snapshot in 33 seconds and never came close to OOM.

Latency recovery after SAVE is called (same story, bigger gap)

Same four-series picture as Scenario 2: both series start at t = 0 (when SAVE was called), process-freeze samples excluded, and we measure how long p50 and p99 take to recover to steady state.

*Same four series as Scenario 2, under heavier write load. DollySave (blue) drops back to steady state after ~4 s; BGSAVE (red) takes ~49 s. Y-axis is log-scaled.*

Recovery time (with t = 0 = SAVE called, process-freeze samples excluded):

  • BGSAVE p99 enters the post-freeze window at ~6.6 ms and takes ~49 s to return to steady state.
  • BGSAVE p50 takes ~48 s to recover.
  • DollySave p99 is back to steady state in ~4 s.
  • DollySave p50 is back to steady state in ~4 s.

Scenario 4 - single thread, small box: is DollySave just "parallel"?

A fair skeptic at this point says: "You're beating BGSAVE because you've thrown more threads at it. Hold parallelism constant and the advantage disappears."

So we ran Scenario 4 on a smaller r7g.xlarge with a single sender thread and 220 million small keys (50 B each) - only 23 GB, and a deliberately different data shape.

| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 149.68 s | 23.16 s (6.46× faster) |
| Extra memory during SAVE | 149 MB | 9 MB |
| Process freeze time | 329 ms | 85 ms (−244 ms) |

Even single-threaded, with none of the parallel-sender machinery engaged, DollySave is 6.5× faster than BGSAVE. The advantage isn't coming only from threads; it also comes from skipping serialization entirely: DollySave streams raw memory, which removes a major CPU and latency component inherent in RDB generation. We have not tested complex objects, which may widen the duration gap even further.


Summary across all four scenarios

| Scenario | Duration (BGSAVE → Dolly) | Extra RSS (BGSAVE → Dolly) | Freeze (BGSAVE → Dolly) |
| --- | --- | --- | --- |
| 1. Idle | 1102 s → 31 s (35×) | 2.28 GB → 37 MB | 4207 ms → 233 ms |
| 2. 400K GET + 1.5K SET | 1084 s → 32 s (34×) | 24.0 GB → 97 MB | 4173 ms → 1216 ms |
| 3. 400K GET + 100K SET | 527 s + OOM → 33 s | ~200 GB + OOM → 144 MB | 8346 ms → 1736 ms |
| 4. Single-thread (small instance) | 150 s → 23 s (6.5×) | 149 MB → 9 MB | 329 ms → 85 ms |

What this costs Valkey to support

Almost nothing. One new admin command, CLEAN_STATE_FOR_DOLLY_SAVE (~240 LoC), run once on the restored target to clear source-host identity (runid, cluster node-id, peers, epochs) while preserving replication state so the target can partial-resync (PSYNC) against its new primary. The BGSAVE/fork() path is completely untouched.

Full PR (draft): #3608

Scope disclaimer. DollySave has only been tested in a standalone Valkey deployment so far; the focus of the work to date has been on the CRIU-side dump/restore logic, not on Valkey-side cluster behaviour. The PR linked above is a draft intended to scope the Valkey-side change and start the conversation; once the community aligns on direction, we'll expand testing to cover the cluster scenarios (failover, slot migration, multi-shard PSYNC, etc.) and add corresponding tests.


How it works

DollySave does not fork() the process. Instead, it treats a snapshot as a live process migration: the process keeps running and serving traffic while its memory is streamed out; only a brief final pass runs while the process is frozen.

Three moves, at a high level:

  1. Track writes without copying. Ask the kernel to write-protect the process's memory (via UFFD_FEATURE_WP_ASYNC) so we can later ask which pages got dirtied. When the process writes to a tracked page, the kernel just marks it as dirty: no copies, no userspace fault handler, no COW page duplication. Dirty bits are read back in bulk via PAGEMAP_SCAN (see the sketch after this list).
  2. Stream memory while the process runs. Parallel workers copy memory out, compressed. Pages that get re-dirtied are simply re-sent and overwrite the older copy on the receiving side. The process keeps serving reads and writes the whole time.
  3. Brief final freeze. Once the dirty set has converged, freeze the process just long enough to capture the last-moment dirty pages and process-tree metadata, then unfreeze. This is the only part of SAVE during which Valkey cannot serve commands, and it's measured in hundreds of milliseconds to ~1.7 s in our tests, instead of the multi-second freezes BGSAVE produces.
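To make move 1 concrete, here is a minimal sketch of the two kernel interfaces involved, assuming Linux >= 6.7 (where UFFD_FEATURE_WP_ASYNC and PAGEMAP_SCAN landed). This illustrates the APIs only; it is not the CRIU code itself, and error handling plus the actual page transfer are omitted:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/fs.h>           /* PAGEMAP_SCAN, struct pm_scan_arg */
#include <linux/userfaultfd.h>

/* Move 1: arm async write-protect tracking on [addr, addr+len). */
static int track_writes(void *addr, unsigned long len)
{
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    /* WP_ASYNC: the kernel resolves each write fault itself and just
     * records a dirty bit -- no copies, no userspace round-trip. */
    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_WP_ASYNC };
    ioctl(uffd, UFFDIO_API, &api);

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    return uffd;
}

/* Read dirty ranges back in bulk. pagemap_fd is the process's
 * /proc/<pid>/pagemap, opened O_RDONLY. */
static void scan_dirty(int pagemap_fd, void *addr, unsigned long len)
{
    struct page_region regions[256];
    struct pm_scan_arg scan = {
        .size          = sizeof(scan),
        .flags         = PM_SCAN_WP_MATCHING,  /* re-protect as we scan */
        .start         = (unsigned long)addr,
        .end           = (unsigned long)addr + len,
        .vec           = (unsigned long)regions,
        .vec_len       = 256,
        .category_mask = PAGE_IS_WRITTEN,      /* only dirtied pages */
        .return_mask   = PAGE_IS_WRITTEN,
    };
    long n = ioctl(pagemap_fd, PAGEMAP_SCAN, &scan);
    for (long i = 0; i < n; i++)    /* these are the pages to (re)send */
        printf("dirty: %llx-%llx\n",
               (unsigned long long)regions[i].start,
               (unsigned long long)regions[i].end);
}
```

Because PM_SCAN_WP_MATCHING re-arms write protection on the pages it reports, each scan naturally returns only the pages dirtied since the previous scan, which is exactly the convergence loop in moves 2–3.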

A second, freeze-free variant exists. We also implemented a synchronous WP mode (UFFD_FEATURE_WP, not WP_ASYNC) in which the kernel delivers a userfaultfd event on every write and the dump path ships each page as it is written. That variant has no final freeze at all - but the per-write userspace round-trip imposes a real p50/p99 tax on the application. We chose the async/final-freeze design for the results shown above because it gives better application latency during SAVE. The sync variant is still in the tree and may be a better fit for workloads that cannot tolerate any freeze, at the cost of elevated tail latency throughout SAVE.
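For contrast, here is a minimal sketch of what the synchronous variant's fault loop looks like. ship_page() is a hypothetical hook standing in for the dump path's page transfer, and uffd is assumed to be registered with UFFDIO_REGISTER_MODE_WP (without WP_ASYNC):

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical hook: stream one page to the receiver. */
extern void ship_page(unsigned long addr, size_t len);

static void sync_wp_loop(int uffd, size_t page_size)
{
    struct uffd_msg msg;

    /* Every write to a protected page parks the writer until we act. */
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT ||
            !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            continue;

        unsigned long page = msg.arg.pagefault.address & ~(page_size - 1);
        ship_page(page, page_size);   /* send the page as it is written */

        /* Un-protect the page so the faulting thread can resume; this
         * per-write round-trip is the p50/p99 tax mentioned above. */
        struct uffdio_writeprotect wp = {
            .range = { .start = page, .len = page_size },
            .mode  = 0,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }
}
```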

The heavy lifting (write-protect tracking, parallel transfer, restore coordination) lives in an experimental fork of CRIU. From Valkey's perspective, the entire dump mechanism is external, which is why the Valkey-side change is just the one admin command above.

Full design docs:


Happy to share the full benchmark harness and raw traces in follow-up comments if there's interest.

FAQ

Q: Doesn’t DollySave transfer more data than BGSAVE?

A: It depends on the workload.

  • When the write rate is near zero, this can be true. DollySave transfers the full memory image, including structures like copy-on-write buffers, replication backlog, and metadata. This can be optimized further (e.g., by excluding specific memory addresses from transfer).
  • Under realistic or high write rates, the situation reverses. BGSAVE runs for much longer, so it ends up copying a large number of stale pages that are no longer relevant. Because DollySave completes much faster, it typically transfers less total data in practice.

Q: What about compression efficiency?

A: DollySave achieves better compression ratios.

  • DollySave compresses large contiguous memory chunks (e.g., ~1 MB).
  • BGSAVE compresses individual serialized values, which limits compression efficiency.

This results in higher compression ratios and better throughput for DollySave.
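To see why granularity matters, here's a toy experiment. LZ4 is an assumed stand-in (the issue doesn't name the actual codec): we compress the same 1 MB of mildly redundant data once as a whole chunk and once as independent 512 B values.

```c
#include <stdio.h>
#include <stdlib.h>
#include <lz4.h>        /* assumed codec; any block compressor shows the effect */

#define TOTAL (1 << 20) /* ~1 MB contiguous chunk, as in the DollySave path */
#define VAL   512       /* per-value size, as in the benchmark dataset */

int main(void)
{
    char *src = malloc(TOTAL);
    char *dst = malloc(LZ4_compressBound(TOTAL));

    /* Mildly redundant payload, standing in for real key/value data. */
    for (int i = 0; i < TOTAL; i++)
        src[i] = "user:profile:"[i % 13];

    /* One big chunk: the compressor sees a large shared context. */
    int chunk = LZ4_compress_default(src, dst, TOTAL,
                                     LZ4_compressBound(TOTAL));

    /* Value by value: each 512 B piece is compressed with no context. */
    int per_value = 0;
    for (int off = 0; off < TOTAL; off += VAL)
        per_value += LZ4_compress_default(src + off, dst, VAL,
                                          LZ4_compressBound(TOTAL));

    printf("1 MB chunk: %d B compressed; 512 B values: %d B total\n",
           chunk, per_value);
    free(src);
    free(dst);
    return 0;
}
```

On redundant data the whole-chunk pass compresses far better, because matches can reference a much larger window than a single 512 B value.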


Q: What about threads, mutexes, and file descriptors?

A: These are handled by **CRIU**.

CRIU captures and restores the entire process state, including:

  • threads
  • mutexes
  • file descriptors
  • sockets

We validated this with a multi-threaded Valkey setup (16 IO threads) under ~500K TPS, and it behaved correctly.


Q: Can DollySave be used for version upgrades?

A: Not directly.

DollySave transfers a live process image, so it requires identical binaries and is not suitable for upgrading versions.

A possible workflow is a two-phase approach:

  1. Use DollySave to migrate to a temporary machine (fast, low latency impact).
  2. Run BGSAVE from that machine to a new instance running the upgraded version.

This approach:

  • improves user experience (shorter impact window),
  • allows using a larger temporary machine to accelerate the upgrade.

However, it introduces control-plane complexity, so it may not fit all environments.


Q: What about munmap / memory unmapping events?

A: These are handled transparently by CRIU.

CRIU receives notifications from the kernel and correctly tracks memory unmap/remap events during the process.


Q: Can DollySave be used for slot migration?

A: Potentially yes, but it requires additional Valkey changes.

One possible approach:

  • Organize keyspaces so that different slot ranges map to separate VMAs.
  • Use DollySave to transfer only the relevant VMA(s).

This could enable very fast slot migration, but:

  • requires tighter integration with Valkey memory layout,
  • adds complexity,
  • and needs further design discussion.

Q: Can DollySave be used for snapshots?

A: Yes, but the format is not RDB.

DollySave produces a process-level snapshot rather than a serialized RDB file.

  • Restore is typically much faster, since it involves loading a process image instead of parsing and rebuilding data structures.
  • In our tests, recovery of a 300 GB dataset took ~20 seconds (assuming sufficiently fast storage).

This makes it well-suited for fast recovery scenarios, though it differs from traditional RDB-based workflows.


Q: Why not just improve BGSAVE?

A: Because the core limitations are fundamental to its design.

BGSAVE relies on fork() and kernel copy-on-write:

  • Memory overhead scales with the write rate (COW amplification).
  • Snapshot duration scales with dataset size and serialization cost.
  • High write pressure can lead to unbounded memory growth and OOM.

These are not easy to fix incrementally—they stem from:

  • duplicating memory via COW,
  • and serializing objects into RDB format.

DollySave takes a different approach entirely:

  • no fork()
  • no object serialization
  • no COW amplification

Instead, it streams memory directly while tracking dirtied pages, which is why it can:

  • complete much faster
  • use predictable memory
  • and remain stable under heavy write workloads

Acknowledgment

This issue is based on work and discussions with @avifenesh and @mo-amzn .
Thanks for the help with the valuable POCs and experiments that improved the solution significantly.
