TL;DR: BGSAVE takes ~18 minutes and can OOM on a 300 GB dataset; DollySave completes in ~30 seconds and never OOMs.
BGSAVE has served Valkey well, but on multi-hundred-GB instances it has several issues: it takes a long time, its memory overhead grows with the write rate (copy-on-write), and in the worst case it runs the host out of memory. The duration of SAVE is critical for operational workflows such as scale-up, increasing the number of replicas, or recovering after a connection loss. In these scenarios, replicas must become active quickly; otherwise, read traffic is redirected to other nodes and can overwhelm them, leading to cascading performance degradation.
We prototyped DollySave, a drop-in alternative SAVE path. On a 300 GB dataset it is up to 35× faster, uses a constant, tiny amount of extra memory regardless of write pressure (instead of scaling with the write rate the way BGSAVE does), shortens the process freeze by up to ~6.6 s, recovers p50/p99 in ~3–4 s instead of ~19–49 s, and, most importantly, never OOMs, even under heavy write traffic that kills BGSAVE today.
The Valkey-side change required to support this is tiny: one new admin command, no changes to the existing BGSAVE path. PR link: #3608
The rest of this issue walks through the benchmarks scenario by scenario, building up from a quiet server to a write-heavy one, so you can see exactly where BGSAVE falls over and where DollySave holds up.
Setup
- Instance: r7g.16xlarge (Graviton3, 64 vCPU, 512 GB RAM)
- Dataset: 4,500,000 keys × 512 bytes ≈ 300 GB
Scenario 1 - idle server: how much does SAVE cost when nothing is happening?
Before we pile on traffic, let's ask the simplest question: if the server is doing nothing at all, how expensive is it just to take a snapshot?
| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 1102.31 s | 31.24 s (35.3× faster) |
| Extra memory during SAVE | 2.28 GB (COW) | 37 MB |
| Process freeze time | 4207 ms | 233 ms (−3974 ms) |
On an idle 300 GB server, BGSAVE takes over 18 minutes. DollySave finishes in 31 seconds. The process-freeze interval, the window during which Valkey cannot serve any command, drops from 4.2 s to 233 ms.
No traffic, no writes, no contention. This is the best case for BGSAVE, and it's still 35× slower than DollySave.
Scenario 2 - add 400K read TPS: what happens under realistic reads?
Now let's put the server under a realistic read load. We keep writes light (1.5K SET/s) and push reads up to a steady 400,000 GET/s.
| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 1084 s | 32 s (33.9× faster) |
| Extra memory during SAVE | 24.0 GB (COW) | 97 MB |
| Process freeze time | 4173 ms | 1216 ms (−2957 ms) |
Duration is essentially unchanged from Scenario 1 even though we've added a large read load: only the 1,500 SET/s actually dirty pages, so BGSAVE's COW cost stays relatively manageable.
Look at the extra-memory row, though: BGSAVE has already jumped from 2.28 GB (Scenario 1) to 24 GB, and we haven't even turned writes up yet. DollySave went from 37 MB to 97 MB.
The more interesting result here is how long it takes p50 / p99 to recover after SAVE is called.
Latency recovery after SAVE is called
Both SAVE paths cause a latency disturbance to user read and write traffic; we care about how long each one takes to come back to steady state.
- BGSAVE forks the process, and every subsequent write from the parent forces the kernel to copy a page out of shared memory. That copy-on-write traffic stays elevated throughout the SAVE.
- DollySave asks the kernel to write-protect memory so it can track which pages get dirtied. When the application writes to a tracked page, the kernel just marks the page as dirty: no copies, no userspace fault handler.
Both series below start at t = 0, immediately after the freeze window, and we measure how long p50 and p99 take to recover to steady state.
*Four series on a shared x-axis (seconds since SAVE was called); y-axis is log-scaled.*
Headlines from the data (t = 0 when SAVE is called; process-freeze samples excluded):
- BGSAVE p99 enters the post-freeze window at ~5 ms and takes ~19 s to return to steady state.
- BGSAVE p50 peaks at 4.53 ms and takes ~18 s to return.
- DollySave p99 peaks at 2.66 ms, back to steady state in ~3 s.
- DollySave p50 peaks at 2.17 ms, same ~3 s recovery.
For any service with an SLO on p99, the recovery window matters as much as the SAVE duration itself. A ~19-second window during which p99 stays elevated is a real, user-visible event.
Scenario 3 - add 100K write TPS: what happens under heavy writes?
Now we add the workload that breaks BGSAVE: 400K GET/s + 100K SET/s on the same 300 GB dataset.
This is the scenario where BGSAVE's fundamental design (fork() and let the kernel copy-on-write) stops being a tradeoff and starts being a liability.
Remember the setup: we're on a 512 GB machine using only 300 GB for Valkey data, which leaves ~200 GB of headroom. Under this workload, BGSAVE's COW cost alone consumed that entire ~200 GB of headroom before the snapshot could finish, at which point the host ran out of memory.
| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 527 s, then OOM crash | 32.99 s |
| Extra memory during SAVE | ~200 GB (COW), host OOM | 144 MB |
| Process freeze time | 8346 ms | 1736 ms (−6610 ms) |
DollySave used 144 MB here, in the same ballpark as its idle-server number (37 MB) and its read-heavy number (97 MB). The extra-memory cost does not scale with the write rate. It completed the snapshot in 33 seconds and never came close to OOM.
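A rough model shows why BGSAVE's overhead scales this way: if SET operations land on uniformly random keys, the expected number of unique pages dirtied (each one COW-copied exactly once) grows toward the whole dataset as the snapshot drags on. The sketch below is a back-of-the-envelope estimate, assuming 4 KiB pages, one page dirtied per write, and a uniform key distribution; real allocator behaviour touches extra metadata pages and only adds to this.

```python
import math

def expected_cow_bytes(dataset_bytes, writes_per_s, duration_s, page=4096):
    """Expected copy-on-write overhead if each write dirties one
    uniformly random page: a dirtied page is copied exactly once."""
    pages = dataset_bytes / page
    writes = writes_per_s * duration_s
    # P(a given page is never touched by `writes` random writes) ~= exp(-writes / pages)
    unique_dirty = pages * (1 - math.exp(-writes / pages))
    return unique_dirty * page

GiB = 1024 ** 3
# Scenario 3: 300 GB dataset, 100K SET/s, BGSAVE ran ~527 s before the OOM
cow = expected_cow_bytes(300 * GiB, 100_000, 527)
print(int(cow / GiB))  # ~146 GiB of COW copies from dirty data pages alone
```

Even this optimistic toy model eats most of the ~200 GB headroom; the measured cost is higher still because of allocator metadata and non-uniform access patterns. DollySave never duplicates a page, which is why its overhead stays flat.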
Latency recovery after SAVE is called (same story, bigger gap)
Same four-series picture as Scenario 2: both series start at t = 0, immediately after the freeze window, with process-freeze samples excluded, and we measure how long p50 and p99 take to recover to steady state.
*Same four series as Scenario 2, under heavier write load. DollySave (blue) drops back to steady state after ~4 s; BGSAVE (red) takes ~49 s. Y-axis is log-scaled.*
Recovery time (t = 0 when SAVE is called; process-freeze samples excluded):
- BGSAVE p99 enters the post-freeze window at ~6.6 ms and takes ~49 s to return to steady state.
- BGSAVE p50 takes ~48 s to recover.
- DollySave p99 is back to steady state in ~4 s.
- DollySave p50 is back to steady state in ~4 s.
Scenario 4 - single thread, small box: is DollySave just "parallel"?
A fair skeptic at this point says: "You're beating BGSAVE because you've thrown more threads at it. Hold parallelism constant and the advantage disappears."
So we ran Scenario 4 on a smaller r7g.xlarge with a single sender thread and 220 million small keys (50 B each): only 23 GB, and a different data shape.
| | BGSAVE | DollySave |
| --- | --- | --- |
| Duration | 149.68 s | 23.16 s (6.46× faster) |
| Extra memory during SAVE | 149 MB | 9 MB |
| Process freeze time | 329 ms | 85 ms (−244 ms) |
Even single-threaded, with none of the parallel-sender machinery engaged, DollySave is 6.5× faster than BGSAVE. The advantage isn't coming only from threads; it also comes from sending raw memory with no serialization. DollySave avoids object serialization entirely by streaming raw memory, which removes a major CPU and latency component inherent in RDB generation. We have not tested complex objects, which may widen the duration gap even further.
Summary across all four scenarios
| Scenario | Duration (BGSAVE → Dolly) | Extra RSS (BGSAVE → Dolly) | Freeze (BGSAVE → Dolly) |
| --- | --- | --- | --- |
| 1. Idle | 1102 s → 31 s (35×) | 2.28 GB → 37 MB | 4207 ms → 233 ms |
| 2. 400K GET + 1.5K SET | 1084 s → 32 s (34×) | 24.0 GB → 97 MB | 4173 ms → 1216 ms |
| 3. 400K GET + 100K SET | 527 s + OOM → 33 s | ~200 GB + OOM → 144 MB | 8346 ms → 1736 ms |
| 4. Single-thread (small instance) | 150 s → 23 s (6.5×) | 149 MB → 9 MB | 329 ms → 85 ms |
What this costs Valkey to support
Almost nothing. One new admin command, `CLEAN_STATE_FOR_DOLLY_SAVE` (~240 LoC), run once on the restored target to clear source-host identity (runid, cluster node-id, peers, epochs) while preserving replication state, so the target can perform a `PSYNC` partial resync against its new primary. The BGSAVE/`fork()` path is completely untouched.
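As a rough illustration of the split the command has to make (all field names below are hypothetical; the real change lives in the draft PR):

```python
import secrets

def clean_state_for_dolly_save(state):
    """Sketch: the restored clone stops impersonating the source host but
    keeps replication state so it can partial-resync against its new
    primary. Field names are illustrative, not Valkey's actual ones."""
    state = dict(state)                            # work on a copy
    state["runid"] = secrets.token_hex(20)         # fresh 40-char run id
    state["cluster_node_id"] = secrets.token_hex(20)
    state["cluster_peers"] = []                    # forget the source's peers
    state["config_epoch"] = 0
    # deliberately untouched: "replid" and "repl_offset", which PSYNC needs
    return state

src = {"runid": "aaa", "cluster_node_id": "bbb", "cluster_peers": ["n1"],
       "config_epoch": 7, "replid": "r1", "repl_offset": 12345}
tgt = clean_state_for_dolly_save(src)
```

The key design point is the asymmetry: identity is reset, replication lineage is preserved.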
Full PR (draft): #3608
Scope disclaimer. DollySave has only been tested in a standalone Valkey deployment so far; the focus of the work to date has been on the CRIU-side dump/restore logic, not on Valkey-side cluster behaviour. The PR linked above is a draft intended to scope the Valkey-side change and start the conversation; once the community aligns on direction, we'll expand testing to cover the cluster scenarios (failover, slot migration, multi-shard PSYNC, etc.) and add corresponding tests.
How it works
DollySave does not fork() the process. Instead, it treats a snapshot as a live process migration: the process keeps running and serving traffic while its memory is streamed out; only a brief final pass runs while the process is frozen.
Three moves, at a high level:
- Track writes without copying. Ask the kernel to write-protect the process's memory (via `UFFD_FEATURE_WP_ASYNC`) so we can later ask which pages got dirtied. When the process writes to a tracked page, the kernel just marks it as dirty: no copies, no userspace fault handler, no COW page duplication. Dirty bits are read back in bulk via `PAGEMAP_SCAN`.
- Stream memory while the process runs. Parallel workers copy memory out, compressed. Pages that get re-dirtied are simply re-sent and overwrite the older copy on the receiving side. The process keeps serving reads and writes the whole time.
- Brief final freeze. Once the dirty set has converged, freeze the process just long enough to capture the last-moment dirty pages and process-tree metadata, then unfreeze. This is the only part of SAVE during which Valkey cannot serve commands, and it's measured in hundreds of milliseconds to ~1.7 s in our tests, instead of the multi-second freezes BGSAVE produces.
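The first and third moves interact: as long as the transfer rate outpaces the dirty rate, each streaming round shrinks the remaining dirty set geometrically, which is what makes the final freeze short. A toy simulation of that convergence (this is not CRIU's actual policy; the page counts, rates, and freeze threshold are illustrative):

```python
def precopy(total_pages, dirty_pages_per_s, send_pages_per_s,
            freeze_threshold=50_000, max_rounds=10):
    """Iterative pre-copy: each round resends the pages dirtied during
    the previous round; freeze once the remainder is small enough."""
    to_send = total_pages  # round 0 streams everything
    rounds = 0
    while to_send > freeze_threshold and rounds < max_rounds:
        round_time = to_send / send_pages_per_s
        # pages re-dirtied while this round was streaming
        to_send = min(total_pages, int(dirty_pages_per_s * round_time))
        rounds += 1
    return rounds, to_send

# ~300 GB in 4 KiB pages, ~10 GiB/s effective send rate, 100K dirtied pages/s
rounds, final = precopy(78_643_200, 100_000, 2_621_440)
print(rounds, final)  # converges in 3 rounds; a few thousand pages (~MBs) remain
```

With these made-up numbers the dirty set drops from ~78M pages to a few thousand in three rounds, so the frozen final pass only has megabytes to copy.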
A second, freeze-free variant exists. We also implemented a synchronous WP mode (UFFD_FEATURE_WP, not WP_ASYNC) in which the kernel delivers a userfaultfd event on every write and the dump path ships each page as it is written. That variant has no final freeze at all - but the per-write userspace round-trip imposes a real p50/p99 tax on the application. We chose the async/final-freeze design for the results shown above because it gives better application latency during SAVE. The sync variant is still in the tree and may be a better fit for workloads that cannot tolerate any freeze, at the cost of elevated tail latency throughout SAVE.
The heavy lifting (write-protect tracking, parallel transfer, restore coordination) lives in an experimental fork of CRIU. From Valkey's perspective, the entire dump mechanism is external, which is why the Valkey-side change is just the one admin command above.
Full design docs:
Happy to share the full benchmark harness and raw traces in follow-up comments if there's interest.
FAQ
Q: Doesn’t DollySave transfer more data than BGSAVE?
A: It depends on the workload.
- When the write rate is near zero, this can be true. DollySave transfers the full memory image, including structures like copy-on-write buffers, replication backlog, and metadata. This can be optimized further (e.g., by excluding specific memory addresses from transfer).
- Under realistic or high write rates, the situation reverses. BGSAVE runs for much longer, so it ends up copying a large number of stale pages that are no longer relevant. Because DollySave completes much faster, it typically transfers less total data in practice.
Q: What about compression efficiency?
A: DollySave achieves better compression ratios.
- DollySave compresses large contiguous memory chunks (e.g., ~1 MB).
- BGSAVE compresses individual serialized values, which limits compression efficiency.

This results in higher compression ratios and better throughput for DollySave.
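The effect is easy to reproduce with any stream compressor: redundancy *between* similar values is only visible when they are compressed together. A small illustration with zlib (synthetic records; the actual codecs and ratios in Valkey differ, but the shape of the result holds):

```python
import zlib

# 2048 similar 512-byte records: the redundancy is mostly *across*
# values, which per-value compression cannot exploit.
values = [
    (b'{"user":%06d,"plan":"basic","region":"us-east-1"}' % i).ljust(512, b" ")
    for i in range(2048)
]

per_value = sum(len(zlib.compress(v)) for v in values)  # per-value, BGSAVE-style
chunked = len(zlib.compress(b"".join(values)))          # one ~1 MB chunk, DollySave-style
print(per_value > 3 * chunked)  # True: chunked compression wins by a wide margin
```

Per-value compression also pays a fixed header/checksum cost on every value, which chunked compression amortizes across the whole megabyte.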
Q: What about threads, mutexes, and file descriptors?
A: These are handled by **CRIU**.
CRIU captures and restores the entire process state, including:
- threads
- mutexes
- file descriptors
- sockets
We validated this with a multi-threaded Valkey setup (16 IO threads) under ~500K TPS, and it behaved correctly.
Q: Can DollySave be used for version upgrades?
A: Not directly.
DollySave transfers a live process image, so it requires identical binaries and is not suitable for upgrading versions.
A possible workflow is a two-phase approach:
- Use DollySave to migrate to a temporary machine (fast, low latency impact).
- Run BGSAVE from that machine to a new instance running the upgraded version.
This approach:
- improves user experience (shorter impact window),
- allows using a larger temporary machine to accelerate the upgrade.
However, it introduces control-plane complexity, so it may not fit all environments.
Q: What about munmap / memory unmapping events?
A: These are handled transparently by CRIU.
CRIU receives notifications from the kernel and correctly tracks memory unmap/remap events during the process.
Q: Can DollySave be used for slot migration?
A: Potentially yes, but it requires additional Valkey changes.
One possible approach:
- Organize keyspaces so that different slot ranges map to separate VMAs.
- Use DollySave to transfer only the relevant VMA(s).
This could enable very fast slot migration, but:
- requires tighter integration with Valkey memory layout,
- adds complexity,
- and needs further design discussion.
Q: Can DollySave be used for snapshots?
A: Yes, but the format is not RDB.
DollySave produces a process-level snapshot rather than a serialized RDB file.
- Restore is typically much faster, since it involves loading a process image instead of parsing and rebuilding data structures.
- In our tests, recovery of a 300 GB dataset took ~20 seconds (assuming sufficiently fast storage).
This makes it well-suited for fast recovery scenarios, though it differs from traditional RDB-based workflows.
Q: Why not just improve BGSAVE?
A: Because the core limitations are fundamental to its design.
BGSAVE relies on fork() and kernel copy-on-write:
- Memory overhead scales with the write rate (COW amplification).
- Snapshot duration scales with dataset size and serialization cost.
- High write pressure can lead to unbounded memory growth and OOM.
These are not easy to fix incrementally—they stem from:
- duplicating memory via COW,
- and serializing objects into RDB format.
DollySave takes a different approach entirely:
- no `fork()`
- no object serialization
- no COW amplification
Instead, it streams memory directly while tracking dirtied pages, which is why it can:
- complete much faster
- use predictable memory
- and remain stable under heavy write workloads
Acknowledgment
This issue is based on work and discussions with @avifenesh and @mo-amzn.
Thanks for the valuable POCs and experiments that improved the solution significantly.