Skip to content

Add cluster-persist-config option to control nodes.conf save behavior#3372

Open
enjoy-binbin wants to merge 2 commits intovalkey-io:unstablefrom
enjoy-binbin:cluster_config_save
Open

Add cluster-persist-config option to control nodes.conf save behavior#3372
enjoy-binbin wants to merge 2 commits intovalkey-io:unstablefrom
enjoy-binbin:cluster_config_save

Conversation

@enjoy-binbin
Copy link
Member

This commit introduces a new configuration option cluster-persist-config
that controls how the cluster handles nodes.conf file save failures.

The option supports two modes:

  • sync (default): Synchronously save the config file. If the save fails,
    the process exits. This maintains backward compatibility with the old
    behavior (before 9.1).
  • best-effort: Synchronously save the config file. If the save fails,
    only log a warning and continue running. This allows the node to survive
    disk failures (e.g., disk full, read-only filesystem) without exitting,
    giving administrators time to address the issue.

Note that this modifies the behavior of #1032, whereas #1032 was "best-effort",
we have now introduced a configuration option that defaults to "sync."
See #1032 discussion for more details.

Background:
When a disk becomes read-only or full, any cluster metadata change would
trigger a nodes.conf save attempt. With the old behavior, the node would
immediately exit via clusterSaveConfigOrDie(), potentially causing multiple
nodes on the same machine to crash simultaneously, leading to cluster
unavailability.

The new best-effort mode addresses this by allowing nodes to continue
operating even when disk writes fail. This is particularly useful in cloud
environments where disk failures are more common due to scale.

Note: Startup-time config saves (in clusterInit and verifyClusterConfigWithData)
still use clusterSaveConfigOrDie() since disk issues at startup should cause
immediate failure.

This commit introduces a new configuration option `cluster-persist-config`
that controls how the cluster handles nodes.conf file save failures.

The option supports two modes:
- `sync` (default): Synchronously save the config file. If the save fails,
  the process exits. This maintains backward compatibility with the old
  behavior (before 9.1).
- `best-effort`: Synchronously save the config file. If the save fails,
  only log a warning and continue running. This allows the node to survive
  disk failures (e.g., disk full, read-only filesystem) without exitting,
  giving administrators time to address the issue.

Note that this modifies the behavior of valkey-io#1032, whereas valkey-io#1032 was "best-effort",
we have now introduced a configuration option that defaults to "sync."
See valkey-io#1032 discussion for more details.

Background:
When a disk becomes read-only or full, any cluster metadata change would
trigger a nodes.conf save attempt. With the old behavior, the node would
immediately exit via clusterSaveConfigOrDie(), potentially causing multiple
nodes on the same machine to crash simultaneously, leading to cluster
unavailability.

The new `best-effort` mode addresses this by allowing nodes to continue
operating even when disk writes fail. This is particularly useful in cloud
environments where disk failures are more common due to scale.

Note: Startup-time config saves (in clusterInit and verifyClusterConfigWithData)
still use clusterSaveConfigOrDie() since disk issues at startup should cause
immediate failure.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin moved this to Optional for next release candidate in Valkey 9.1 Mar 17, 2026
@codecov
Copy link

codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.50%. Comparing base (6c329df) to head (b7f97d5).
⚠️ Report is 3 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3372      +/-   ##
============================================
- Coverage     74.55%   74.50%   -0.05%     
============================================
  Files           130      130              
  Lines         72731    72734       +3     
============================================
- Hits          54225    54192      -33     
- Misses        18506    18542      +36     
Files with missing lines Coverage Δ
src/cluster_legacy.c 88.16% <100.00%> (-0.16%) ⬇️
src/config.c 77.70% <ø> (ø)
src/server.h 100.00% <ø> (ø)

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@zuiderkwast
Copy link
Contributor

When we merge #2555 we can change 'best-effort' to be in bio?

Copy link
Contributor

@murphyjacob4 murphyjacob4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

We had quorum in Monday's meeting and got the verbal consensus. FYI @valkey-io/core-team in case anyone wasn't there and wants to chime in

Copy link
Contributor

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we name it as cluster-config-save-behavior ?

@enjoy-binbin
Copy link
Member Author

When we merge #2555 we can change 'best-effort' to be in bio?

Yes, we can, it is best-effort.

Shall we name it as cluster-config-save-behavior ?

Sound like a better (a more specific one) name. I am open to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major-decision-approved Major decision approved by TSC team

Projects

Status: Optional for next release candidate

Development

Successfully merging this pull request may close these issues.

4 participants