Failing to save the cluster config file will not exit the process (#1032)
enjoy-binbin merged 10 commits into valkey-io:unstable from
Conversation
…ss from exiting when saving fails In clusterSaveConfigOrDie, we exit directly when saving fails. In the case of a disk failure, the cluster node will exit immediately if there are changes around the cluster. A passive exit may bring unexpected effects, such as the cluster going down. Provide an exit-on-cluster-config-file-save-error configuration item to control this behavior. Users who do not rely on nodes.conf can actively migrate the node later (or during low-traffic periods). Signed-off-by: Binbin <binloveplay1314@qq.com>
|
@PingXie This is a change I wanted to make a while ago, we'll discuss it in a later version. |
Codecov Report: ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1032 +/- ##
============================================
- Coverage 72.47% 72.46% -0.01%
============================================
Files 129 129
Lines 70537 70553 +16
============================================
+ Hits 51121 51129 +8
- Misses 19416 19424 +8
|
This adds more states we have to reason about. What exactly is the motivation behind being able to continue without saving to disk? It seems like maybe we should support a mode where we don't have to commit to disk at all. I've also seen issues with stuck IO causing the fsync to stall, which can also cause an outage. |
|
The motivation is to not exit immediately, because the exit is passive (if it is a primary, it may cause the cluster to go down and be unable to write for a certain period of time; if it is a replica, it may cause some read requests to fail in a read-write separation scenario). Not exiting immediately allows us to do a manual exit later, so that we have enough time to handle it. |
This is also a problem. I think we should figure out a way to get rid of this latency; we don't rely on nodes.conf very much, and the latency is killing us. |
|
I'd really appreciate it if you guys had some input on this one @zuiderkwast @hpatro @PingXie @madolson |
|
If saving nodes.conf succeeds once and later fails, then the next time we reboot we can end up with an outdated cluster config. I think this is a problem? Isn't it better to have a mode that doesn't even try to read or write nodes.conf at all, a diskless cluster mode? With some external control plane like a Kubernetes operator, you don't really need a persistent nodes.conf. If a node disappears or reboots, it's just replaced by a new node. I understand the wish to run a node without this storage bottleneck. |
Yes, that will be a problem, so people who enable this config should not rely on rebooting and should not rely on nodes.conf.
This seems like a good idea to me.
Yes, this is right. Some cases are difficult to handle; sometimes a disk error is discovered and a node needs to be migrated. But the node may update nodes.conf at any time before the migration (or during the migration, since a migration will surely trigger a nodes.conf update). If the update fails, the node will exit, the primary goes down, the cluster goes fail, and we will need a node-timeout to trigger the failover. e231406. On the other hand, I want to first add latency tracking around nodes.conf (WDYT?). We have indeed encountered this save function blocking for 100ms+, or open blocking for a few seconds (in the case of logging) |
|
The "diskless cluster mode," in my view, takes a different approach by delegating configuration management to an external component. However, I don't think it would work for users who prefer a "self-contained" deployment with servers automatically restarting after a crash, which is the use case I'm considering below. Allowing a node to continue operating despite failing to save its nodes.conf updates feels similar to a
I agree stuck IOs are a bit worse. The failover will still happen, which is good, but the old primary that gets stuck in the nodes.conf disk IO won't rejoin the shard as a replica. We could consider making these writes async so we can time out nodes.conf updates.
I also find it problematic that the current design mixes configurations (e.g., IP/port, slots) with states (e.g., FAIL/PFAIL) in the same file. I wonder if separating these concerns might reduce the likelihood of server crashes caused by disk failures. |
Yes, it is; it does delegate the conf to an external component, and it doesn't benefit these users. Offering a switch seems OK to me, unless there is a better solution.
This is a good statement, so it should not read or write at all.
So what happens if an async write fails? Should we still exit when the async write gets stuck and eventually fails? Also, if the cluster conf changes multiple times during the async write, this will make nodes.conf become stale data (we can argue it is only within a short window).
Yes, I think it can reduce the likelihood, but we are still facing the same problem. |
The async write is just a means for the server to be able to time out stuck nodes.conf file updates. The update semantics should still be sync at a high level.
Right, it is a mitigation, but it might be good enough in practice. |
OK, I see, we set a timeout and then exit before it gets stuck any longer, right?
The reason is that although we split it into two files, when something happens (conf or state changes), we still need to update a file, which means we are still facing the exit problem. Unless we treat some hot updates as non-exit updates, we will still hit the crash caused by disk failures, since we need to save the files anyway (one big file or two small files). |
correct.
Yeah it can still fail but my point is that 100% uptime is neither a goal nor attainable. At the end of the day, anything we depend on could and will fail. With the reduced update frequency, I wonder if we could reason that the probability of server crashing due to disk IO failures could be on par with the probability of the (bare-metal or virtual) machine failure? |
|
Another thought I want to mention is that a lot of normal users (not the self-contained ones) think Redis/Valkey is an in-memory database. When they disable persistence (RDB and AOF), they find it difficult to understand and accept that a cluster fails because a conf file failed to update, especially since the conf file is mostly maintained by us (or their admins). And for a "self-contained" deployment, they will eventually find out that if there is a disk failure, the server will still crash after the reboot, and the solution is to migrate the node (with its nodes.conf and data perhaps, or just add a new replica node on a new machine after the failover).
The disks we use are not reliable over time, which is a pain for me, so I want a way to solve or mitigate this problem, to avoid the stall (latency) or the exit.
I would also like to point out that mostly, I guess, or at least for us, people may deploy multiple servers on a single machine, so when a disk failure occurs, multiple nodes (or multiple instances) may be affected. |
Yes, that is a disconnect. As I mentioned earlier, I don't think the current cluster design requires nodes.conf updates to be synchronous, and synchronous nodes.conf updates don't prevent staleness either, since a down node can rejoin the cluster at any time. The epoch mechanism is there to prevent nodes with stale cluster views from messing up the global cluster state. I'm open to the idea of introducing this configuration, provided it defaults to off. |
I am glad we agree. The default of this configuration is the same as before, that is, if the update fails, the node will exit. |
|
We can probably rename it to something like cluster-ignore-disk-write-error, just like |
Interesting! New words like this are always fun. Now you gave me another idea: bounded staleness. Let's say we can tolerate write failures for, say, 5 minutes (or configurable). If the last successful write is older than that, we crash. Then we can guarantee the file is at least relatively new. How about that? |
I also had similar thoughts after dealing with issues where an incorrect shard-id was temporarily persisted to disk and a crash happened on the server; following that, the server on restart read the nodes.conf file and failed to start up due to a failing validation check. Even though it was a bug in the code, the dependency on nodes.conf led to this issue. I would also prefer to have a config to disable both reads and writes of nodes.conf. However, the risk is that, as all the information is lost, the node after a restart would have a new node id, and it may need to undergo the meet process with the rest of the cluster. |
I don't think it will help much. When a disk error happens, it is likely unable to recover, except when the disk is merely full, in which case we can delete some files. But I guess in most cases it is unlikely to recover unless we migrate the disk or the machine. So the n-minutes window doesn't look that useful.
Yes, I also like that idea; we should try to find a way to drop nodes.conf in this mode. Some people who have their own control plane might like this. |
|
I'm OK with
For completely disabling nodes.conf, yes, the node needs CLUSTER MEET again to join the cluster, and it gets a new node-id. We use this with Kubernetes already. Our (internal) Kubernetes operator creates a new node to replace any node that disappears. Persistent storage in Kubernetes can be problematic, especially in the case of a netsplit when the operator can't reach the failing node. In that case, the operator can't be sure whether the storage used by the node can safely be reused by another new node or whether the node is still alive and using the storage. Avoiding persistent storage in Kubernetes avoids this problem.
(FWIW, I think we should aim for an official Kubernetes operator as an open source control plane in the long term. It's a very non-trivial job though, because some users want persistent storage and others don't, cluster vs standalone, etc. There's a discussion here: https://github.com/orgs/valkey-io/discussions/746.) |
|
I would like to start by adding some latency tracking around it, see #1534 |
…exit Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
zuiderkwast
left a comment
Let's get this merged. It is sad that a fix like this can be stale for a very long time.
We should update the cluster-specification to say that correctness does not depend on nodes.conf being synchronously stored. If it is old, the cluster can still recover. We should explain the worst scenarios like #1032 (comment).
|
@valkey-io/core-team This PR no longer adds a config, so I removed the major-decision-pending. |
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Binbin <binloveplay1314@qq.com>
…exit Signed-off-by: Binbin <binloveplay1314@qq.com>
|
@valkey-io/core-team Do you guys want to take one last look before I merge it? |
|
I will take a look at this today! |
hpatro
left a comment
Overall it looks good to me.
Signed-off-by: Binbin <binloveplay1314@qq.com>
3fb4178 to d862233
|
Thank you all, I am merging it! |
…y-io#1032)

In clusterSaveConfigOrDie, we will exit directly when saving fails. In the case of disk failure, the cluster node will exit immediately if there are some changes around the cluster. We call it in many places, mainly in clusterBeforeSleep, that is, when the cluster configuration changes and we need to save. A passive exit may bring unexpected effects, such as the cluster going down.

We think the risk of the metadata becoming persistently out of date is minimal. On the one hand, we have the CLUSTER_WRITABLE_DELAY logic, which prevents a primary node from rejoining the cluster in an unsafe state within 2 seconds:

```
void clusterUpdateState(void) {
    /* If this is a primary node, wait some time before turning the state
     * into OK, since it is not a good idea to rejoin the cluster as a writable
     * primary, after a reboot, without giving the cluster a chance to
     * reconfigure this node. Note that the delay is calculated starting from
     * the first call to this function and not since the server start, in order
     * to not count the DB loading time. */
    if (first_call_time == 0) first_call_time = mstime();
    if (clusterNodeIsPrimary(myself) && server.cluster->state == CLUSTER_FAIL &&
        mstime() - first_call_time < CLUSTER_WRITABLE_DELAY)
        return;
```

The remaining, potentially worse case is that the node votes twice in the same epoch. Say we failed to save nodes.conf after voting for replica X. We reboot, and during this time X wins the failover. After the reboot, node Y requests a vote for the same epoch and we vote for Y. Y wins the failover with the same epoch, so we have two primaries with the same epoch and get an epoch collision. It is resolved and some writes are lost; just like in a failover, some writes can be lost. This should be very rare and is not very bad. We will use the same CLUSTER_WRITABLE_DELAY logic to make an optimistic judgment and prevent the node from voting right after restarting.

Added a new clusterSaveConfigOrLog: if the save fails, instead of exiting, we now just print a warning log. We replace the clusterSaveConfigOrDie in clusterBeforeSleep with it. That is, the config save triggered by beforeSleep will no longer exit the process even if the save fails.

Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This commit introduces a new configuration option `cluster-persist-config` that controls how the cluster handles nodes.conf file save failures. The option supports two modes:
- `sync` (default): Synchronously save the config file. If the save fails, the process exits. This maintains backward compatibility with the old behavior (before 9.1).
- `best-effort`: Synchronously save the config file. If the save fails, only log a warning and continue running. This allows the node to survive disk failures (e.g., disk full, read-only filesystem) without exiting, giving administrators time to address the issue.
Note that this modifies the behavior of valkey-io#1032: whereas valkey-io#1032 was "best-effort", we have now introduced a configuration option that defaults to "sync". See the valkey-io#1032 discussion for more details.
Background: When a disk becomes read-only or full, any cluster metadata change would trigger a nodes.conf save attempt. With the old behavior, the node would immediately exit via clusterSaveConfigOrDie(), potentially causing multiple nodes on the same machine to crash simultaneously, leading to cluster unavailability. The new `best-effort` mode addresses this by allowing nodes to continue operating even when disk writes fail. This is particularly useful in cloud environments where disk failures are more common due to scale.
Note: Startup-time config saves (in clusterInit and verifyClusterConfigWithData) still use clusterSaveConfigOrDie() since disk issues at startup should cause immediate failure. Signed-off-by: Binbin <binloveplay1314@qq.com>
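Based on the option name and values in the commit message above, enabling the lenient mode would be a one-line config change (a hypothetical valkey.conf fragment; verify the exact syntax against the shipped default config):

```
# Keep running and only log a warning when a nodes.conf save fails.
# The default, "sync", preserves the old exit-on-failure behavior.
cluster-persist-config best-effort
```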