Node becomes stale, Raft leader election fails, cluster stops (2-node CMG + MetaStorage) #12597

@GhoufranGhazaly

Description

I’m using Apache Ignite 3.1 in a production environment with a cluster of two server nodes.
Both nodes are configured as CMG and MetaStorage nodes.
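For reference, the network part of each node's configuration is a plain static node finder listing both nodes, roughly as follows (hostnames and port are placeholders for the actual values):

network {
    # host:port values below are placeholders for the real addresses
    port=3344
    nodeFinder {
        netClusterNodes=[
            "Node1:3344",
            "Node2:3344"
        ]
    }
}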
Recently, the cluster stopped working due to Raft leader election timeouts.
While investigating the logs, I found the following messages indicating that nodes became stale:

2025-12-19 23:39:37:970 +0530 [WARNING][Node1-network-worker-10][HandshakeManagerUtils]
Rejecting handshake: Node2:b1bbd239-630b-4da1-92f8-0fe86f6aa435 is stale, node should be restarted so that other nodes can connect

2025-12-19 23:39:47:777 +0530 [WARNING][Node1-network-worker-11][RecoveryAcceptorHandshakeManager]
Handshake rejected by initiator: Node1:96c65e18-3a2c-47bb-b21a-3d3d2119e3eb is stale, node should be restarted so that other nodes can connect

After this, the cluster started failing with Raft-related timeouts, the MetaStorage leader could not be elected, and the cluster became unavailable.
Questions

  • What can cause a node to become “stale” in Ignite 3, even when no scale-up or scale-down was performed, no rebalancing was triggered, and the cluster topology was unchanged?

  • How can this situation be avoided? Are there specific timeouts that should be tuned? Are there best practices for preventing stale nodes in production?

  • FailureHandler behavior
    Currently, my node configuration uses:

failureHandler {
    dumpThreadsOnFailure=true
    dumpThreadsThrottlingTimeoutMillis=10000
    handler {
        ignoredFailureTypes=[
            systemWorkerBlocked,
            systemCriticalOperationTimeout
        ]
        type=noop
    }
    oomBufferSizeBytes=16384
}

I am considering changing it to:

failureHandler {
    dumpThreadsOnFailure=true
    dumpThreadsThrottlingTimeoutMillis=10000
    handler {
        ignoredFailureTypes=[
            systemWorkerBlocked,
            systemCriticalOperationTimeout
        ]
        type=stop
    }
    oomBufferSizeBytes=16384
}

Additionally, I configured the Ignite service to restart the node automatically after 5 seconds.
Will this approach let a stale node restart automatically and rejoin the cluster cleanly?
Is this the recommended approach for production environments?
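The restart itself is configured at the service level with a minimal systemd override, roughly as below (the unit name ignite3db and the drop-in path are just examples from my setup; adjust them to whatever the actual Ignite service is called):

# /etc/systemd/system/ignite3db.service.d/restart.conf
# (unit name and drop-in path are examples; adjust to the actual Ignite service)
[Service]
# Restart the node automatically if the process exits with a failure;
# use Restart=always to also cover clean exits not initiated by systemctl.
Restart=on-failure
RestartSec=5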

  • Cluster stability
    Finally, I would appreciate guidance on:
      • Recommended production configuration
      • Cluster sizing considerations
      • Any known limitations or best practices to ensure cluster stability and avoid full outages
    My main concern is to keep the cluster stable in production and avoid complete unavailability.

Thank you for your guidance.
