Node becomes stale, Raft leader election fails, cluster stops (2-node CMG + MetaStorage) #12597

@GhoufranGhazaly

Description

I’m using Apache Ignite 3.1 in a production environment with a cluster of two server nodes.
Both nodes are configured as CMG and MetaStorage nodes.
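For reference, the network part of each node's configuration is a plain static node finder listing both nodes, roughly as follows (hostnames and port are placeholders for the actual values):

network {
    # host:port values below are placeholders for the real addresses
    port=3344
    nodeFinder {
        netClusterNodes=[
            "Node1:3344",
            "Node2:3344"
        ]
    }
}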
Recently, the cluster stopped working due to Raft leader election timeouts.
While investigating the logs, I found the following messages indicating that nodes became stale:

2025-12-19 23:39:37:970 +0530 [WARNING][Node1-network-worker-10][HandshakeManagerUtils]
Rejecting handshake: Node2:b1bbd239-630b-4da1-92f8-0fe86f6aa435 is stale, node should be restarted so that other nodes can connect

2025-12-19 23:39:47:777 +0530 [WARNING][Node1-network-worker-11][RecoveryAcceptorHandshakeManager]
Handshake rejected by initiator: Node1:96c65e18-3a2c-47bb-b21a-3d3d2119e3eb is stale, node should be restarted so that other nodes can connect

After this, the cluster started failing with Raft-related timeouts, the MetaStorage leader could not be elected, and the cluster became unavailable.
Questions

  • What can cause a node to become “stale” in Ignite 3, even when no scale-up or scale-down was performed, no rebalancing was triggered, and the cluster topology was unchanged?

  • How can this situation be avoided? Are there specific timeouts that should be tuned? Are there best practices for preventing stale nodes in production?

  • FailureHandler behavior
    Currently, my node configuration uses:

failureHandler {
    dumpThreadsOnFailure=true
    dumpThreadsThrottlingTimeoutMillis=10000
    handler {
        ignoredFailureTypes=[
            systemWorkerBlocked,
            systemCriticalOperationTimeout
        ]
        type=noop
    }
    oomBufferSizeBytes=16384
}

I am considering changing it to:

failureHandler {
    dumpThreadsOnFailure=true
    dumpThreadsThrottlingTimeoutMillis=10000
    handler {
        ignoredFailureTypes=[
            systemWorkerBlocked,
            systemCriticalOperationTimeout
        ]
        type=stop
    }
    oomBufferSizeBytes=16384
}

Additionally, I configured the Ignite service to restart the node automatically after 5 seconds.
Will this approach let a stale node restart automatically and rejoin the cluster cleanly?
Is this the recommended approach for production environments?
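The restart itself is configured at the service level with a minimal systemd override, roughly as below (the unit name ignite3db and the drop-in path are just examples from my setup; adjust them to whatever the actual Ignite service is called):

# /etc/systemd/system/ignite3db.service.d/restart.conf
# (unit name and drop-in path are examples; adjust to the actual Ignite service)
[Service]
# Restart the node automatically if the process exits with a failure;
# use Restart=always to also cover clean exits not initiated by systemctl.
Restart=on-failure
RestartSec=5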

  • Cluster stability
    Finally, I would appreciate guidance on:
      • Recommended production configuration
      • Cluster sizing considerations
      • Any known limitations or best practices to ensure cluster stability and avoid full outages
    My main concern is to keep the cluster stable in production and avoid complete unavailability.

Thank you for your guidance.
