# Failure-tolerant replication and failover

## Introduction

The goal of this guide is to set up a replication cluster managed by MaxScale that is reasonably
tolerant of failures, i.e. even if one part fails, the cluster continues to work. Additionally,
transaction data should be preserved whenever possible. All of this should work automatically.

This guide assumes that the reader is familiar with MariaDB replication and GTIDs,
[MariaDB Monitor](../reference/maxscale-monitors/mariadb-monitor.md), and
[failover](automatic-failover-with-mariadb-monitor.md).

## The problem with normal replication

The basic problem of replication is that the primary and replica are not always in the same
state. When a commit is performed on the primary, the primary updates both the actual database file
and the binary log. These items are updated in a transactional manner: either both succeed or both
fail. Then, the primary sends the binary log event to the replicas and they update their own
databases and logs.

A replica may crash or lose connection to the primary. Fortunately, this is not a big issue as once
the replica returns, it can simply resume replication from where it left off. The replica cannot
diverge as it is always either in the same state as the primary, or behind. The replica is only lost
if the primary no longer has the binary logs from the point at which the replica went down.

If the primary crashes or loses network connection, failover may lose data. This depends on the
point at which the crash happens:

<ol type="A">
  <li>If the primary managed to send all committed transactions to a replica, then all is still
well. The replica has all the data, and can be promoted to primary e.g. by MaxScale (MaxScale will
promote the most up-to-date replica). Once the old primary returns, it can rejoin the cluster.</li>
  <li>If the primary crashes just after it committed a transaction and updated its binary log, but
before it sent the binary log event to a replica, then failover loses data and the old primary can
no longer rejoin the cluster.</li>
</ol>

Let’s look at situation B in more detail. *server1* is the original primary and its replicas are
*server2* and *server3*, with server ids 1, 2 and 3, respectively. *server1* is at gtid 1-1-101 when
it crashes while the others have replicated up to the previous event 1-1-100. The example server
status output below is for demonstration only, since in reality it would be unlikely that the
monitor would manage to update the gtid-position of *server1* right at the moment of the crash.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Write  │ Down               │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-100 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-100 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

*server1* stays down long enough for failover to activate (in MaxScale, the time is roughly
*monitor_interval* * *failcount*). *server2* gets promoted, and MaxScale routes any new writes to
it. *server2* starts generating binary log events with gtids 1-2-101, 1-2-102 and so on. If
*server1* now comes back online, it can no longer rejoin as it is at gtid 1-1-101, which conflicts
with 1-2-101.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Up     │                    │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Write  │ Primary            │ 1-2-102 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-2-102 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

At this point, the DBA could forcefully alter the gtid of *server1*, setting it to 1-1-100, which is
in *server2*’s binary log, enabling rejoin. This is usually ill-advised, as changing the gtid does
not roll back the actual data in *server1*’s database, meaning that data conflicts can still happen,
perhaps days later. A more reliable way to handle this case is to rebuild *server1* from one of the
other servers (MaxScale can help with this process, but it requires configuration and [manual
launching](../reference/maxscale-monitors/mariadb-monitor.md#backup-operations)).

If the old primary returns before failover activates, then replication can continue regardless of
the exact moment at which the crash happened. **This means that the DBA should configure automatic
failover to happen only after the primary has been down so long that the downsides of the service
outage outweigh the threat of losing data and having to rebuild the old primary.**
*monitor_interval* * *failcount* should at minimum be large enough so that failover does not
trigger due to a momentary network failure.
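
For illustration, a minimal sketch of such a monitor section in `maxscale.cnf`; the server names,
credentials and timing values here are assumptions for the example, not recommendations:

```ini
[Monitor1]
type=monitor
module=mariadbmon
servers=server1,server2,server3
user=maxuser
password=maxpwd
# Failover triggers roughly monitor_interval * failcount after the primary
# stops responding: 2000ms * 5 = about 10 seconds with these values.
monitor_interval=2000ms
failcount=5
auto_failover=true
auto_rejoin=true
```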

## Semi-synchronous replication

[Semi-synchronous replication](https://mariadb.com/docs/server/ha-and-performance/standard-replication/semisynchronous-replication)
offers a more reliable, but also more complicated, way to keep the cluster in sync. Semi-sync
replication means that after the primary commits a transaction, it does not immediately return an OK
to the client. The primary instead sends the binary log update to the replicas and waits for an
acknowledgement from at least one replica before sending the OK-message back to the client. This
means that once the client gets the OK, the transaction data is typically on at least two
servers. This is not absolutely certain, as the primary does not wait forever for the replica
acknowledgement. If no replica responds in time, the primary switches to normal replication and
returns OK to the client. This timeout is controlled by the MariaDB Server setting
**rpl_semi_sync_master_timeout**. If this limit is hit, the client will notice it as a visible stall
in the transaction. Even if the limit is not hit, throughput suffers compared to normal
replication due to the additional waiting.
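
For illustration, the timeout can be inspected and adjusted at runtime; a minimal sketch, where the
6000 millisecond value is just an example:

```sql
-- Current timeout (in milliseconds) and how many transactions have been
-- committed without a semi-sync acknowledgement.
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_timeout';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_no_tx';
-- Raise the timeout at runtime.
SET GLOBAL rpl_semi_sync_master_timeout = 6000;
```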

Semi-synchronous replication by itself does not remove the possibility of the primary diverging
after a crash. Scenario B, presented in the previous section, can still happen if the primary
crashes after applying the transaction but before sending it to replicas. To increase crash safety,
a few settings need to be tuned to change the behavior of the primary server both during replication
and during startup after a crash (crash recovery).

### Enable semi-synchronous replication

To enable semi-synchronous replication, add the following to the configuration files of all
servers:

```ini
rpl_semi_sync_master_enabled=ON
rpl_semi_sync_slave_enabled=ON
```

These settings allow the servers to act as both semi-sync primary and replica, which is useful when
combined with automatic failover. Restart the servers and run `show status like 'rpl%';` to see the
semi-sync-related status variables. Check the values of Rpl_semi_sync_master_clients,
Rpl_semi_sync_master_status and Rpl_semi_sync_slave_status. On the primary, their values should be:

1. Rpl_semi_sync_master_clients \<number of replicas\>
2. Rpl_semi_sync_master_status ON
3. Rpl_semi_sync_slave_status OFF

On the replicas, the values should be:

1. Rpl_semi_sync_master_clients 0
2. Rpl_semi_sync_master_status ON
3. Rpl_semi_sync_slave_status ON
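
The two `rpl_semi_sync_*_enabled` settings can also be turned on at runtime instead of waiting for a
restart; a minimal sketch, assuming a user with sufficient privileges (the configuration file
entries are still needed so the change survives a restart):

```sql
-- Enable semi-synchronous replication dynamically on a running server.
SET GLOBAL rpl_semi_sync_master_enabled = ON;
SET GLOBAL rpl_semi_sync_slave_enabled = ON;
-- On a replica, restart the replication threads so that it registers with
-- the primary as a semi-sync client.
STOP SLAVE;
START SLAVE;
```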

### Configure wait point and startup role

Next, change the point at which the primary server waits for the replica acknowledgement. By
default, this is after the transaction is written to storage, which is not much different from
normal replication. Set the following in the server configuration files:

```ini
rpl_semi_sync_master_wait_point=AFTER_SYNC
```

AFTER_SYNC means that the primary sends the binary log event to replicas after writing it to the
binary log but before committing the actual transaction to its own storage. The primary only updates
its internal storage once at least one replica gives the reply or the rpl_semi_sync_master_timeout
is reached. More importantly, this means that if the primary crashes while waiting for the reply,
its binary log and storage engine will be out of sync. On startup, the server must thus decide what
to do: either consider the binary log correct and apply the missing transactions to storage or
discard the unverified binary log events.

As of MariaDB Server 10.6.19, 10.11.9, 11.1.6, 11.2.5, 11.4.3 and 11.5.2, this decision is
controlled by the startup option
[init-rpl-role](https://mariadb.com/docs/server/ha-and-performance/standard-replication/semisynchronous-replication#init-rpl-role).
If set to MASTER, the server applies the transactions during startup, as it assumes it is still
the primary. If init-rpl-role is set to SLAVE, the server discards the transactions. The former
option does not improve the situation after a failover, as the primary could apply transactions that
never made it to a replica. The latter option, on the other hand, ensures that when the old primary
comes back online, it does not have conflicting transactions and can rejoin the cluster as a
replica. So, add the following to all server configurations:

```ini
init-rpl-role=SLAVE
```

### Configure service restart delay

This scheme is not entirely without issues. `init-rpl-role=SLAVE` means that the old primary
(*server1*) will always discard the unverified transactions during startup, even if the data did
successfully replicate to a replica before the crash (*server1* crashed after sending the data but
before receiving the reply). This is not an issue if failover has already occurred by this point, as
*server1* can just fetch the same writes from the new primary (*server2*). However, if *server1*
comes back online quickly, before failover, it could be behind *server2*: *server1* at gtid 1-1-100
and *server2* at 1-1-101.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Write  │ Primary            │ 1-1-100 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-101 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

As *server1* is still the primary, MaxScale sends any new writes to it. The next write will get gtid
1-1-101, meaning that both servers are on the same gtid but their actual contents likely
differ. This will cause a replication error at some point.

This means that if a primary server crashes, it must stay down long enough for failover to occur, so
that MaxScale no longer treats it as the primary once it returns. This can be enforced by changing
the service settings. Run `sudo systemctl edit mariadb.service` on all server machines and add the
following:

```ini
[Service]
RestartSec=1min
```

The configured time needs to be comfortably longer than the *monitor_interval* * *failcount*
configured for the MariaDB Monitor in MaxScale.
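
As a worked example, with the illustrative monitor settings shown earlier (*monitor_interval* of
2000ms and *failcount* of 5), failover triggers roughly 10 seconds after the primary stops
responding, so `RestartSec=1min` leaves a comfortable margin.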

### Failure scenarios

The setup described above protects against a primary server crashing and diverging from the rest of
the cluster. It does not protect from data loss due to network outages. If the connection to the
primary server is lost during a transaction commit (before the data was replicated), the primary
will eventually apply the transaction to its storage. If a failover occurred during the network
outage, the rest of the cluster has already continued under a new primary, leaving the old one
diverged. This is similar to normal replication.

Some queries are not transactional and may still risk diverging replication. These are typically
DDL-queries such as CREATE TABLE, ALTER TABLE and so on. DDL-queries can be transactional if the
changes they make are “small”, such as renaming a column. Large-scale DDL-queries (e.g. ADD COLUMN
to a table with a billion rows) are more vulnerable. The settings presented in this document were
only tested against simple DML-queries that updated a single row.

### Client perspective

If the client gets an OK to its commit command, then it knows that (assuming no semi-sync timeout
happened) the transaction is on at least two servers. Only the primary has certainly processed the
update at this point; the replica may only have the event in its relay log. This means that
SELECT-queries routed to a replica (e.g. by the ReadWriteSplit-router) may see old data.

If the client does not get an OK due to a primary crash, then the situation is more ambiguous:

<ol type="A">
  <li>Primary crashed before starting the commit</li>
  <li>Primary crashed just before receiving the replica acknowledgement</li>
  <li>Primary crashed just as it was about to send the OK</li>
</ol>

In case A, the transaction is lost. In case B, the transaction is present on the replica and will be
visible at some point. In case C, the transaction is present on both servers. Since MaxScale cannot
know which case is in effect, it does not attempt transaction replay (even if configured) and
disconnects the client. It’s up to the client to then reconnect and figure out the status of the
transaction.
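
How the client figures this out is application-specific, but the usual approach is to make the
outcome checkable from the data itself. A minimal sketch, assuming a hypothetical `orders` table in
which the application stores a client-generated `request_id` with every write:

```sql
-- After reconnecting, and once the cluster has settled on a primary, check
-- whether the interrupted commit took effect. Table and column names are
-- only an example.
SELECT COUNT(*) FROM orders WHERE request_id = 'c0ffee-42';
-- 1 row:  the transaction survived (case B or C) and must not be applied again.
-- 0 rows: the transaction was lost (case A) and can be retried.
```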

### Test summary

The server settings used during testing are below. rpl_semi_sync_master_timeout is given in
milliseconds, rpl_semi_sync_slave_kill_conn_timeout in seconds.

```ini
rpl_semi_sync_master_enabled=ON
rpl_semi_sync_slave_enabled=ON
rpl_semi_sync_master_wait_point=AFTER_SYNC
rpl_semi_sync_master_timeout=6000
rpl_semi_sync_slave_kill_conn_timeout=5
init-rpl-role=SLAVE
gtid_strict_mode=1
log_slave_updates=1
```

The MariaDB Monitor in MaxScale was configured with both *auto_failover* and *auto_rejoin* enabled.
Hundreds of failovers with continuous write traffic succeeded without a diverging old primary. When
*init-rpl-role* was changed to MASTER, replication eventually broke, although this could take some
time.

<sub>_This page is licensed: CC BY-SA / Gnu FDL_</sub>

{% @marketo/form formId="4316" %}