
Commit 8cc8c58

MXS-6032 Add semi-synchronous failover tutorial
Describes how to configure servers for maximum resilience for use with MaxScale auto_failover.
1 parent 40692f1 commit 8cc8c58

2 files changed: +268 −0 lines changed

maxscale/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@
* [Configuring the MariaDB Monitor](mariadb-maxscale-tutorials/configuring-the-mariadb-monitor.md)
* [Connection Routing with MariaDB MaxScale](mariadb-maxscale-tutorials/connection-routing-with-mariadb-maxscale.md)
* [Encrypting Passwords](mariadb-maxscale-tutorials/encrypting-passwords.md)
+ * [Failure-tolerant replication and failover](mariadb-maxscale-tutorials/failure-tolerant-replication-and-failover.md)
* [Filters](mariadb-maxscale-tutorials/filters.md)
* [MaxScale Administration Tutorial](mariadb-maxscale-tutorials/maxscale-administration-tutorial.md)
* [Read-Write Splitting](mariadb-maxscale-tutorials/read-write-splitting.md)

mariadb-maxscale-tutorials/failure-tolerant-replication-and-failover.md

Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
# Failure-tolerant replication and failover

## Introduction

The goal of this guide is to set up a replication cluster managed by MaxScale that is reasonably
tolerant of failures, i.e. even if one part fails, the cluster continues to work. Additionally,
transaction data should be preserved whenever possible. All of this should work automatically.

This guide assumes that the reader is familiar with MariaDB replication and GTIDs,
[MariaDB Monitor](../reference/maxscale-monitors/mariadb-monitor.md), and
[failover](automatic-failover-with-mariadb-monitor.md).

## The problem with normal replication

The basic problem of replication is that the primary and replica are not always in the same
state. When a commit is performed on the primary, the primary updates both the actual database file
and the binary log. These items are updated in a transactional manner: either both succeed or both
fail. Then, the primary sends the binary log event to the replicas and they update their own
databases and logs.

A replica may crash or lose its connection to the primary. Fortunately, this is not a big issue:
once the replica returns, it can simply resume replication from where it left off. The replica
cannot diverge, as it is always either in the same state as the primary or behind it. The replica is
only lost if the primary no longer has the binary logs from the point at which the replica went down.
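
Because replication is GTID-based, the positions can be compared directly to confirm that a
returning replica is catching up rather than diverging. A minimal check, using the example addresses
from this guide and placeholder credentials:

```bash
# The replica's GTID position should equal the primary's or lag behind it,
# never be ahead of it. Addresses are the example servers from this guide;
# the user and password are placeholders.
mariadb -h 192.168.121.51  -u admin -p -e "SELECT @@gtid_binlog_pos;"  # primary
mariadb -h 192.168.121.190 -u admin -p -e "SELECT @@gtid_slave_pos;"   # replica
```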

If the primary crashes or loses its network connection, failover may lose data. This depends on the
point at which the crash happens:

<ol type="A">
<li>If the primary managed to send all committed transactions to a replica, then all is still
well. The replica has all the data and can be promoted to primary, e.g. by MaxScale (MaxScale will
promote the most up-to-date replica). Once the old primary returns, it can rejoin the cluster.</li>
<li>If the primary crashes just after it committed a transaction and updated its binary log, but
before it sent the binary log event to a replica, then failover loses data and the old primary can
no longer rejoin the cluster.</li>
</ol>

Let’s look at situation B in more detail. *server1* is the original primary and its replicas are
*server2* and *server3*, with server ids 1, 2 and 3, respectively. *server1* is at gtid 1-1-101 when
it crashes, while the others have replicated up to the previous event 1-1-100. The example server
status output below is for demonstration only, since in reality the monitor would be unlikely to
manage to update the gtid position of *server1* at the exact moment of the crash.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Write  │ Down               │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-100 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-100 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

*server1* stays down long enough for failover to activate (in MaxScale, the time is roughly
*monitor_interval* * *failcount*). *server2* gets promoted, and MaxScale routes any new writes to
it. *server2* starts generating binary log events with gtids 1-2-101, 1-2-102 and so on. If
*server1* now comes back online, it can no longer rejoin as it is at gtid 1-1-101, which conflicts
with 1-2-101.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Up     │                    │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Write  │ Primary            │ 1-2-102 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-2-102 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

At this point, the DBA could forcefully alter the gtid of *server1*, setting it to 1-1-100, which is
in *server2*’s binary log, enabling rejoin. This is usually ill-advised, as changing the gtid does
not roll back the actual data in *server1*’s database, meaning that data conflicts can still happen,
perhaps days later. A more reliable way to handle this case is to rebuild *server1* from one of the
other servers (MaxScale can help with this process, but it requires configuration and [manual
launching](../reference/maxscale-monitors/mariadb-monitor.md#backup-operations)).
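
For reference, such a rebuild can be launched manually through the monitor's module commands. This
assumes a MaxScale version that includes the backup operations and a monitor already prepared for
them (for example with ssh_user and ssh_keyfile set); the monitor and server names follow the
example setup:

```bash
# Rebuild server1 using server2 as the source. The operation runs in the background.
maxctrl call command mariadbmon async-rebuild-server Monitor1 server1 server2

# Poll the result of the background operation.
maxctrl call command mariadbmon fetch-cmd-result Monitor1
```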

If the old primary returns before failover activates, then replication can continue regardless of
the exact moment at which the crash happened. **This means that the DBA should configure automatic
failover to happen only after the primary has been down so long that the downsides of the service
outage outweigh the threat of losing data and having to rebuild the old primary.**
*monitor_interval* * *failcount* should at minimum be large enough that failover does not trigger
due to a momentary network failure.
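
For example, with the monitor settings below, failover triggers only after the primary has been
unreachable for roughly a minute. The values are illustrative only, and the duration syntax assumes
a reasonably recent MaxScale; the monitor name follows the example setup:

```bash
# Roughly 5s * 12 = 60s of the primary being unreachable before failover activates.
maxctrl alter monitor Monitor1 monitor_interval=5s failcount=12 auto_failover=true

# Verify the effective monitor settings.
maxctrl show monitor Monitor1
```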

## Semi-synchronous replication

[Semi-synchronous replication](https://mariadb.com/docs/server/ha-and-performance/standard-replication/semisynchronous-replication)
offers a more reliable, but also more complicated, way to keep the cluster in sync. Semi-sync
replication means that after the primary commits a transaction, it does not immediately return an OK
to the client. The primary instead sends the binary log update to the replicas and waits for an
acknowledgement from at least one replica before sending the OK back to the client. This means that
once the client gets the OK, the transaction data is typically on at least two servers. This is not
absolutely certain, as the primary does not wait forever for the replica acknowledgement. If no
replica responds in time, the primary switches to normal replication and returns OK to the
client. This timeout is controlled by the MariaDB Server setting **rpl_semi_sync_master_timeout**.
If this limit is hit, the client will notice it as a visible stall in the transaction. Even if the
limit is not hit, throughput suffers compared to normal replication due to the additional waiting.
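
Whether the timeout is being hit can also be checked from the semi-sync counters on the primary. A
quick check might look like the following; the host and credentials are placeholders:

```bash
# Rpl_semi_sync_master_no_tx counts commits that fell back to asynchronous
# replication because no replica acknowledged in time;
# Rpl_semi_sync_master_yes_tx counts commits that were acknowledged.
mariadb -h 192.168.121.51 -u admin -p -e "
  SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_timeout';
  SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_%_tx';"
```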

Semi-synchronous replication by itself does not remove the possibility of the primary diverging
after a crash. Scenario B from the previous section can still happen if the primary crashes after
applying the transaction but before sending it to the replicas. To increase crash safety, a few
settings need to be tuned to change the behavior of the primary server both during replication and
during startup after a crash (crash recovery).

### Enable semi-synchronous replication

To enable semi-synchronous replication, add the following to the configuration files of all
servers:

```ini
rpl_semi_sync_master_enabled=ON
rpl_semi_sync_slave_enabled=ON
```

These settings allow the servers to act as both semi-sync primary and replica, which is useful when
combined with automatic failover. Restart the servers and run `show status like 'rpl%';` to see the
semi-sync related status variables. Check the values of Rpl_semi_sync_master_clients,
Rpl_semi_sync_master_status and Rpl_semi_sync_slave_status (a scripted check is sketched after the
lists). On the primary, their values should be:

1. Rpl_semi_sync_master_clients \<number of replicas\>
2. Rpl_semi_sync_master_status ON
3. Rpl_semi_sync_slave_status OFF

On the replicas, the values should be:

1. Rpl_semi_sync_master_clients 0
2. Rpl_semi_sync_master_status ON
3. Rpl_semi_sync_slave_status ON
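
A quick way to verify these values on every server is to loop over them with the command-line
client. The addresses are the example servers from this guide; the user and password are
placeholders:

```bash
# Compare each server's output with the expected values listed above.
for host in 192.168.121.51 192.168.121.190 192.168.121.112; do
  echo "=== $host ==="
  mariadb -h "$host" -u admin -p'secret' -e "
    SHOW STATUS LIKE 'Rpl_semi_sync_master_clients';
    SHOW STATUS LIKE 'Rpl_semi_sync_master_status';
    SHOW STATUS LIKE 'Rpl_semi_sync_slave_status';"
done
```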

### Configure wait point and startup role

Next, change the point at which the primary server waits for the replica acknowledgement. By
default, this is after the transaction is written to storage, which is not much different from
normal replication. Set the following in the server configuration files:

```ini
rpl_semi_sync_master_wait_point=AFTER_SYNC
```

AFTER_SYNC means that the primary sends the binary log event to replicas after writing it to the
binary log but before committing the actual transaction to its own storage. The primary only updates
its internal storage once at least one replica gives the reply or the rpl_semi_sync_master_timeout
is reached. More importantly, this means that if the primary crashes while waiting for the reply,
its binary log and storage engine will be out of sync. On startup, the server must thus decide what
to do: either consider the binary log correct and apply the missing transactions to storage or
discard the unverified binary log events.

As of MariaDB Server 10.6.19, 10.11.9, 11.1.6, 11.2.5, 11.4.3 and 11.5.2, this decision is
controlled by the startup option
[init-rpl-role](https://mariadb.com/docs/server/ha-and-performance/standard-replication/semisynchronous-replication#init-rpl-role).
If set to MASTER, the server applies the transactions during startup, as it assumes it is still
the primary. If init-rpl-role is set to SLAVE, the server discards the transactions. The former
option does not improve the situation after a failover, as the primary could apply transactions that
never made it to a replica. The latter option, on the other hand, ensures that when the old primary
comes back online, it does not have conflicting transactions and can rejoin the cluster as a
replica. So, add the following to all server configurations:

```ini
init-rpl-role=SLAVE
```

### Configure service restart delay

This scheme is not entirely without issues. `init-rpl-role=SLAVE` means that the old primary
(*server1*) will always discard the unverified transactions during startup, even if the data did
successfully replicate to a replica before the crash (*server1* crashed after sending the data but
before receiving the reply). This is not an issue if failover has already occurred by this point, as
*server1* can just fetch the same writes from the new primary (*server2*). However, if *server1*
comes back online quickly, before failover, it could be behind *server2*: *server1* at gtid 1-1-100
and *server2* at 1-1-101.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Write  │ Primary            │ 1-1-100 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-101 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

As *server1* is still the primary, MaxScale sends any new writes to it. The next write will get gtid
1-1-101, meaning that both servers are on the same gtid but their actual contents likely
differ. This will cause a replication error at some point.
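
When this happens, the divergence typically surfaces as a replication error on a replica. One way to
spot it, using a replica address from the example cluster and placeholder credentials:

```bash
# A diverged server usually shows up as an I/O or SQL thread error on a replica.
mariadb -h 192.168.121.190 -u admin -p -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Slave_(IO|SQL)_Running|Last_(IO|SQL)_Error'
```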

This means that if a primary server crashes, it must stay down long enough for failover to occur and
to ensure MaxScale does not treat it as the primary once it returns. This can be enforced by
changing the service settings. Run `sudo systemctl edit mariadb.service` on all server machines and
add the following:

```ini
[Service]
RestartSec=1min
```

The configured time needs to be comfortably longer than the *monitor_interval* * *failcount*
configured for the MariaDB Monitor in MaxScale.
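
On a systemd-based installation, the effective restart delay can be verified per host. A minimal
sanity check could be:

```bash
# Show the unit file together with any drop-in overrides...
systemctl cat mariadb.service

# ...or query the effective restart delay directly.
systemctl show mariadb.service -p RestartUSec
```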

### Failure scenarios

The setup described above protects against a primary server crashing and diverging from the rest of
the cluster. It does not protect from data loss due to network outages. If the connection to the
primary server is lost during a transaction commit (before the data was replicated), the primary
will eventually apply the transaction to its storage. If a failover occurred during the network
outage, the rest of the cluster has already continued under a new primary, leaving the old one
diverged. This is similar to normal replication.

Some queries are not transactional and may still risk diverging replication. These are typically
DDL queries such as CREATE TABLE, ALTER TABLE and so on. DDL queries can be transactional if the
changes they make are “small”, such as renaming a column. Large-scale DDL queries (e.g. ADD COLUMN
to a table with a billion rows) are more vulnerable. The settings presented in this document were
only tested against simple DML queries that updated a single row.

### Client perspective

If the client gets an OK to its commit command, then it knows that (assuming no semi-sync timeout
happened) the transaction is on at least two servers. Only the primary has certainly processed the
update at this point; the replica may just have the event in its relay log. This means that
SELECT queries routed to a replica (e.g. by the ReadWriteSplit router) may see old data.

If the client does not get an OK due to a primary crash, then the situation is more ambiguous:

<ol type="A">
<li>Primary crashed before starting the commit</li>
<li>Primary crashed just before receiving the replica acknowledgement</li>
<li>Primary crashed just as it was about to send the OK</li>
</ol>

In case A, the transaction is lost. In case B, the transaction is present on the replica and will be
visible at some point. In case C, the transaction is present on both servers. Since MaxScale cannot
know which case is in effect, it does not attempt transaction replay (even if configured) and
disconnects the client. It’s up to the client to then reconnect and figure out the status of the
transaction.
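
As a sketch of what figuring out the status can mean in practice, the client can reconnect through
MaxScale and check whether its last write is visible before retrying it. The listener port, schema,
table and key below are hypothetical:

```bash
# Hypothetical example: after a disconnect during COMMIT, reconnect via MaxScale
# and check whether order 12345 was persisted before retrying the transaction.
# Note that a read routed to a lagging replica may not show the row immediately.
mariadb -h maxscale-host -P 4006 -u app_user -p -e \
  "SELECT COUNT(*) FROM shop.orders WHERE order_id = 12345;"
```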

### Test summary

The server settings used during testing are below. rpl_semi_sync_master_timeout is given in
milliseconds, rpl_semi_sync_slave_kill_conn_timeout in seconds.

```ini
rpl_semi_sync_master_enabled=ON
rpl_semi_sync_slave_enabled=ON
rpl_semi_sync_master_wait_point=AFTER_SYNC
rpl_semi_sync_master_timeout=6000
rpl_semi_sync_slave_kill_conn_timeout=5
init-rpl-role=SLAVE
gtid_strict_mode=1
log_slave_updates=1
```

The MariaDB Monitor in MaxScale was configured with both *auto_failover* and *auto_rejoin* enabled.
Hundreds of failovers with continuous write traffic succeeded without a diverging old primary. When
*init-rpl-role* was changed to MASTER, replication eventually broke, although this could take some
time.
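
For completeness, the corresponding monitor switches can also be set at runtime with maxctrl; the
monitor name follows the example setup:

```bash
# Enable automatic failover and automatic rejoin on the example monitor.
maxctrl alter monitor Monitor1 auto_failover=true auto_rejoin=true
```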

<sub>_This page is licensed: CC BY-SA / Gnu FDL_</sub>

{% @marketo/form formId="4316" %}
