# Failure-tolerant replication and failover

## Introduction

The goal of this guide is to set up a replication cluster managed by MaxScale that is reasonably
tolerant of failures, i.e. even if one part fails, the cluster continues to work. Additionally,
transaction data should be preserved whenever possible. All of this should work automatically.

This guide assumes that the reader is familiar with MariaDB replication and GTIDs,
[MariaDB Monitor](../reference/maxscale-monitors/mariadb-monitor.md), and
[failover](automatic-failover-with-mariadb-monitor.md).

## The problem with normal replication

The basic problem of replication is that the primary and replica are not always in the same
state. When a commit is performed on the primary, the primary updates both the actual database file
and the binary log. These items are updated in a transactional manner: either both succeed or both
fail. Then, the primary sends the binary log event to the replicas and they update their own
databases and logs.

A replica may crash or lose connection to the primary. Fortunately, this is not a big issue as once
the replica returns, it can simply resume replication from where it left off. The replica cannot
diverge as it is always either in the same state as the primary, or behind. The replica is only lost
if the primary no longer has the binary logs from the point at which the replica went down.

If the primary crashes or loses network connection, failover may lose data. This depends on the
point at which the crash happens:

<ol type="A">
  <li>If the primary managed to send all committed transactions to a replica, then all is still
well. The replica has all the data, and can be promoted to primary e.g. by MaxScale (MaxScale will
promote the most up-to-date replica). Once the old primary returns, it can rejoin the cluster.</li>
  <li>If the primary crashes just after it committed a transaction and updated its binary log, but
before it sent the binary log event to a replica, then failover loses data and the old primary can
no longer rejoin the cluster.</li>
</ol>

Let’s look at situation B in more detail. *server1* is the original primary and its replicas are
*server2* and *server3*, with server ids 1, 2 and 3, respectively. *server1* is at gtid 1-1-101 when
it crashes while the others have replicated up to the previous event 1-1-100. The example server
status output below is for demonstration only, since in reality it would be unlikely that the
monitor would manage to update the gtid-position of *server1* right at the moment of the crash.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Write  │ Down               │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-100 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-100 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

*server1* stays down long enough for failover to activate (in MaxScale, the time is roughly
*monitor_interval* * *failcount*). *server2* gets promoted, and MaxScale routes any new writes to
it. *server2* starts generating binary log events with gtids 1-2-101, 1-2-102 and so on. If
*server1* now comes back online, it can no longer rejoin as it is at gtid 1-1-101, which conflicts
with 1-2-101.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Up     │                    │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Write  │ Primary            │ 1-2-102 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-2-102 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

At this point, the DBA could forcefully alter the gtid of *server1*, setting it to 1-1-100, which is
in *server2*’s binary log, enabling rejoin. This is usually ill-advised, as changing the gtid does
not roll back the actual data in *server1*’s database, meaning that data conflicts can still happen,
perhaps days later. A more reliable way to handle this case is to rebuild *server1* from one of the
other servers (MaxScale can help with this process, but it requires configuration and [manual
launching](../reference/maxscale-monitors/mariadb-monitor.md#backup-operations)).

If the old primary returns before failover activates, then replication can continue regardless of
the exact moment at which the crash happened. **This means that the DBA should configure automatic
failover to happen only after the primary has been down so long that the downsides of the service
outage outweigh the threat of losing data and having to rebuild the old primary.**
*monitor_interval* * *failcount* should at minimum be large enough so that failover does not
trigger due to a momentary network failure.
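
For illustration, a minimal sketch of such a monitor section in `maxscale.cnf`; the server names,
credentials and timing values here are assumptions for the example, not recommendations:

```ini
[Monitor1]
type=monitor
module=mariadbmon
servers=server1,server2,server3
user=maxuser
password=maxpwd
# Failover triggers roughly monitor_interval * failcount after the primary
# stops responding: 2000ms * 5 = about 10 seconds with these values.
monitor_interval=2000ms
failcount=5
auto_failover=true
auto_rejoin=true
```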

## Semi-synchronous replication

[Semi-synchronous replication](https://mariadb.com/docs/server/ha-and-performance/standard-replication/semisynchronous-replication)
offers a more reliable, but also more complicated, way to keep the cluster in sync. Semi-sync
replication means that after the primary commits a transaction, it does not immediately return an OK
to the client. The primary instead sends the binary log update to the replicas and waits for an
acknowledgement from at least one replica before sending the OK-message back to the client. This
means that once the client gets the OK, the transaction data is typically on at least two
servers. This is not absolutely certain, as the primary does not wait forever for the replica
acknowledgement. If no replica responds in time, the primary switches to normal replication and
returns OK to the client. This timeout is controlled by the MariaDB Server setting
**rpl_semi_sync_master_timeout**. If this limit is hit, the client will notice it as a visible stall
in the transaction. Even if the limit is not hit, throughput suffers compared to normal
replication due to the additional waiting.
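
For illustration, the timeout can be inspected and adjusted at runtime; a minimal sketch, where the
6000 millisecond value is just an example:

```sql
-- Current timeout (in milliseconds) and how many transactions have been
-- committed without a semi-sync acknowledgement.
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_timeout';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_no_tx';
-- Raise the timeout at runtime.
SET GLOBAL rpl_semi_sync_master_timeout = 6000;
```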

Semi-synchronous replication by itself does not remove the possibility of the primary diverging
after a crash. Scenario B, presented in the previous section, can still happen if the primary
crashes after applying the transaction but before sending it to replicas. To increase crash safety,
a few settings need to be tuned to change the behavior of the primary server both during replication
and during startup after a crash (crash recovery).

### Enable semi-synchronous replication

To enable semi-synchronous replication, add the following to the configuration files of all
servers:

```ini
rpl_semi_sync_master_enabled=ON
rpl_semi_sync_slave_enabled=ON
```

These settings allow the servers to act as both semi-sync primary and replica, which is useful when
combined with automatic failover. Restart the servers and run `show status like 'rpl%';` to see the
semi-sync-related status variables. Check the values of Rpl_semi_sync_master_clients,
Rpl_semi_sync_master_status and Rpl_semi_sync_slave_status. On the primary, their values should be:

1. Rpl_semi_sync_master_clients \<number of replicas\>
2. Rpl_semi_sync_master_status ON
3. Rpl_semi_sync_slave_status OFF

On the replicas, the values should be:

1. Rpl_semi_sync_master_clients 0
2. Rpl_semi_sync_master_status ON
3. Rpl_semi_sync_slave_status ON
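
The two `rpl_semi_sync_*_enabled` settings can also be turned on at runtime instead of waiting for a
restart; a minimal sketch, assuming a user with sufficient privileges (the configuration file
entries are still needed so the change survives a restart):

```sql
-- Enable semi-synchronous replication dynamically on a running server.
SET GLOBAL rpl_semi_sync_master_enabled = ON;
SET GLOBAL rpl_semi_sync_slave_enabled = ON;
-- On a replica, restart the replication threads so that it registers with
-- the primary as a semi-sync client.
STOP SLAVE;
START SLAVE;
```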

### Configure wait point and startup role

Next, change the point at which the primary server waits for the replica acknowledgement. By
default, this is after the transaction is written to storage, which is not much different from
normal replication. Set the following in the server configuration files:

```ini
rpl_semi_sync_master_wait_point=AFTER_SYNC
```

AFTER_SYNC means that the primary sends the binary log event to replicas after writing it to the
binary log but before committing the actual transaction to its own storage. The primary only updates
its internal storage once at least one replica gives the reply or the rpl_semi_sync_master_timeout
is reached. More importantly, this means that if the primary crashes while waiting for the reply,
its binary log and storage engine will be out of sync. On startup, the server must thus decide what
to do: either consider the binary log correct and apply the missing transactions to storage or
discard the unverified binary log events.

As of MariaDB Server 10.6.19, 10.11.9, 11.1.6, 11.2.5, 11.4.3 and 11.5.2, this decision is
controlled by the startup option
[init-rpl-role](https://mariadb.com/docs/server/ha-and-performance/standard-replication/semisynchronous-replication#init-rpl-role).
If set to MASTER, the server applies the transactions during startup, as it assumes it is still
the primary. If init-rpl-role is set to SLAVE, the server discards the transactions. The former
option does not improve the situation after a failover, as the primary could apply transactions that
never made it to a replica. The latter option, on the other hand, ensures that when the old primary
comes back online, it does not have conflicting transactions and can rejoin the cluster as a
replica. So, add the following to all server configurations:

```ini
init-rpl-role=SLAVE
```

### Configure service restart delay

This scheme is not entirely without issues. `init-rpl-role=SLAVE` means that the old primary
(*server1*) will always discard the unverified transactions during startup, even if the data did
successfully replicate to a replica before the crash (*server1* crashed after sending the data but
before receiving the reply). This is not an issue if failover has already occurred by this point, as
*server1* can just fetch the same writes from the new primary (*server2*). However, if *server1*
comes back online quickly, before failover, it could be behind *server2*: *server1* at gtid 1-1-100
and *server2* at 1-1-101.

```bash
$ maxctrl list servers
┌──────────┬─────────────────┬──────┬─────────────┬────────┬────────────────────┬─────────┬──────────┐
│ Server   │ Address         │ Port │ Connections │ Status │ Status Info        │ GTID    │ Monitor  │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server1  │ 192.168.121.51  │ 3306 │ 0           │ Write  │ Primary            │ 1-1-100 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server2  │ 192.168.121.190 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-101 │ Monitor1 │
├──────────┼─────────────────┼──────┼─────────────┼────────┼────────────────────┼─────────┼──────────┤
│ server3  │ 192.168.121.112 │ 3306 │ 0           │ Read   │ Replica, read_only │ 1-1-101 │ Monitor1 │
└──────────┴─────────────────┴──────┴─────────────┴────────┴────────────────────┴─────────┴──────────┘
```

As *server1* is still the primary, MaxScale sends any new writes to it. The next write will get gtid
1-1-101, meaning that both servers are on the same gtid but their actual contents likely
differ. This will cause a replication error at some point.

This means that if a primary server crashes, it must stay down long enough for failover to occur, so
that MaxScale no longer treats it as the primary once it returns. This can be enforced by changing
the service settings. Run `sudo systemctl edit mariadb.service` on all server machines and add the
following:

```ini
[Service]
RestartSec=1min
```

The configured time needs to be comfortably longer than the *monitor_interval* * *failcount*
configured for the MariaDB Monitor in MaxScale.
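
As a worked example, with the illustrative monitor settings shown earlier (*monitor_interval* of
2000ms and *failcount* of 5), failover triggers roughly 10 seconds after the primary stops
responding, so `RestartSec=1min` leaves a comfortable margin.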

### Failure scenarios

The setup described above protects against a primary server crashing and diverging from the rest of
the cluster. It does not protect from data loss due to network outages. If the connection to the
primary server is lost during a transaction commit (before the data was replicated), the primary
will eventually apply the transaction to its storage. If a failover occurred during the network
outage, the rest of the cluster has already continued under a new primary, leaving the old one
diverged. This is similar to normal replication.

Some queries are not transactional and may still risk diverging replication. These are typically
DDL-queries such as CREATE TABLE, ALTER TABLE and so on. DDL-queries can be transactional if the
changes they make are “small”, such as renaming a column. Large-scale DDL-queries (e.g. ADD COLUMN
to a table with a billion rows) are more vulnerable. The settings presented in this document were
only tested against simple DML-queries that updated a single row.

### Client perspective

If the client gets an OK to its commit command, then it knows that (assuming no semi-sync timeout
happened) the transaction is on at least two servers. Only the primary has certainly processed the
update at this point; the replica may only have the event in its relay log. This means that
SELECT-queries routed to a replica (e.g. by the ReadWriteSplit-router) may see old data.

If the client does not get an OK due to a primary crash, then the situation is more ambiguous:

<ol type="A">
  <li>Primary crashed before starting the commit</li>
  <li>Primary crashed just before receiving the replica acknowledgement</li>
  <li>Primary crashed just as it was about to send the OK</li>
</ol>

In case A, the transaction is lost. In case B, the transaction is present on the replica and will be
visible at some point. In case C, the transaction is present on both servers. Since MaxScale cannot
know which case is in effect, it does not attempt transaction replay (even if configured) and
disconnects the client. It’s up to the client to then reconnect and figure out the status of the
transaction.
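
How the client figures this out is application-specific, but the usual approach is to make the
outcome checkable from the data itself. A minimal sketch, assuming a hypothetical `orders` table in
which the application stores a client-generated `request_id` with every write:

```sql
-- After reconnecting, and once the cluster has settled on a primary, check
-- whether the interrupted commit took effect. Table and column names are
-- only an example.
SELECT COUNT(*) FROM orders WHERE request_id = 'c0ffee-42';
-- 1 row:  the transaction survived (case B or C) and must not be applied again.
-- 0 rows: the transaction was lost (case A) and can be retried.
```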

### Test summary

The server settings used during testing are below. rpl_semi_sync_master_timeout is given in
milliseconds, rpl_semi_sync_slave_kill_conn_timeout in seconds.

```ini
rpl_semi_sync_master_enabled=ON
rpl_semi_sync_slave_enabled=ON
rpl_semi_sync_master_wait_point=AFTER_SYNC
rpl_semi_sync_master_timeout=6000
rpl_semi_sync_slave_kill_conn_timeout=5
init-rpl-role=SLAVE
gtid_strict_mode=1
log_slave_updates=1
```

The MariaDB Monitor in MaxScale was configured with both *auto_failover* and *auto_rejoin* enabled.
Hundreds of failovers with continuous write traffic succeeded without a diverging old primary. When
*init-rpl-role* was changed to MASTER, replication eventually broke, although this could take some
time.

<sub>_This page is licensed: CC BY-SA / Gnu FDL_</sub>

{% @marketo/form formId="4316" %}