design: add More Zero-Downtime Upgrades design doc #34602
Conversation
Force-pushed from b5035b2 to 262ec33.
> works. We have to audit whether it would work for both the old and new version
> to write at the same time. This is important for builtin tables that are not
> derived from catalog state, for example `mz_sessions`, the audit log, and
> probably others.
Isn't the audit log derived from catalog state? Are there any audit events we emit not in response to DDL?
In general, it seems problematic for the old environment to keep writing to any builtin tables. AFAIK, the new environment completely truncates the builtin tables (except the audit log) when it promotes, and re-populates them. If the old environment continues to write, we could quickly end up with inconsistent data. For example, entries from mz_sessions could be removed twice, or could leak.
> TODO: Figure out if we want to allow background tasks to keep writing. This
> includes, but is not limited to storage usage collection.
Also all the storage-managed collections, right?
> The lived experience for users will be: **no perceived downtime for DQL and
> DML, and an error message about DDL not being allowed during version cutover,
> which should be on the order of 10s of seconds**.
I wonder if it would be less disruptive to indefinitely hang DDL, instead of returning an error. The client would see a very slow DDL query that eventually fails due to a connection error when the old envd shuts down. Clients should retry on connection errors and the retry would connect to the new envd and succeed. So instead of erroring DDL we would have slow DDL during an upgrade, which doesn't require any special handling. That is unless the client has a timeout configured...
A bit sneaky, but could work! Erroring is more honest but this is fudging things during the upgrade
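For illustration, a minimal sketch of the "park DDL instead of erroring" idea (hypothetical names and types, not the actual coordinator code), assuming an async coordinator:

```rust
use std::future::pending;

/// Hypothetical outcome type for sequenced statements.
enum StatementOutcome {
    Done,
}

/// Hypothetical slice of coordinator state during an upgrade.
struct Coordinator {
    lame_duck: bool,
}

impl Coordinator {
    /// Sketch: how DDL could be handled while the old envd is a lame duck.
    async fn sequence_ddl(&self) -> StatementOutcome {
        if self.lame_duck {
            // Instead of returning an error, park the request forever. The
            // client sees a slow DDL statement that ends in a connection
            // error when the old envd finally shuts down; its retry then
            // lands on the new envd and succeeds.
            pending::<()>().await;
        }
        // Normal path: stage and commit the catalog transaction (elided).
        StatementOutcome::Done
    }
}
```

Whether this is acceptable depends on client-side statement timeouts, as noted above.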
> and will serve DQL/DML workload off of the catalog snapshot that it has. An
> important detail to figure out here is what happens when the old-version
> `environmentd` process crashes while we're in a lame-duck phase. If the since
> of the catalog shard has advanced, it will not be able to restart and read the
> catalog at the old version that it understands. This may require holding back
> the since of the catalog shard during upgrades or a similar mechanism.
Is there reason to assume the old envd would restart faster than the new envd? There is some migration work to be done, but hopefully that doesn't increase startup time significantly, right? If we don't think the old envd can overtake the new envd then we don't need to worry about ensuring the old envd can start up again.
ah yeah, that sounds interesting!
> We have to audit and find out which is which. For builtin tables that mirror
> catalog state we can use the self-correcting approach that we also use for
> materialized views and for a number of builtin storage-managed collections. For
Using the self-correcting approach for builtin tables seems dubious! Consider the case of mz_sessions: the old envd likely has a set of active sessions that it thinks should be in there, while the new envd likely has no active sessions. They would fight, with the new envd continually deleting all the contents and the old envd continually inserting them again. When you query mz_sessions you might see the expected contents, or an empty table. If we simply stopped writing the builtin tables from the old envd you would always see an empty mz_sessions table, which doesn't seem much worse, but would be much less effort.
Ah, but I guess "simply stop writing in the old envd" isn't sufficient because we also need to ensure that the old envd stops writing before the new envd has started writing to the builtin tables, which I don't think we have a way to do currently.
Yes, this section is a bit hand-wavy, but the general idea is that we have to do something to allow multiple envds to collaborate. For zero-downtime but also for other things, so at some point we need to think about it and figure it out.
> For builtin tables that mirror catalog state we can use the self-correcting approach
For those builtin tables that mirror catalog state, I would imagine the situation to not be too complicated: since DDL is not allowed during the interesting time, the only catalog change is the new envd doing catalog migrations, right? If this is true, then there is no fighting needed between the two envds: could we simply let the new envd also change the builtin tables when it does the migrations? (And the same might be true for the audit log, if it's true that the audit log changes only when the catalog changes.)
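To make the "fighting" discussed above concrete, here is a toy version of the self-correcting pattern (illustrative only, not the real persist-sink logic): each writer continually emits the difference between what it thinks the collection should contain and what it observes. Two writers with different desired states keep retracting each other's rows.

```rust
use std::collections::BTreeMap;

/// A collection modeled as a multiset of rows (row -> multiplicity).
type Contents = BTreeMap<String, i64>;

/// Toy version of the self-correcting step: emit whatever updates are needed
/// to turn `actual` into `desired`. The real implementation works on persist
/// batches and timestamps; this only shows the shape of the logic.
fn corrections(desired: &Contents, actual: &Contents) -> Vec<(String, i64)> {
    let mut updates = Vec::new();
    for (row, want) in desired {
        let have = actual.get(row).copied().unwrap_or(0);
        if want - have != 0 {
            updates.push((row.clone(), want - have));
        }
    }
    for (row, have) in actual {
        if !desired.contains_key(row) {
            updates.push((row.clone(), -have));
        }
    }
    updates
}

fn main() {
    // The old envd thinks these sessions exist; the new envd thinks none do.
    let old_desired: Contents = [("session-1".into(), 1), ("session-2".into(), 1)].into();
    let new_desired: Contents = Contents::new();

    // Suppose the shard currently matches the old envd's view...
    let shard = old_desired.clone();
    // ...then the new envd retracts everything:
    println!("{:?}", corrections(&new_desired, &shard));
    // ...and the old envd immediately re-inserts it, and so on.
    println!("{:?}", corrections(&old_desired, &Contents::new()));
}
```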
> I think they are ready, but sources will have a bit of a fight over who gets to
> read, for example for Postgres. Kafka is already fine because it already
> supports concurrently running ingestion instances.
How does the Postgres thing work today? When a source starts up, it takes over the replication slot and crashes the other source? So both the old and the new source would be in a restart loop but hopefully still making progress?
Seems like we could just shut down all source dataflows when we enter "lame duck" mode, no?
Yep, it's the fighting and crashing. And yes, we could shut them down in lame-duck mode!
Thinking more about this, shutting them down in lame-duck mode isn't a good idea because it still takes a bit until the sources in the new leader env start writing. The new envd only sends the AllowWrites command after bootstrapping, so there would be 10s of seconds where the old sources have shut down but the new ones can't write yet.
Better if the new sources could fence out the old ones somehow.
Hmm, I had that same thought and thought I had actually commented this here, but seems not ... 🙈
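For completeness, "fence out the old ones" could look roughly like the epoch/generation fencing used elsewhere in the system: the new-version ingestion claims a higher epoch, and any write attempt with a stale epoch is rejected. A toy sketch with entirely made-up names (none of this is existing API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Toy stand-in for whatever durable, linearizable state the fencing token
/// would live in (in reality this would be persist / the shard itself, not an
/// in-process atomic).
struct FenceToken {
    epoch: AtomicU64,
}

impl FenceToken {
    /// A writer (an ingestion instance) registers with its deploy generation.
    /// Claiming a newer epoch fences out everything older.
    fn claim(&self, epoch: u64) -> bool {
        self.epoch.fetch_max(epoch, Ordering::SeqCst) <= epoch
    }

    /// Every write first checks that the writer's epoch is still current.
    fn check(&self, epoch: u64) -> Result<(), String> {
        let current = self.epoch.load(Ordering::SeqCst);
        if epoch < current {
            Err(format!("fenced: epoch {epoch} < current {current}"))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let token = FenceToken { epoch: AtomicU64::new(0) };
    assert!(token.claim(1)); // old-version source registers
    assert!(token.claim(2)); // new-version source fences it out
    assert!(token.check(1).is_err()); // old source can no longer write
    assert!(token.check(2).is_ok());
}
```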
Force-pushed from 262ec33 to 28c37c0.
> reaped, it can still serve queries but not apply writes to the catalog
> anymore, so cannot process DDL queries
> 6. New `environmentd` re-establishes connection to clusters, brings them out of
> read-only mode
When does the new environmentd apply catalog migrations?
At that point the lame duck old environmentd can no longer serve queries, right? How do we fence out the old one at that point?
Migrations are applied at step 4, when the new version opens in read-write mode. The point of lame-duck mode is that the old version notices this but, instead of halting (as we do today), keeps a snapshot of the catalog as we have it and keeps serving queries from that. That way we can serve SELECT and INSERT traffic, but not DDL, which would require the old version to still be able to write to the catalog.
That last bit is future work: we want to add forward/backward compat so that the two versions can both keep serving all types of queries.
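Spelled out, the lame-duck behavior described above amounts to something like this (hypothetical types, purely illustrative):

```rust
/// Hypothetical classification of incoming statements.
enum StatementKind {
    Select,
    Insert,
    Ddl,
}

/// Hypothetical catalog handle states in the old-version envd.
enum CatalogState {
    /// Normal operation: can read and write the catalog shard.
    Writable,
    /// Lame-duck mode after being fenced by the new version: we keep serving
    /// from the last snapshot we read, but can no longer commit catalog writes.
    LameDuck {},
}

fn handle(kind: StatementKind, state: &CatalogState) -> Result<(), String> {
    match (state, kind) {
        // DQL/DML only needs the (frozen) snapshot, so it keeps working.
        (_, StatementKind::Select) | (_, StatementKind::Insert) => Ok(()),
        // DDL needs a catalog write, which a fenced-out envd cannot do.
        (CatalogState::LameDuck { .. }, StatementKind::Ddl) => {
            Err("DDL is not available during a version cutover".into())
        }
        (CatalogState::Writable, StatementKind::Ddl) => Ok(()),
    }
}
```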
I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true. Couldn't you eg. write into tables that you'd had permissions revoked for, or that were deleted, or had their schema changed? (And similar questions for reads?)
> I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true.
Seems fine to me. Since DDL is not allowed, the only changes after the snapshot are the catalog migrations that the new envd does, right? Even if that invalidates some operation that a client is doing on the old envd (by e.g. revoking some permission), the user can't really complain, since the user can't expect the catalog migrations to take place at a certain point in time. This is different from e.g. an explicit REVOKE command that the user can expect to take place immediately after it returns successfully to the user. In other words, before the cutover happens, permissions are as if the upgrade were just not taking place yet.
This depends a bit on how the network rerouting is implemented. Is it possible to ensure that all clients connect to the new envd at the same time? As in: When the first client issues a query against the new envd, is it guaranteed that no other client can run a query against the old envd subsequently? If not, there is a period where clients are connected to both envds simultaneously and then Ben's revoke scenario becomes relevant.
I'm also wondering how this is going to work in the HA/use-case-isolation work. If a REVOKE was committed on one envd, the other envds will learn about that with a delay. Does that mean we'll have to sync the catalog for every SELECT query that gets executed?
> Does that mean we'll have to sync the catalog for every SELECT query that gets executed?
It does indeed! We wrote about that quite a while ago in the design doc about the logical architecture of pv2:
> ### ADAPTER
I think there is no way around it in a distributed system where you still want strict serializability. You wouldn't do the actual implementation as described there, though. What we had in mind back then is that yes, you introduce a catalog timestamp, and you change the API of the timestamp oracle to allow getting two timestamps in one operation. So when a query comes in you get (query ts, catalog ts) instead of just a read ts, and then you compare the catalog snapshot you already have in memory against that and only need to sync to the latest catalog changes when you're outdated.
In practice, this means that you get excellent performance when you don't see much DDL. When you see DDL right when you do your query, you might have to pay the cost of syncing to the latest catalog changes.
There are some more far-out ideas that we can discuss when we're all at the onsite next week.
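A sketch of that oracle shape, with made-up names (the real timestamp oracle API differs): one linearized call returns both a query timestamp and a catalog timestamp, and a session only pays for a catalog sync when its snapshot is older than the catalog timestamp.

```rust
/// Hypothetical combined read from the timestamp oracle.
struct OracleRead {
    query_ts: u64,
    catalog_ts: u64,
}

/// Hypothetical oracle trait; the point is that both timestamps come back
/// from a single linearized operation.
trait TwoTimestampOracle {
    fn read_ts(&self) -> OracleRead;
}

/// Hypothetical cached catalog snapshot in an envd / adapter frontend.
struct CatalogSnapshot {
    as_of: u64,
}

impl CatalogSnapshot {
    /// Sync to at least `catalog_ts` by applying newer catalog updates.
    fn sync_to(&mut self, catalog_ts: u64) {
        // ...apply catalog updates up to `catalog_ts` (elided)...
        self.as_of = catalog_ts;
    }
}

/// Per-query flow: only sync the catalog when the snapshot is outdated, so
/// the common case (no recent DDL) stays cheap.
fn plan_query(oracle: &impl TwoTimestampOracle, catalog: &mut CatalogSnapshot) -> u64 {
    let ts = oracle.read_ts();
    if catalog.as_of < ts.catalog_ts {
        catalog.sync_to(ts.catalog_ts);
    }
    // Plan and execute against `catalog` at `ts.query_ts`.
    ts.query_ts
}
```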
> I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true. Couldn't you eg. write into tables that you'd had permissions revoked for, or that were deleted, or had their schema changed? (And similar questions for reads?)
Yeah, this is the part that also makes me hesitant about the design. The hand-wavy reasoning is that you cannot have catalog changes while the lame-duck envd is serving queries. And once the new version is serving queries the lame-duck one is not getting any more queries. But it's probably true that you cannot guarantee a "100% airtight" cutover from old to new.
IMO, if we determine that we can't go with the lame-duck approach, we still have to do all the suggested work, plus then the actual forward/backward compat and following of catalog changes. Which will also give us most of use-case isolation as a side effect. I like that, but it's more work, so I wanted to explore alternatives for delivering something quicker. 🕊️
> The observation that makes this proposal work is that neither the schema nor
> the contents of user collections (so tables, sources, etc.) change between
> versions. So both the old version and the new version _should_ be able to
> collaborate in writing down source data, accepting INSERTs for tables and
> serving SELECT queries.
What if we want to change the persisted data format? Or is that addressed by this later part of the document?
> A jumping-off point for this work is the recent work that allows multiple
> versions to work with persist shards. Here we already have forward and backward
> compatibility by tracking what versions are still "touching" a shard and
> deferring the equivalent of migrations to a moment when no older versions are
> touching the shard.
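For context, the version-tracking mechanism that quoted paragraph refers to can be pictured roughly like this (toy model with made-up names; the real logic lives in persist): the shard records which versions are still registered against it, and a format migration only runs once the oldest registered version is new enough.

```rust
use std::collections::BTreeSet;

/// Toy model of a shard's registered readers/writers, tracked by the code
/// version they run at.
struct ShardVersions {
    registered: BTreeSet<u64>,
}

impl ShardVersions {
    /// A format migration that only readers/writers at `min_version` or newer
    /// understand must wait until no older version still touches the shard.
    fn can_migrate_to(&self, min_version: u64) -> bool {
        match self.registered.first() {
            Some(oldest) => *oldest >= min_version,
            None => true,
        }
    }
}

fn main() {
    let mut shard = ShardVersions { registered: [141, 142].into() };
    // An old version is still registered: defer the migration.
    assert!(!shard.can_migrate_to(142));
    // Once the old envd deregisters, the migration can run.
    shard.registered.remove(&141);
    assert!(shard.can_migrate_to(142));
}
```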
> We can even render that as a dataflow and attempt the read-then-write directly
> on a cluster, if you squint this is almost a "one-shot continual task" (if
> you're familiar with that). But we could initially keep that loop in
> `environmentd`, closer to how the current implementation works.
(You could link https://github.com/MaterializeInc/database-issues/issues/6686 here.)
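For illustration, the "keep that loop in `environmentd`" variant quoted above is essentially a read-then-write retry loop against the table shard. A toy sketch assuming a compare-and-append style primitive (hypothetical trait and names, not persist's actual API):

```rust
/// Toy error type: the shard's upper moved since we read.
struct UpperMismatch {
    current_upper: u64,
}

/// Hypothetical handle to a table shard with compare-and-append semantics.
trait TableShard {
    fn snapshot(&self, as_of: u64) -> Vec<(String, i64)>;
    fn upper(&self) -> u64;
    fn compare_and_append(
        &mut self,
        updates: Vec<(String, i64)>,
        expected_upper: u64,
        new_upper: u64,
    ) -> Result<(), UpperMismatch>;
}

/// The read-then-write loop: read the current contents, compute what to
/// write, and retry if somebody else appended in the meantime.
fn read_then_write(
    shard: &mut impl TableShard,
    mut compute_updates: impl FnMut(&[(String, i64)]) -> Vec<(String, i64)>,
) {
    loop {
        let upper = shard.upper();
        let contents = shard.snapshot(upper.saturating_sub(1));
        let updates = compute_updates(&contents);
        match shard.compare_and_append(updates, upper, upper + 1) {
            Ok(()) => return,
            // Lost the race: some other writer advanced the upper, so loop
            // around, re-read, and try again.
            Err(UpperMismatch { .. }) => continue,
        }
    }
}
```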
> We cut over network routes once the new deployment is fully ready, so any
> residual downtime is the route change itself. During that window the old
> deployment still accepts connections but rejects DDL with an error message.
It's not really clear to me what "that window" refers to in this sentence. I would imagine the change in balancerd of starting to route incoming connections to the new envd to be instantaneous. Or does the "that window" refer to the time period from the previous paragraph instead, i.e., the time of the "lame-duck mode"?
Force-pushed from 6c7054c to df05e5c.
Force-pushed from df05e5c to 87514e9.
Rendered: https://github.com/aljoscha/materialize/blob/design-more-zero-downtime/doc/developer/design/20251219_more_zero_downtime_upgrades_physical_isolation_high_availability.md