design: add More Zero-Downtime Upgrades design doc #34602
Conversation
Force-pushed from b5035b2 to 262ec33.
> works. We have to audit whether it would work for both the old and new version
> to write at the same time. This is important for builtin tables that are not
> derived from catalog state, for example `mz_sessions`, the audit log, and
> probably others.
Isn't the audit log derived from catalog state? Are there any audit events we emit not in response to DDL?
In general, it seems problematic for the old environment to keep writing to any builtin tables. AFAIK, the new environment completely truncates the builtin tables (except the audit log) when it promotes, and re-populates them. If the old environment continues to write, we could quickly end up with inconsistent data. For example, entries from mz_sessions could be removed twice, or could leak.
> TODO: Figure out if we want to allow background tasks to keep writing. This
> includes, but is not limited to storage usage collection.
Also all the storage-managed collections, right?
> The lived experience for users will be: **no perceived downtime for DQL and
> DML, and an error message about DDL not being allowed during version cutover,
> which should be on the order of 10s of seconds**.
I wonder if it would be less disruptive to indefinitely hang DDL, instead of returning an error. The client would see a very slow DDL query that eventually fails due to a connection error when the old envd shuts down. Clients should retry on connection errors and the retry would connect to the new envd and succeed. So instead of erroring DDL we would have slow DDL during an upgrade, which doesn't require any special handling. That is unless the client has a timeout configured...
A bit sneaky, but could work! Erroring is more honest but this is fudging things during the upgrade
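For illustration, a minimal sketch of the "park DDL instead of erroring" idea (hypothetical names and types, not the actual coordinator code), assuming an async coordinator:

```rust
use std::future::pending;

/// Hypothetical outcome type for sequenced statements.
enum StatementOutcome {
    Done,
}

/// Hypothetical slice of coordinator state during an upgrade.
struct Coordinator {
    lame_duck: bool,
}

impl Coordinator {
    /// Sketch: how DDL could be handled while the old envd is a lame duck.
    async fn sequence_ddl(&self) -> StatementOutcome {
        if self.lame_duck {
            // Instead of returning an error, park the request forever. The
            // client sees a slow DDL statement that ends in a connection
            // error when the old envd finally shuts down; its retry then
            // lands on the new envd and succeeds.
            pending::<()>().await;
        }
        // Normal path: stage and commit the catalog transaction (elided).
        StatementOutcome::Done
    }
}
```

Whether this is acceptable depends on client-side statement timeouts, as noted above.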
> and will serve DQL/DML workload off of the catalog snapshot that it has. An
> important detail to figure out here is what happens when the old-version
> `environmentd` process crashes while we're in a lame-duck phase. If the since
> of the catalog shard has advanced, it will not be able to restart and read the
> catalog at the old version that it understands. This may require holding back
> the since of the catalog shard during upgrades or a similar mechanism.
Is there reason to assume the old envd would restart faster than the new envd? There is some migration work to be done, but hopefully that doesn't increase startup time significantly, right? If we don't think the old envd can overtake the new envd then we don't need to worry about ensuring the old envd can start up again.
ah yeah, that sounds interesting!
> We have to audit and find out which is which. For builtin tables that mirror
> catalog state we can use the self-correcting approach that we also use for
> materialized views and for a number of builtin storage-managed collections. For
Using the self-correcting approach for builtin tables seems dubious! Consider the case of mz_sessions: the old envd likely has a set of active sessions that it thinks should be in there, while the new envd likely has no active sessions. They would fight, with the new envd continually deleting all the contents and the old envd continually inserting them again. When you query mz_sessions you might see the expected contents, or an empty table. If we simply stopped writing the builtin tables from the old envd you would always see an empty mz_sessions table, which doesn't seem much worse, but would be much less effort.
Ah, but I guess "simply stop writing in the old envd" isn't sufficient because we also need to ensure that the old envd stops writing before the new envd has started writing to the builtin tables, which I don't think we have a way to do currently.
Yes, this section is a bit hand-wavy, but the general idea is that we have to do something to allow multiple envds to collaborate. For zero-downtime but also for other things, so at some point we need to think about it and figure it out.
> For builtin tables that mirror catalog state we can use the self-correcting approach
For those builtin tables that mirror catalog state, I would imagine the situation to not be too complicated: since DDL is not allowed during the interesting time, the only catalog change is the new envd doing catalog migrations, right? If this is true, then there is no fighting needed between the two envds: could we simply let the new envd also change the builtin tables when it does the migrations? (And the same might be true for the audit log, if it's true that the audit log changes only when the catalog changes.)
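To make the "fighting" discussed above concrete, here is a toy version of the self-correcting pattern (illustrative only, not the real persist-sink logic): each writer continually emits the difference between what it thinks the collection should contain and what it observes. Two writers with different desired states keep retracting each other's rows.

```rust
use std::collections::BTreeMap;

/// A collection modeled as a multiset of rows (row -> multiplicity).
type Contents = BTreeMap<String, i64>;

/// Toy version of the self-correcting step: emit whatever updates are needed
/// to turn `actual` into `desired`. The real implementation works on persist
/// batches and timestamps; this only shows the shape of the logic.
fn corrections(desired: &Contents, actual: &Contents) -> Vec<(String, i64)> {
    let mut updates = Vec::new();
    for (row, want) in desired {
        let have = actual.get(row).copied().unwrap_or(0);
        if want - have != 0 {
            updates.push((row.clone(), want - have));
        }
    }
    for (row, have) in actual {
        if !desired.contains_key(row) {
            updates.push((row.clone(), -have));
        }
    }
    updates
}

fn main() {
    // The old envd thinks these sessions exist; the new envd thinks none do.
    let old_desired: Contents = [("session-1".into(), 1), ("session-2".into(), 1)].into();
    let new_desired: Contents = Contents::new();

    // Suppose the shard currently matches the old envd's view...
    let shard = old_desired.clone();
    // ...then the new envd retracts everything:
    println!("{:?}", corrections(&new_desired, &shard));
    // ...and the old envd immediately re-inserts it, and so on.
    println!("{:?}", corrections(&old_desired, &Contents::new()));
}
```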
> I think they are ready, but sources will have a bit of a fight over who gets to
> read, for example for Postgres. Kafka is already fine because it already
> supports concurrently running ingestion instances.
How does the Postgres thing work today? When a source starts up, it takes over the replication slot and crashes the other source? So both the old and the new source would be in a restart loop but hopefully still making progress?
Seems like we could just shut down all source dataflows when we enter "lame duck" mode, no?
Yep, it's the fighting and crashing. And yes, we could shut them down in lame-duck mode!
Thinking more about this, shutting them down in lame-duck mode isn't a good idea because it still takes a bit until the sources in the new leader env start writing. The new envd only sends the AllowWrites command after bootstrapping, so there would be 10s of seconds where the old sources have shut down but the new ones can't write yet.
Better if the new sources could fence out the old ones somehow.
Hmm, I had that same thought and thought I had actually commented this here, but seems not ... 🙈
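For completeness, "fence out the old ones" could look roughly like the epoch/generation fencing used elsewhere in the system: the new-version ingestion claims a higher epoch, and any write attempt with a stale epoch is rejected. A toy sketch with entirely made-up names (none of this is existing API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Toy stand-in for whatever durable, linearizable state the fencing token
/// would live in (in reality this would be persist / the shard itself, not an
/// in-process atomic).
struct FenceToken {
    epoch: AtomicU64,
}

impl FenceToken {
    /// A writer (an ingestion instance) registers with its deploy generation.
    /// Claiming a newer epoch fences out everything older.
    fn claim(&self, epoch: u64) -> bool {
        self.epoch.fetch_max(epoch, Ordering::SeqCst) <= epoch
    }

    /// Every write first checks that the writer's epoch is still current.
    fn check(&self, epoch: u64) -> Result<(), String> {
        let current = self.epoch.load(Ordering::SeqCst);
        if epoch < current {
            Err(format!("fenced: epoch {epoch} < current {current}"))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let token = FenceToken { epoch: AtomicU64::new(0) };
    assert!(token.claim(1)); // old-version source registers
    assert!(token.claim(2)); // new-version source fences it out
    assert!(token.check(1).is_err()); // old source can no longer write
    assert!(token.check(2).is_ok());
}
```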
Force-pushed from 262ec33 to 28c37c0.
> reaped, it can still serve queries but not apply writes to the catalog
> anymore, so cannot process DDL queries
> 6. New `environmentd` re-establishes connection to clusters, brings them out of
> read-only mode
When does the new environmentd apply catalog migrations?
At that point the lame duck old environmentd can no longer serve queries, right? How do we fence out the old one at that point?
Migrations are applied at step 4, when the new version opens in read-write mode. The point of lame-duck mode is that the old version notices this but, instead of halting (as we do today), keeps a snapshot of the catalog as we have it and keeps serving queries from that. That way we can serve SELECT and INSERT traffic, but not DDL, which would require the old version to still be able to write to the catalog.
That last bit is future work: we want to add forward/backward compat so that the two versions can both keep serving all types of queries.
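Spelled out, the lame-duck behavior described above amounts to something like this (hypothetical types, purely illustrative):

```rust
/// Hypothetical classification of incoming statements.
enum StatementKind {
    Select,
    Insert,
    Ddl,
}

/// Hypothetical catalog handle states in the old-version envd.
enum CatalogState {
    /// Normal operation: can read and write the catalog shard.
    Writable,
    /// Lame-duck mode after being fenced by the new version: we keep serving
    /// from the last snapshot we read, but can no longer commit catalog writes.
    LameDuck {},
}

fn handle(kind: StatementKind, state: &CatalogState) -> Result<(), String> {
    match (state, kind) {
        // DQL/DML only needs the (frozen) snapshot, so it keeps working.
        (_, StatementKind::Select) | (_, StatementKind::Insert) => Ok(()),
        // DDL needs a catalog write, which a fenced-out envd cannot do.
        (CatalogState::LameDuck { .. }, StatementKind::Ddl) => {
            Err("DDL is not available during a version cutover".into())
        }
        (CatalogState::Writable, StatementKind::Ddl) => Ok(()),
    }
}
```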
I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true. Couldn't you eg. write into tables that you'd had permissions revoked for, or that were deleted, or had their schema changed? (And similar questions for reads?)
> I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true.
Seems fine to me. Since DDL is not allowed, the only changes after the snapshot are the catalog migrations that the new envd does, right? Even if that invalidates some operation that a client is doing on the old envd (by e.g. revoking some permission), the user can't really complain, since the user can't expect the catalog migrations to take place at a certain point in time. This is different from e.g. an explicit REVOKE command that the user can expect to take place immediately after it returns successfully to the user. In other words, before the cutover happens, permissions are as if the upgrade were just not taking place yet.
This depends a bit on how the network rerouting is implemented. Is it possible to ensure that all clients connect to the new envd at the same time? As in: When the first client issues a query against the new envd, is it guaranteed that no other client can run a query against the old envd subsequently? If not, there is a period where clients are connected to both envds simultaneously and then Ben's revoke scenario becomes relevant.
I'm also wondering how this is going to work in the HA/use-case-isolation work. If a REVOKE was committed on one envd, the other envds will learn about that with a delay. Does that mean we'll have to sync the catalog for every SELECT query that gets executed?
> Does that mean we'll have to sync the catalog for every SELECT query that gets executed?
It does indeed! We wrote about that quite a while ago in the design doc about the logical architecture of pv2:
> ### ADAPTER
I think there is no way around it in a distributed system where you still want strict serializability. You wouldn't do the actual implementation as described there, though. What we had in mind back then is that yes, you introduce a catalog timestamp, and you change the API of the timestamp oracle to allow getting two timestamps in one operation. So when a query comes in you get (query ts, catalog ts) instead of just a read ts, and then you compare the catalog snapshot you already have in memory against that and only need to sync to the latest catalog changes when you're outdated.
In practice, this means that you get excellent performance when you don't see much DDL. When you see DDL right when you do your query, you might have to pay the cost of syncing to the latest catalog changes.
There are some more far-out ideas that we can discuss when we're all at the onsite next week.
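A sketch of that oracle shape, with made-up names (the real timestamp oracle API differs): one linearized call returns both a query timestamp and a catalog timestamp, and a session only pays for a catalog sync when its snapshot is older than the catalog timestamp.

```rust
/// Hypothetical combined read from the timestamp oracle.
struct OracleRead {
    query_ts: u64,
    catalog_ts: u64,
}

/// Hypothetical oracle trait; the point is that both timestamps come back
/// from a single linearized operation.
trait TwoTimestampOracle {
    fn read_ts(&self) -> OracleRead;
}

/// Hypothetical cached catalog snapshot in an envd / adapter frontend.
struct CatalogSnapshot {
    as_of: u64,
}

impl CatalogSnapshot {
    /// Sync to at least `catalog_ts` by applying newer catalog updates.
    fn sync_to(&mut self, catalog_ts: u64) {
        // ...apply catalog updates up to `catalog_ts` (elided)...
        self.as_of = catalog_ts;
    }
}

/// Per-query flow: only sync the catalog when the snapshot is outdated, so
/// the common case (no recent DDL) stays cheap.
fn plan_query(oracle: &impl TwoTimestampOracle, catalog: &mut CatalogSnapshot) -> u64 {
    let ts = oracle.read_ts();
    if catalog.as_of < ts.catalog_ts {
        catalog.sync_to(ts.catalog_ts);
    }
    // Plan and execute against `catalog` at `ts.query_ts`.
    ts.query_ts
}
```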
> I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true. Couldn't you eg. write into tables that you'd had permissions revoked for, or that were deleted, or had their schema changed? (And similar questions for reads?)
Yeah, this is the part that also makes me hesitant about the design. The hand-wavy reasoning is that you cannot have catalog changes while the lame-duck envd is serving queries. And once the new version is serving queries the lame-duck one is not getting any more queries. But it's probably true that you cannot guarantee a "100% airtight" cutover from old to new.
IMO, if we determine that we can't go with the lame-duck approach, we still have to do all the suggested work, plus then the actual forward/backward compat and following of catalog changes. Which will also give us most of use-case isolation as a side effect. I like that, but it's more work, so I wanted to explore alternatives for delivering something quicker. 🕊️
> The observation that makes this proposal work is that neither the schema nor
> the contents of user collections (so tables, sources, etc.) change between
> versions. So both the old version and the new version _should_ be able to
> collaborate in writing down source data, accepting INSERTs for tables and
> serving SELECT queries.
What if we want to change the persisted data format? Or is that addressed by this later part of the document?
> A jumping-off point for this work is the recent work that allows multiple
> versions to work with persist shards. Here we already have forward and backward
> compatibility by tracking what versions are still "touching" a shard and
> deferring the equivalent of migrations to a moment when no older versions are
> touching the shard.
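For context, the version-tracking mechanism that quoted paragraph refers to can be pictured roughly like this (toy model with made-up names; the real logic lives in persist): the shard records which versions are still registered against it, and a format migration only runs once the oldest registered version is new enough.

```rust
use std::collections::BTreeSet;

/// Toy model of a shard's registered readers/writers, tracked by the code
/// version they run at.
struct ShardVersions {
    registered: BTreeSet<u64>,
}

impl ShardVersions {
    /// A format migration that only readers/writers at `min_version` or newer
    /// understand must wait until no older version still touches the shard.
    fn can_migrate_to(&self, min_version: u64) -> bool {
        match self.registered.first() {
            Some(oldest) => *oldest >= min_version,
            None => true,
        }
    }
}

fn main() {
    let mut shard = ShardVersions { registered: [141, 142].into() };
    // An old version is still registered: defer the migration.
    assert!(!shard.can_migrate_to(142));
    // Once the old envd deregisters, the migration can run.
    shard.registered.remove(&141);
    assert!(shard.can_migrate_to(142));
}
```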
> We can even render that as a dataflow and attempt the read-then-write directly
> on a cluster, if you squint this is almost a "one-shot continual task" (if
> you're familiar with that). But we could initially keep that loop in
> `environmentd`, closer to how the current implementation works.
(You could link https://github.com/MaterializeInc/database-issues/issues/6686 here.)
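For illustration, the "keep that loop in `environmentd`" variant quoted above is essentially a read-then-write retry loop against the table shard. A toy sketch assuming a compare-and-append style primitive (hypothetical trait and names, not persist's actual API):

```rust
/// Toy error type: the shard's upper moved since we read.
struct UpperMismatch {
    current_upper: u64,
}

/// Hypothetical handle to a table shard with compare-and-append semantics.
trait TableShard {
    fn snapshot(&self, as_of: u64) -> Vec<(String, i64)>;
    fn upper(&self) -> u64;
    fn compare_and_append(
        &mut self,
        updates: Vec<(String, i64)>,
        expected_upper: u64,
        new_upper: u64,
    ) -> Result<(), UpperMismatch>;
}

/// The read-then-write loop: read the current contents, compute what to
/// write, and retry if somebody else appended in the meantime.
fn read_then_write(
    shard: &mut impl TableShard,
    mut compute_updates: impl FnMut(&[(String, i64)]) -> Vec<(String, i64)>,
) {
    loop {
        let upper = shard.upper();
        let contents = shard.snapshot(upper.saturating_sub(1));
        let updates = compute_updates(&contents);
        match shard.compare_and_append(updates, upper, upper + 1) {
            Ok(()) => return,
            // Lost the race: some other writer advanced the upper, so loop
            // around, re-read, and try again.
            Err(UpperMismatch { .. }) => continue,
        }
    }
}
```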
> We cut over network routes once the new deployment is fully ready, so any
> residual downtime is the route change itself. During that window the old
> deployment still accepts connections but rejects DDL with an error message.
It's not really clear to me what "that window" refers to in this sentence. I would imagine the change in balancerd of starting to route incoming connections to the new envd to be instantaneous. Or does the "that window" refer to the time period from the previous paragraph instead, i.e., the time of the "lame-duck mode"?
Force-pushed from 6c7054c to df05e5c.
Force-pushed from df05e5c to 87514e9.
Rendered: https://github.com/aljoscha/materialize/blob/design-more-zero-downtime/doc/developer/design/20251219_more_zero_downtime_upgrades_physical_isolation_high_availability.md