- # More Zero-Downtime Upgrades, Physical Isolation, High Availability
-
- We currently do zero-downtime upgrades, where we hydrate an `environmentd` and
- its `clusterd` processes at a new version before cutting over to that version.
- This makes it so that computation running on clusters is ready when we cut over
- and there is no downtime in processing, no lag in freshness. However, due to an
- implementation detail of how cutting over from one version to the next works
- (see context below) we have an interval of 10-30 seconds where an environment
- is not reachable by the user.
-
- This document proposes a design where we can achieve true zero-downtime
- upgrades for DML and DQL queries, with the caveat that we still have a window
- (about 10-30 seconds) where users cannot issue DDL.
-
- The long-term goal is to remove the caveats from zero-downtime upgrades and to
- remove downtime at all moments. The latter is usually tracked under the
- separate initiatives of _high availability_ and _physical isolation_. However,
- the engineering work that they all require overlaps and they thematically fit
- together so we also talk about them briefly below.
-
- The focus of this document is on improving zero-downtime upgrades in the short
- term, but I will describe below what the other initiatives entail, what
- engineering work is unique to each and what work overlaps, and sketch how we
- can achieve the long-term goal.
+ # Zero-Downtime Upgrades, Physical Isolation, and High Availability
+
+ Zero-downtime upgrades, high availability, and physical isolation are related
+ goals that all require multiple `environmentd` instances to coexist and
+ collaborate on shared durable state. One key insight connects them:
+
+ **Zero-downtime upgrades are high availability across two versions.**
+
+ The ultimate goal for zero-downtime upgrades is to have two versions of
+ Materialize running concurrently, serving traffic with no downtime, where we
+ can cut over network routes when we're confident the new version is ready. This
+ also enables us to roll out a new version, observe its behavior, and abandon
+ the update if something isn't right: true flexibility in version management.
+
+ This document proposes an incremental step toward that vision: a "lame-duck"
+ upgrade procedure that achieves true zero downtime for DML and DQL queries.
+ This is not a detour from the end goal but a direct stepping stone: it requires
+ us to build the foundational capabilities that all three initiatives need,
+ namely running multiple `environmentd` instances with concurrent access to
+ shared state.
+
+ We currently have a brief window (10-30 seconds) where an environment is
+ unreachable during upgrades (see Context for details). This proposal eliminates
+ that unreachability for DML and DQL, with DDL seeing a brief window where it
+ cannot proceed.

## Goals

@@ -105,13 +106,8 @@ availability and zero-downtime upgrades.

## Conceptual Framework: How These Initiatives Relate

- Zero-downtime upgrades, high availability, and physical isolation all require
- multiple `environmentd` instances to coexist and collaborate on shared durable
- state (the catalog). One neat insight is:
-
- **Zero-downtime upgrades are high availability across two versions.**
-
- This framing clarifies what makes each initiative unique:
+ Given that zero-downtime upgrades are high availability across two versions,
+ this framing clarifies what makes each initiative unique:

| Initiative | # of Versions | Duration | Primary Challenge |
| ------------ | --------------- | ---------- | ------------------- |
@@ -120,12 +116,14 @@ This framing clarifies what makes each initiative unique:
| physical isolation | 1 (typically) | permanent | workload routing |

All three share a common foundation: the ability to run multiple `environmentd`
- instances that can read/write shared state concurrently.
+ instances that can read/write shared state concurrently. The work proposed in
+ this document directly builds that foundation, making it a stepping stone toward
+ the full vision rather than a parallel effort.

- ## Proposal
+ ## Proposal (lame-duck upgrades)

- To reduce downtime during upgrades further, I propose that we change from our
- current upgrade procedure to this flow:
+ As the next incremental milestone for zero-downtime upgrades, I propose that we
+ change from our current upgrade procedure to this flow:

1. New `environmentd` starts with higher `deploy_generation`/`version`
2. Boots in read-only mode: opens catalog in read-only mode, spawns `clusterd`
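The generation-based fencing implied by these first steps can be sketched roughly as follows. This is a minimal illustration of the idea under stated assumptions, not Materialize's actual API; the types and names (`Catalog`, `Mode`, `open`) are hypothetical:

```rust
// Hypothetical model of deploy_generation fencing: the durable catalog
// records the generation of its current writer. A newer `environmentd`
// may open the catalog read-only while the old one still serves traffic;
// promoting itself to read-write bumps the recorded generation, which
// fences out any writer at an older generation.

#[derive(PartialEq)]
enum Mode {
    ReadOnly,
    ReadWrite,
}

struct Catalog {
    deploy_generation: u64,
}

impl Catalog {
    fn open(&mut self, generation: u64, mode: Mode) -> Result<(), String> {
        if generation < self.deploy_generation {
            // Older generations are fenced out entirely.
            return Err(format!(
                "fenced: catalog is at generation {}",
                self.deploy_generation
            ));
        }
        if mode == Mode::ReadWrite {
            // Taking over as the writer fences out older writers.
            self.deploy_generation = generation;
        }
        Ok(())
    }
}

fn main() {
    let mut catalog = Catalog { deploy_generation: 1 };

    // The new version boots read-only alongside the old writer.
    assert!(catalog.open(2, Mode::ReadOnly).is_ok());
    assert_eq!(catalog.deploy_generation, 1);

    // Cutover: the new version promotes itself to read-write...
    assert!(catalog.open(2, Mode::ReadWrite).is_ok());

    // ...and the old generation can no longer write.
    assert!(catalog.open(1, Mode::ReadWrite).is_err());
}
```

The important property is that read-only access at a newer generation never disturbs the current writer, so the old version keeps serving queries until cutover.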
@@ -157,25 +155,16 @@ version.

Implicit in this proposal is that we initially still don't want to support DDL
during the upgrade window. In addition to all the proposed work, this would
- require two additional large pieces of engineering work that I think are hard
- enough that we want to postpone them and instead deliver this incremental
- improvement to our upgrade procedure. I will expand on this future work below.
+ require two additional pieces of engineering work. The lame-duck approach lets
+ us deliver immediate value while building the foundation those pieces will need.
+ I expand on this future work below.

The proposed upgrade flow requires a number of changes across different
components. I will sketch these below, but each of the sub-sections will
require a small-ish design document of its own or at the very least a thorough
GitHub issue.

- The required work is sectioned into work that is unique to zero-downtime
- upgrades, work that is shared with the other initiatives, and then lastly I
- will describe the work that is unique to future initiatives. The latter will be
- very sparse, because the focus of this document is the more immediate
- improvements to zero-downtime upgrades.
-
- ## Work required for Zero-Downtime Upgrades (for DML/DQL)
-
- This section and the one about the shared foundation work describe the work
- that is required for this proposal.
+ ## Work required for Zero-Downtime Upgrades (lame-duck upgrades)

### Lame-Duck `environmentd` at Old Version

@@ -239,7 +228,8 @@ approach.
## Foundation Work (Required for This Proposal)

The following changes are required for this proposal and form the foundation
- for High Availability and Physical Isolation:
+ for working towards the full goal of zero-downtime upgrades, high
+ availability, and physical isolation:

### Get Builtin Tables Ready for Concurrent Writers

@@ -300,7 +290,7 @@ down on the lame-duck deployment to give the new version room to work.

TODO: Figure out how big the impact of the above-mentioned squabbling would be.

- ### Critical Persist Handles with Concurrent Environmentd Instances
+ ### Critical Persist Handles for Concurrent Environmentd Instances

`StorageCollections`, the component that is responsible for managing the
critical since handles of storage collections, currently has a single critical
@@ -358,7 +348,7 @@ architecture](20231127_pv2_uci_logical_architecture.md), and there is ongoing
work towards allowing the adapter to subscribe to catalog changes and apply
their implications.

- ## Work required for Zero-Downtime Upgrades (including DDL)
+ ## Work required for Zero-Downtime Upgrades (full vision)

Beyond the ability for multiple actors to work with the durable catalog, for
zero-downtime upgrades with DDL we need the ability for _multiple versions_ to
@@ -411,17 +401,18 @@ true zero downtime.

### True zero-downtime for everything, including DDL

- Lame-duck mode is a "deviation" from the end goal of true zero-downtime
- upgrades for all kinds of commands. One could say that we should instead try
- and deliver that end goal immediately.
+ One could say that we should skip lame-duck mode and deliver the full vision of
+ zero-downtime upgrades for all commands immediately.

- I think the proposal as described above is more feasible in the short term.
- True zero-downtime upgrades including DDL require us to solve two hard
- engineering problems:
+ I think the lame-duck proposal is more feasible in the short term, and it
+ directly builds toward that goal by establishing the foundation of concurrent
+ `environmentd` instances. True zero-downtime upgrades including DDL require us
+ to solve two additional hard engineering problems:
- subscribing to and applying catalog changes
- forward/backward compatibility for the catalog

- I think these are hard and we should focus on incrementally delivering value.
+ These are hard problems, and the lame-duck approach lets us deliver value
+ incrementally while making progress on the shared foundation.

## Open Questions

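To make the second problem concrete, one common approach to forward compatibility is to keep fields an older version does not understand opaque and carry them through unchanged on rewrite. The sketch below illustrates that idea only; it is not Materialize's actual catalog encoding, and `CatalogRecord`, `decode`, and `encode` are hypothetical names:

```rust
use std::collections::BTreeMap;

// Illustrative model of forward compatibility for durable catalog records:
// an older version must be able to read records written by a newer version
// without losing the newer version's data when it rewrites them.

struct CatalogRecord {
    // Fields this version understands.
    known: BTreeMap<String, String>,
    // Fields written by a newer version; preserved verbatim.
    unknown: BTreeMap<String, String>,
}

fn decode(fields: BTreeMap<String, String>, known_keys: &[&str]) -> CatalogRecord {
    // Split fields into those we understand and those we merely carry along.
    let (known, unknown): (BTreeMap<String, String>, BTreeMap<String, String>) =
        fields
            .into_iter()
            .partition(|(k, _)| known_keys.contains(&k.as_str()));
    CatalogRecord { known, unknown }
}

fn encode(record: &CatalogRecord) -> BTreeMap<String, String> {
    // Re-emit both known and unknown fields so a newer reader loses nothing.
    let mut out = record.known.clone();
    out.extend(record.unknown.clone());
    out
}

fn main() {
    // A record written by version N+1, with a field version N doesn't know.
    let mut written = BTreeMap::new();
    written.insert("name".to_string(), "my_table".to_string());
    written.insert("new_v2_option".to_string(), "enabled".to_string());

    // Version N decodes it, understanding only "name"...
    let record = decode(written.clone(), &["name"]);
    assert_eq!(record.known.get("name").unwrap(), "my_table");
    assert_eq!(record.unknown.len(), 1);

    // ...and a rewrite round-trips the newer version's field unchanged.
    assert_eq!(encode(&record), written);
}
```

Backward compatibility (a newer version reading old records) then amounts to providing defaults for fields the old writer never emitted; both directions need to hold for two versions to share the catalog.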