Commit df05e5c

committed
update flow, v2
1 parent ebf5c13 commit df05e5c

File tree

1 file changed: +49, -58 lines


doc/developer/design/20251219_more_zero_downtime_upgrades_physical_isolation_high_availability.md renamed to doc/developer/design/20251219_zero_downtime_upgrades_physical_isolation_high_availability.md

Lines changed: 49 additions & 58 deletions
@@ -1,27 +1,28 @@
-# More Zero-Downtime Upgrades, Physical Isolation, High Availability
-
-We currently do zero-downtime upgrades, where we hydrate an `environmentd` and
-its `clusterd` processes at a new version before cutting over to that version.
-This makes it so that computation running on clusters is ready when we cut over
-and there is no downtime in processing, no lag in freshness. However, due to an
-implementation detail of how cutting over from one version to the next works
-(see context below) we have an interval of 10–30 seconds where an environment
-is not reachable by the user.
-
-This document proposes a design where we can achieve true zero-downtime
-upgrades for DML and DQL queries, with the caveat that we still have a window
-(about 10-30 seconds) where users cannot issue DDL.
-
-The long-term goal is to remove the caveats from zero-downtime upgrades and to
-remove downtime at all moments. The latter is usually tracked under the
-separate initiatives of _high availability_ and _physical isolation_. However,
-the engineering work that they all require overlaps and they thematically fit
-together so we also talk about them briefly below.
-
-The focus of this document is on improving zero-downtime upgrades in the short
-term, but I will describe below what the other initiatives entail, what
-engineering work is unique to each and what work overlaps, and sketch how we
-can achieve the long-term goal.
+# Zero-Downtime Upgrades, Physical Isolation, and High Availability
+
+Zero-downtime upgrades, high availability, and physical isolation are related
+goals that all require multiple `environmentd` instances to coexist and
+collaborate on shared durable state. One key insight connects them:
+
+**Zero-downtime upgrades are high availability across two versions.**
+
+The ultimate goal for zero-downtime upgrades is to have two versions of
+Materialize running concurrently, serving traffic with no downtime, where we
+can cut over network routes when we're confident the new version is ready. This
+also enables us to roll out a new version, observe its behavior, and abandon
+the update if something isn't right -- true flexibility in version management.
+
+This document proposes an incremental step toward that vision: a "lame-duck"
+upgrade procedure that achieves true zero-downtime for DML and DQL queries.
+This is not a detour from the end goal but a direct stepping stone: it requires
+us to build the foundational capabilities that all three initiatives need --
+running multiple `environmentd` instances with concurrent access to shared
+state.
+
+We currently have a brief window (10-30 seconds) where an environment is
+unreachable during upgrades (see Context for details). This proposal eliminates
+that unreachability for DML and DQL, with DDL seeing a brief window where it
+cannot proceed.
 
 ## Goals
 
@@ -105,13 +106,8 @@ availability and zero-downtime upgrades.
 
 ## Conceptual Framework: How These Initiatives Relate
 
-Zero-downtime upgrades, high availability, and physical isolation all require
-multiple `environmentd` instances to coexist and collaborate on shared durable
-state (the catalog). One neat insight is:
-
-**Zero-downtime upgrades are high availability across two versions.**
-
-This framing clarifies what makes each initiative unique:
+Given that zero-downtime upgrades are high availability across two versions,
+this framing clarifies what makes each initiative unique:
 
 | Initiative | # of Versions | Duration | Primary Challenge |
 |------------|---------------|----------|-------------------|
@@ -120,12 +116,14 @@ This framing clarifies what makes each initiative unique:
 | physical isolation | 1 (typically) | permanent | workload routing |
 
 All three share a common foundation: the ability to run multiple `environmentd`
-instances that can read/write shared state concurrently.
+instances that can read/write shared state concurrently. The work proposed in
+this document directly builds that foundation, making it a stepping stone toward
+the full vision rather than a parallel effort.
 
-## Proposal
+## Proposal (lame-duck upgrades)
 
-To reduce downtime during upgrades further, I propose that we change from our
-current upgrade procedure to this flow:
+As the next incremental milestone for zero-downtime upgrades, I propose that we
+change from our current upgrade procedure to this flow:
 
 1. New `environmentd` starts with higher `deploy_generation`/`version`
 2. Boots in read-only mode: opens catalog in read-only mode, spawns `clusterd`
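The cutover mechanics behind this flow can be modeled as a small sketch. All names here (`Catalog`, `Environmentd`, `FencedError`, integer generations) are hypothetical illustrations, not Materialize's actual types: a new generation boots read-only alongside the old one, hydrates, and on promotion fences writes from the older `deploy_generation`, leaving the old process as a read-only lame duck.

```python
# Toy model of generation-based cutover (hypothetical names, not
# Materialize's real API).

class FencedError(Exception):
    pass

class Catalog:
    """Toy durable catalog: writes are fenced by deploy_generation."""
    def __init__(self):
        self.leader_generation = 1
        self.entries = {}

    def write(self, generation, key, value):
        # A writer at an older generation than the current leader is fenced.
        if generation < self.leader_generation:
            raise FencedError(f"generation {generation} is fenced")
        self.entries[key] = value

    def promote(self, generation):
        # Cutover: the new generation becomes leader, fencing the old one.
        self.leader_generation = max(self.leader_generation, generation)

class Environmentd:
    def __init__(self, catalog, generation):
        self.catalog = catalog
        self.generation = generation
        self.read_only = True
        self.hydrated = False

    def boot_read_only(self):
        # Step 2 of the flow: open the catalog read-only and hydrate.
        self.read_only = True
        self.hydrated = True

    def take_over(self):
        # Once hydrated, become leader and allow writes.
        assert self.hydrated
        self.catalog.promote(self.generation)
        self.read_only = False

catalog = Catalog()
old = Environmentd(catalog, generation=1)
old.read_only = False  # the old generation is the current writer

new = Environmentd(catalog, generation=2)
new.boot_read_only()   # runs alongside the old version, no writes yet
new.take_over()        # cutover: the old generation is now fenced

catalog.write(new.generation, "table", "t1")      # new leader can write
try:
    catalog.write(old.generation, "table", "t2")  # old version is fenced
except FencedError:
    old.read_only = True  # lame duck: reads only from here on
```

The point of the sketch is that the old instance keeps serving reads (DML/DQL against hydrated clusters) even after being fenced out of catalog writes.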
@@ -157,25 +155,16 @@ version.
 
 Implicit in this proposal is that we initially still don't want to support DDL
 during the upgrade window. In addition to all the proposed work, this would
-require two additional large pieces of engineering work that I think are hard
-enough that we want to postpone them and instead deliver this incremental
-improvement to our upgrade procedure. I will expand on this future work below.
+require two additional pieces of engineering work. The lame-duck approach lets
+us deliver immediate value while building the foundation those pieces will need.
+I expand on this future work below.
 
 The proposed upgrade flow requires a number of changes across different
 components. I will sketch these below, but each of the sub-sections will
 require a small-ish design document of its own or at the very least a thorough
 GitHub issue.
 
-The required work is sectioned into work that is unique to zero-downtime
-upgrades, work that is shared with the other initiatives, and then lastly I
-will describe the work that is unique to future initiatives. The latter will be
-very sparse, because the focus of this document is the more immediate
-improvements to zero-downtime upgrades.
-
-## Work required for Zero-Downtime Upgrades (for DML/DQL)
-
-This section and the one about the shared foundation work describe that work
-that is required for this proposal.
+## Work required for Zero-Downtime Upgrades (lame-duck upgrades)
 
 ### Lame-Duck `environmentd` at Old Version
 
@@ -239,7 +228,8 @@ approach.
 ## Foundation Work (Required for This Proposal)
 
 The following changes are required for this proposal and form the foundation
-for High Availability and Physical Isolation:
+for working towards the full goal for zero-downtime upgrades, high
+availability, and physical isolation:
 
 ### Get Builtin Tables Ready for Concurrent Writers
 
@@ -300,7 +290,7 @@ down on the lame-duck deployment to give the new version room to work.
 
 TODO: Figure out how big the impact of the above-mentioned squabbling would be.
 
-### Critical Persist Handles with Concurrent Environmentd Instances
+### Critical Persist Handles for Concurrent Environmentd Instances
 
 `StorageCollections`, the component that is responsible for managing the
 critical since handles of storage collections, currently has a single critical
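To illustrate the problem space, here is a deliberately simplified sketch (the `Shard` class, `register_critical_reader`, and integer frontiers are all made up for this example; persist's real API is not shown): the invariant is that a collection can only be compacted up to the minimum since across all registered critical handles, so giving each concurrent `environmentd` its own handle prevents data being compacted out from under the slower instance.

```python
# Hypothetical model of per-instance critical since handles; not
# persist's actual interface.

class Shard:
    def __init__(self):
        self.handles = {}   # reader_id -> since frontier (an int here)

    def register_critical_reader(self, reader_id, since=0):
        self.handles[reader_id] = since

    def downgrade_since(self, reader_id, new_since):
        # A handle's since may only move forward.
        self.handles[reader_id] = max(self.handles[reader_id], new_since)

    def compaction_frontier(self):
        # Compaction may not pass any holder's since.
        return min(self.handles.values())

shard = Shard()
shard.register_critical_reader("environmentd-old")
shard.register_critical_reader("environmentd-new")

# The new instance advances its since aggressively; the old (lame-duck)
# instance lags behind. The shard still only compacts to the minimum.
shard.downgrade_since("environmentd-new", 100)
shard.downgrade_since("environmentd-old", 40)
assert shard.compaction_frontier() == 40
```

With a single shared handle instead, whichever instance downgraded it last would decide compaction for both, which is exactly the coordination problem this section is about.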
@@ -358,7 +348,7 @@ architecture](20231127_pv2_uci_logical_architecture.md), and there is ongoing
 work towards allowing the adapter to subscribe to catalog changes and apply
 their implications.
 
-## Work required for Zero-Downtime Upgrades (including DDL)
+## Work required for Zero-Downtime Upgrades (full vision)
 
 Beyond the ability for multiple actors to work with the durable catalog, for
 zero-downtime upgrades with DDL we need the ability for _multiple versions_ to
@@ -411,17 +401,18 @@ true zero downtime.
 
 ### True zero-downtime for everything, including DDL
 
-Lame-duck mode is a "deviation" from the end goal of true zero-downtime
-upgrades for all kinds of commands. One could say that we should instead try
-and deliver that end goal immediately.
+One could say that we should skip lame-duck mode and deliver the full vision of
+zero-downtime upgrades for all commands immediately.
 
-I think the proposal as described above is more feasible in the short term.
-True zero-downtime upgrades including DDL require us to solve two hard
-engineering problems:
+I think the lame-duck proposal is more feasible in the short term, and it
+directly builds toward that goal by establishing the foundation of concurrent
+`environmentd` instances. True zero-downtime upgrades including DDL require us
+to solve two additional hard engineering problems:
 - subscribing to and applying catalog changes
 - forward/backward compatibility for the catalog
 
-I think these are hard and we should focus on incrementally delivering value.
+These are hard problems, and the lame-duck approach lets us deliver value
+incrementally while making progress on the shared foundation.
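The second of those problems, forward/backward compatibility for the catalog, can be illustrated with a deliberately simplified sketch (JSON records with per-version field defaults; Materialize's real catalog encoding is not shown here): an older version must tolerate fields written by a newer one, and a newer version must fill defaults for fields an older one never wrote.

```python
# Hypothetical illustration of version-tolerant catalog records.
import json

def encode(record: dict) -> str:
    return json.dumps(record)

def decode(blob: str, known_fields: dict) -> dict:
    raw = json.loads(blob)
    # Backward compatibility: default fields this version added.
    out = dict(known_fields)
    # Forward compatibility: keep only fields this version understands,
    # silently ignoring ones written by a newer version.
    out.update({k: v for k, v in raw.items() if k in known_fields})
    return out

V1_FIELDS = {"name": "", "id": 0}
V2_FIELDS = {"name": "", "id": 0, "owner": "system"}

# A v2 writer's record is readable by a v1 reader (unknown field dropped)...
v2_blob = encode({"name": "t", "id": 7, "owner": "alice"})
assert decode(v2_blob, V1_FIELDS) == {"name": "t", "id": 7}

# ...and a v1 record is readable by a v2 reader (missing field defaulted).
v1_blob = encode({"name": "t", "id": 7})
assert decode(v1_blob, V2_FIELDS) == {"name": "t", "id": 7, "owner": "system"}
```

Note that silently dropping unknown fields loses data on a round trip through the old version; a production scheme would need to preserve unknown fields, which is part of what makes this problem hard.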
 
 ## Open Questions
 