
persist: Add Criterion benchmarks for envelope encryption overhead #35338

Closed
jasonhernandez wants to merge 11 commits into main from
kms-envelope-encryption-consensus-svc

Conversation

@jasonhernandez
Contributor

Summary

  • Adds Criterion micro-benchmarks for the hot-path AES-256-GCM crypto functions (encrypt_with_dek, decrypt_with_key, parse_envelope) across 4 payload sizes (256B, 4KiB, 64KiB, 1MiB)
  • Introduces a lib.rs to expose crypto internals to the benchmark harness, following the persist-client Criterion pattern
  • Confirms the <5µs claim for typical WAL batch sizes: 4KiB encrypt ~3.3µs, decrypt ~0.8µs, full roundtrip ~3.8µs
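The Criterion harness itself lives in the repo; as a self-contained illustration of the measurement loop, here is a std-only manual-timing analogue. The `encrypt_stub` XOR "cipher" is a hypothetical stand-in for `encrypt_with_dek` so the sketch compiles without the real AES-256-GCM dependency:

```rust
use std::time::Instant;

// Hypothetical stand-in for `encrypt_with_dek`: a trivial XOR keystream so
// the harness is self-contained without the real AES-256-GCM code.
fn encrypt_stub(key: u8, payload: &[u8]) -> Vec<u8> {
    payload.iter().map(|b| b ^ key).collect()
}

fn main() {
    // The same payload sizes the Criterion benchmarks sweep.
    for (label, size) in [("256B", 256usize), ("4KiB", 4 << 10), ("64KiB", 64 << 10), ("1MiB", 1 << 20)] {
        let payload = vec![0xABu8; size];
        let iters = 1_000u32;
        let mut sink = 0u8;
        let start = Instant::now();
        for _ in 0..iters {
            let ct = encrypt_stub(0x5A, &payload);
            // Fold output into `sink` so the optimizer cannot elide the work
            // (Criterion's `black_box` plays this role in the real benches).
            sink ^= ct[0];
        }
        let per_op = start.elapsed() / iters;
        println!("{label}: {per_op:?} per op (sink={sink})");
    }
}
```

Criterion adds warm-up, outlier rejection, and throughput reporting on top of this basic loop, which is why the real benchmark numbers below are the ones to trust.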

Benchmark Results

crypto/encrypt/256B     time: [2.92 µs]    thrpt: [83 MiB/s]
crypto/encrypt/4KiB     time: [3.33 µs]    thrpt: [1.14 GiB/s]
crypto/encrypt/64KiB    time: [14.5 µs]    thrpt: [4.2 GiB/s]
crypto/encrypt/1MiB     time: [143 µs]     thrpt: [6.8 GiB/s]

crypto/decrypt/256B     time: [226 ns]     thrpt: [1.1 GiB/s]
crypto/decrypt/4KiB     time: [813 ns]     thrpt: [4.7 GiB/s]
crypto/decrypt/64KiB    time: [11.9 µs]    thrpt: [5.1 GiB/s]
crypto/decrypt/1MiB     time: [144 µs]     thrpt: [6.8 GiB/s]

crypto/roundtrip/256B   time: [2.70 µs]    thrpt: [90 MiB/s]
crypto/roundtrip/4KiB   time: [3.83 µs]    thrpt: [1.0 GiB/s]
crypto/roundtrip/64KiB  time: [26.2 µs]    thrpt: [2.3 GiB/s]
crypto/roundtrip/1MiB   time: [325 µs]     thrpt: [3.0 GiB/s]

Test plan

  • cargo check -p mz-persist-consensus-svc --benches compiles
  • cargo test -p mz-persist-consensus-svc — all 34 tests pass
  • cargo bench -p mz-persist-consensus-svc — all 12 benchmarks produce results
  • Throughput numbers confirm <5µs for typical 4KiB WAL batches

🤖 Generated with Claude Code

pH14 and others added 11 commits March 5, 2026 08:01
Introduces two new components to batch independent cross-shard CAS writes
into a single durable S3 Express One Zone PUT per flush interval, making
cost O(1/batch_window) instead of O(shards):

- `RpcConsensus` (gRPC client implementing the `Consensus` trait)
- `persist-consensus-svc` (group commit service with actor-based state
  machine, WAL, snapshot, and recovery)

The actor processes CAS, truncate, scan, head, and list_keys commands on
a single thread. Writes are batched and flushed to S3 WAL periodically,
with snapshots every N batches for bounded recovery.
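
A std-only sketch of the group-commit loop described above (`CasWrite`, `run_actor`, and the 20ms window are illustrative names; the real actor also handles truncate, scan, head, and list_keys, and flushes to S3 rather than a closure):

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::time::Duration;

// Hypothetical command type; the real actor carries full CAS semantics.
#[allow(dead_code)]
struct CasWrite { shard: String, payload: Vec<u8> }

// Accumulate independent cross-shard writes and emit one batch per flush
// window, making PUT cost O(1/batch_window) instead of O(shards).
fn run_actor(rx: Receiver<CasWrite>, flush_window: Duration, mut flush: impl FnMut(Vec<CasWrite>)) {
    let mut batch = Vec::new();
    loop {
        match rx.recv_timeout(flush_window) {
            Ok(cmd) => batch.push(cmd),
            Err(RecvTimeoutError::Timeout) => {
                if !batch.is_empty() {
                    flush(std::mem::take(&mut batch));
                }
            }
            Err(RecvTimeoutError::Disconnected) => {
                // Drain whatever is pending, then shut down.
                if !batch.is_empty() { flush(batch); }
                return;
            }
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    // 50 independent shard writes arrive within one flush window...
    for i in 0..50 {
        tx.send(CasWrite { shard: format!("s{i}"), payload: vec![0u8; 8] }).unwrap();
    }
    drop(tx);
    let mut puts = 0;
    run_actor(rx, Duration::from_millis(20), |batch| {
        puts += 1;
        println!("one S3 PUT carrying {} writes", batch.len());
    });
    assert_eq!(puts, 1); // ...and leave as a single durable PUT
}
```
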

Includes 27 unit tests covering CAS semantics, group commit batching,
read operations, truncate, WAL integration, and snapshot intervals.
- Switch S3 client to mz_aws_util for proper HTTP client and
  virtual-hosted-style addressing (fixes S3 Express compatibility)
- Wrap runtime in LocalSet for spawn_local support
- Add info-level request logging to gRPC handlers and flush
- Replace explicit test shutdown calls with Drop-based cleanup
- Change default flush interval to 20ms, listen port to 6890
- Fix pyactivate to pin Python 3.13 for confluent-kafka wheel compat
- Prometheus metrics for the consensus service: operation counters
  (CAS committed/rejected, head, scan, truncate), S3 write counters
  and latency histograms (power-of-2 buckets from 1ms to 5s), flush
  histograms (ops/batch, shards/batch, latency), in-memory state
  gauges (active shards, entries, bytes). Served via axum HTTP on
  port 6891.

- Fix S3 retry logic: introduce WalWriteError enum that distinguishes
  Failed from AlreadyExists (412 PreconditionFailed /
  ConditionalRequestConflict). On retry, AlreadyExists means the
  original write landed — treat as success instead of error.
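
A sketch of that error classification (the names `WalWriteError` and `write_with_retry` mirror the commit message but the shape here is illustrative); the key point is that AlreadyExists observed on a retry means the original conditional PUT landed, so it maps to success:

```rust
#[derive(Debug, PartialEq)]
enum WalWriteError {
    // Transient failure: safe to retry the same conditional PUT.
    Failed,
    // 412 PreconditionFailed / ConditionalRequestConflict: the object
    // already exists, i.e. a previous attempt won the write.
    AlreadyExists,
}

fn write_with_retry(
    mut attempt: impl FnMut() -> Result<(), WalWriteError>,
    max_retries: u32,
) -> Result<(), WalWriteError> {
    for _ in 0..=max_retries {
        match attempt() {
            Ok(()) => return Ok(()),
            // Our earlier attempt succeeded before its response was lost.
            Err(WalWriteError::AlreadyExists) => return Ok(()),
            Err(WalWriteError::Failed) => continue,
        }
    }
    Err(WalWriteError::Failed)
}

fn main() {
    // Simulate: the first attempt times out (Failed); the retry then sees
    // AlreadyExists because the first PUT actually landed.
    let mut calls = 0;
    let result = write_with_retry(
        || {
            calls += 1;
            if calls == 1 { Err(WalWriteError::Failed) } else { Err(WalWriteError::AlreadyExists) }
        },
        3,
    );
    assert_eq!(result, Ok(()));
    println!("write succeeded after {calls} attempts");
}
```
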

- Grafana dashboard (auto-provisioned): 6 panels showing S3 PUTs/s
  (the money chart — stays flat), active shards, ops/s by type,
  S3 PUT latency p50/p99, ops per batch, and in-memory state.

- Prometheus scrape config for the consensus service.

- Demo harness: demo.sh creates a PG source with background inserts,
  then stages 5→20→50→100 materialized views to show shards scaling
  while S3 writes stay flat. cleanup.sh tears it all down.
…noise

- Set active_shards/total_entries/approx_bytes gauges in Actor::new()
  so metrics reflect recovered state immediately instead of showing
  zeros until the first flush.
- Rewrite demo.sh to be fully self-contained: auto-detects and starts
  Postgres (brew 14-18), creates database/table/publication, and
  cleans up on exit.
- Downgrade per-request gRPC handler logging from info to debug to
  reduce noise under load.
- Split S3 latency into separate WAL (p50/p99/p99.99) and Snapshot
  panels; add S3 bytes written/s panel; anchor S3 Writes/s y-axis at 0
- Move Grafana from port 3000 to 3001 to free port for Console
- Infinite retry with exponential backoff on WAL writes — only Ok and
  AlreadyExists are definite results, transient failures never
  propagate to clients
- Add 10-second stats heartbeat log to actor for demo visibility
- Demo: single-row UPDATE at 20Hz instead of INSERTs; trivial MVs
  (upper(val::text)); scale to 200 MVs; clean up stale PG state
- Add one-pager and demo talk track documentation
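
The "only Ok and AlreadyExists are definite results" policy above can be sketched as a std-only loop; `PutOutcome`, `write_durably`, and the 10ms-to-1s backoff bounds are illustrative, and the sleep is injected so the logic is testable:

```rust
use std::time::Duration;

// Possible outcomes of one conditional PUT attempt.
enum PutOutcome { Ok, AlreadyExists, Transient }

// Retry forever with capped exponential backoff; transient failures never
// propagate to clients, only the two definite outcomes end the loop.
fn write_durably(mut attempt: impl FnMut() -> PutOutcome, mut sleep: impl FnMut(Duration)) {
    let mut backoff = Duration::from_millis(10);
    loop {
        match attempt() {
            // AlreadyExists means an earlier attempt's PUT landed: success.
            PutOutcome::Ok | PutOutcome::AlreadyExists => return,
            PutOutcome::Transient => {
                sleep(backoff);
                backoff = (backoff * 2).min(Duration::from_secs(1));
            }
        }
    }
}

fn main() {
    // Three transient failures, then success; record the backoff schedule.
    let mut fails = 3;
    let mut slept = Vec::new();
    write_durably(
        || if fails > 0 { fails -= 1; PutOutcome::Transient } else { PutOutcome::Ok },
        |d| slept.push(d),
    );
    assert_eq!(
        slept,
        vec![Duration::from_millis(10), Duration::from_millis(20), Duration::from_millis(40)]
    );
    println!("retried {} times with backoff {:?}", slept.len(), slept);
}
```
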
Reword "The Insight" to "The Approach", remove mid-paragraph bolding,
add in-memory serving note, add S3 prefix isolation, add "What We Did
Not Build" section, add "Wait a second..." closing.
Add optional AES-256-GCM envelope encryption for all S3 WAL batches and
snapshots. A KMS-derived Data Encryption Key (DEK) is cached in memory
for fast local encryption; the KMS-wrapped copy is stored per-object so
each object is self-contained for decryption. DEK rotates in the
background on a configurable interval.

Enabled via --kms-key-id; when unset, data passes through unencrypted
(backward compatible). Decrypt uses a fast path when the wrapped DEK
matches the cached key, avoiding KMS calls during normal operation.
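
A self-contained sketch of what a per-object envelope and its parse might look like; the layout, field order, and lengths here are assumptions for illustration (the real parse_envelope defines the actual on-disk format), but it shows why each object is self-contained and how the fast path avoids KMS:

```rust
// Hypothetical envelope layout:
//   [u16 wrapped-DEK length][KMS-wrapped DEK][12-byte GCM nonce][ciphertext+tag]
#[derive(Debug)]
struct Envelope<'a> {
    wrapped_dek: &'a [u8],
    nonce: &'a [u8; 12],
    ciphertext: &'a [u8],
}

fn parse_envelope(buf: &[u8]) -> Option<Envelope<'_>> {
    let dek_len = u16::from_be_bytes([*buf.get(0)?, *buf.get(1)?]) as usize;
    let wrapped_dek = buf.get(2..2 + dek_len)?;
    let nonce: &[u8; 12] = buf.get(2 + dek_len..2 + dek_len + 12)?.try_into().ok()?;
    let ciphertext = buf.get(2 + dek_len + 12..)?;
    Some(Envelope { wrapped_dek, nonce, ciphertext })
}

fn main() {
    // Build an envelope with a placeholder for the KMS-wrapped blob.
    let wrapped = b"kms-wrapped-dek";
    let mut buf = Vec::new();
    buf.extend_from_slice(&(wrapped.len() as u16).to_be_bytes());
    buf.extend_from_slice(wrapped);
    buf.extend_from_slice(&[0u8; 12]); // nonce
    buf.extend_from_slice(b"ciphertext-and-tag");

    let env = parse_envelope(&buf).expect("well-formed envelope");
    assert_eq!(env.wrapped_dek, wrapped);
    assert_eq!(env.ciphertext, b"ciphertext-and-tag");
    // Fast path: the object's wrapped DEK matches the cached wrapped copy,
    // so the cached plaintext DEK can be reused without a KMS Decrypt call.
    let cached_wrapped: &[u8] = wrapped;
    assert!(env.wrapped_dek == cached_wrapped);
    println!("parsed envelope: {}B wrapped DEK, {}B ciphertext", env.wrapped_dek.len(), env.ciphertext.len());
}
```

Storing the wrapped DEK per object is what makes a rotated DEK safe: old objects still carry the wrapped key that encrypted them, and only a cache miss forces a KMS unwrap.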

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add micro-benchmarks for the hot-path AES-256-GCM crypto functions
(encrypt_with_dek, decrypt_with_key, parse_envelope) to quantify
encryption overhead and catch regressions. Benchmarks cover 256B,
4KiB, 64KiB, and 1MiB payloads — confirming <5µs for typical WAL
batch sizes (4KiB encrypt ~3.3µs, roundtrip ~3.8µs).

Introduces a lib.rs to expose crypto internals to the benchmark
harness, following the persist-client Criterion pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Mar 6, 2026

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

