Skip to content

Add serverless operator, proxy, and TLA+ trace validation#36

Merged
ajroetker merged 25 commits intomainfrom
feature/serverless-operator
Apr 1, 2026
Merged

Add serverless operator, proxy, and TLA+ trace validation#36
ajroetker merged 25 commits intomainfrom
feature/serverless-operator

Conversation

@ajroetker
Copy link
Copy Markdown
Contributor

Summary

  • Serverless proxy (pkg/proxy): New standalone Go submodule implementing a multi-tenant reverse proxy with bearer-token auth, row-level filtering via bleve query injection, dynamic route catalog, and table-first public API normalization. Licensed under Elastic License 2.0.
  • Serverless operator (pkg/operator): New AntflyServerlessProject CRD and controller that reconciles a full project stack (Antfly StatefulSet, Termite Deployment, Proxy Deployment, Services, ConfigMaps, Secrets). Includes validation webhook, generated install bundle, and kustomize examples.
  • Row-level auth: Bleve query filter injection in usermgr so API-key scopes restrict search results at the storage layer.
  • TLA+ trace validation: Build-tag-gated instrumentation (tla_trace) that logs Raft transaction events and a TLA+ spec (specs/tla/TraceAntflyTransaction.tla) verified by TLC against recorded traces. CI workflows for spec checking and trace validation.
  • CI: Proxy container build/push workflow, Go lint/test workflow, TLA+ check and trace workflows.

Test plan

  • pkg/proxy unit tests pass (go test ./pkg/proxy/...)
  • pkg/operator controller and webhook tests pass (cd pkg/operator && go test ./...)
  • src/usermgr row-filter tests pass (go test ./src/usermgr/...)
  • TLA+ spec model-checks cleanly (make tla-check)
  • E2E: deploy serverless project stack via operator example manifests

Also replace the stale Apache 2.0 header on catalog.go with ELv2.
Add NewGatewayFromConfig constructor that accepts custom Authenticator,
Authorizer, and BackendForwarder implementations, enabling Colony to
embed pkg/proxy directly rather than duplicating routing and auth logic.
Implement document-level security filtering across all query paths by
ANDing a security filter query with every request's FilterQuery.

- API keys carry per-table row_filter (map of table→bleve query JSON)
- Casbin p2 named policies store role-based row filters, resolved lazily
  for Basic auth via sync.Once per request
- Proxy layer injects filters into NDJSON and retrieval agent bodies
- usermgr CRUD endpoints for managing row filters per user/role
- Filter injection covers full-text search, vector search, graph queries,
  foreign table SQL WHERE, and document lookup by key
pkg/proxy has zero internal antfly imports (stdlib only), so it can be
its own module for Colony to import without pulling the entire antfly
dependency tree.
Enable etcd/raft's with_tla trace logging by implementing a TraceLogger
in src/tracing/ that emits ndjson via zap. The nop variant compiles out
to nil for zero overhead in default builds.

Add Makefile targets for TLC model checking (tla-check-txn, tla-check-split,
tla-check-snap) and raft trace validation (tla-trace-raft). Helper scripts
handle TLA+ tool downloads and trace preprocessing.
Wire AntflyTraceWriter into the transaction protocol so that builds
with -tags with_tla emit ndjson trace events at each TLA+ action
boundary (InitTransaction, WriteIntent, CommitTransaction, etc.).

Add TraceAntflyTransaction.tla which replays these traces against the
AntflyTransaction spec, checking safety invariants (atomicity, OCC
serialization, no orphaned intents, LWW consistency) on real execution
paths.

Phases implemented:
- Phase 3: src/store/db/tla_trace.go (7 trace helpers), metadata-level
  tracing, traceWriter field on DBImpl and MetadataStore
- Phase 4: TraceAntflyTransaction.tla + .cfg for trace validation,
  tla-trace-txn Makefile target
- Phase 5: CI workflows for model checking (per-PR) and trace
  validation (weekly)

Zero overhead in production builds — all nop stubs compile to empty
functions without the with_tla tag.
ResolveRoute was called with context.Background(), discarding any
context enrichment done by the Authenticator. This caused callers
like Colony's CloudCatalog to miss cached instance info and make
redundant lookups. Add context.Context parameter to Resolve and
ResolveBackend so the enriched request context flows through.
Review of the trace validation pipeline found several critical bugs and
simplification opportunities:

Bugs fixed:
- traceShardID() returned nodeID instead of shardID (wrong return order
  from ParseStorageDBDir)
- Metadata traceCheckPredicates used uuid.String() (dashed) while store
  events used hex.EncodeToString (no dashes), causing TxnID mismatch
- traceRecoveryAbort emitted non-empty ShardID, breaking the TLA+ spec's
  RecoveryAutoAbortIfLogged guard that requires shardId=""
- TraceAntflyTransaction.cfg used hardcoded MC model values (t1, t2, s1,
  s2) that never match real UUID/hex strings in traces
- CI hashFiles('/tmp/...') always returned empty (outside workspace)
- CI -run TestTransaction matched no tests (should be TestHarness_Transaction)
- preprocess_trace sort destroyed chronological order for txn traces
- Missing CHECK_DEADLOCK FALSE in trace cfg

Trace constant derivation:
- All 9 TLA+ constants now derived from trace events (TraceTxns,
  TraceShards, TraceKeys, TraceTxnShards, TraceTxnKeys, TraceTxnReadSet,
  TraceTxnCoord, TraceMaxTimestamp, TraceStalePendingThreshold)
- WriteIntentOnShard/WriteIntentFails events now emit full key data
  (writeKeys, deleteKeys, predicateKeys) instead of just counts
- No MC module or model values needed for trace validation

Simplifications:
- Extracted traceTxnEvent helper (eliminates 7x nil-guard boilerplate)
- Cached traceShardID() result (was re-parsing db.dir every call)
- Replaced encoding/hex with types.FormatKey (project convention)
- ShardID JSON tag changed from omitempty to always-present
- tla-check.yml now uses self-hosted CodeBuild runner
- tla-validate-trace.sh: added -S flag (skip sort), fixed xargs quoting
tla-tools.sh uses bash features (pipefail, local, &>/dev/null) that
fail when sourced from /bin/sh, which is make's default shell.
Move the proxy entrypoint from an antfly subcommand
(cmd/antfly/cmd/proxy.go) to a standalone binary at
pkg/proxy/cmd/antfly-proxy/, matching the termite-proxy pattern.
Update Dockerfile, CI, and docs accordingly.
- Fix Dockerfile.proxy to WORKDIR into the proxy module so go.mod is found
- Remove docker build from Go CI workflow (belongs in container workflow)
- Clean up container workflow (no submodules/Docker Hub needed)
- Trim Go CI path triggers to only pkg/proxy/**
- Suppress gosec G101 false positives on env var and resource names
Restore pkg/operator/ and README.md to their main branch state.
The serverless operator (AntflyServerlessProject CRD, controller,
install bundle, examples) will land in a dedicated PR.
@ajroetker ajroetker merged commit bd004bf into main Apr 1, 2026
5 checks passed
@ajroetker ajroetker deleted the feature/serverless-operator branch April 1, 2026 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant