Add serverless operator, proxy, and TLA+ trace validation#36
Merged
Add serverless operator, proxy, and TLA+ trace validation#36
Conversation
Also replace the stale Apache 2.0 header on catalog.go with ELv2.
Add NewGatewayFromConfig constructor that accepts custom Authenticator, Authorizer, and BackendForwarder implementations, enabling Colony to embed pkg/proxy directly rather than duplicating routing and auth logic.
Implement document-level security filtering across all query paths by ANDing a security filter query with every request's FilterQuery. - API keys carry per-table row_filter (map of table→bleve query JSON) - Casbin p2 named policies store role-based row filters, resolved lazily for Basic auth via sync.Once per request - Proxy layer injects filters into NDJSON and retrieval agent bodies - usermgr CRUD endpoints for managing row filters per user/role - Filter injection covers full-text search, vector search, graph queries, foreign table SQL WHERE, and document lookup by key
pkg/proxy has zero internal antfly imports (stdlib only), so it can be its own module for Colony to import without pulling the entire antfly dependency tree.
Enable etcd/raft's with_tla trace logging by implementing a TraceLogger in src/tracing/ that emits ndjson via zap. The nop variant compiles out to nil for zero overhead in default builds. Add Makefile targets for TLC model checking (tla-check-txn, tla-check-split, tla-check-snap) and raft trace validation (tla-trace-raft). Helper scripts handle TLA+ tool downloads and trace preprocessing.
Wire AntflyTraceWriter into the transaction protocol so that builds with -tags with_tla emit ndjson trace events at each TLA+ action boundary (InitTransaction, WriteIntent, CommitTransaction, etc.). Add TraceAntflyTransaction.tla which replays these traces against the AntflyTransaction spec, checking safety invariants (atomicity, OCC serialization, no orphaned intents, LWW consistency) on real execution paths. Phases implemented: - Phase 3: src/store/db/tla_trace.go (7 trace helpers), metadata-level tracing, traceWriter field on DBImpl and MetadataStore - Phase 4: TraceAntflyTransaction.tla + .cfg for trace validation, tla-trace-txn Makefile target - Phase 5: CI workflows for model checking (per-PR) and trace validation (weekly) Zero overhead in production builds — all nop stubs compile to empty functions without the with_tla tag.
ResolveRoute was called with context.Background(), discarding any context enrichment done by the Authenticator. This caused callers like Colony's CloudCatalog to miss cached instance info and make redundant lookups. Add context.Context parameter to Resolve and ResolveBackend so the enriched request context flows through.
Review of the trace validation pipeline found several critical bugs and
simplification opportunities:
Bugs fixed:
- traceShardID() returned nodeID instead of shardID (wrong return order
from ParseStorageDBDir)
- Metadata traceCheckPredicates used uuid.String() (dashed) while store
events used hex.EncodeToString (no dashes), causing TxnID mismatch
- traceRecoveryAbort emitted non-empty ShardID, breaking the TLA+ spec's
RecoveryAutoAbortIfLogged guard that requires shardId=""
- TraceAntflyTransaction.cfg used hardcoded MC model values (t1, t2, s1,
s2) that never match real UUID/hex strings in traces
- CI hashFiles('/tmp/...') always returned empty (outside workspace)
- CI -run TestTransaction matched no tests (should be TestHarness_Transaction)
- preprocess_trace sort destroyed chronological order for txn traces
- Missing CHECK_DEADLOCK FALSE in trace cfg
Trace constant derivation:
- All 9 TLA+ constants now derived from trace events (TraceTxns,
TraceShards, TraceKeys, TraceTxnShards, TraceTxnKeys, TraceTxnReadSet,
TraceTxnCoord, TraceMaxTimestamp, TraceStalePendingThreshold)
- WriteIntentOnShard/WriteIntentFails events now emit full key data
(writeKeys, deleteKeys, predicateKeys) instead of just counts
- No MC module or model values needed for trace validation
Simplifications:
- Extracted traceTxnEvent helper (eliminates 7x nil-guard boilerplate)
- Cached traceShardID() result (was re-parsing db.dir every call)
- Replaced encoding/hex with types.FormatKey (project convention)
- ShardID JSON tag changed from omitempty to always-present
- tla-check.yml now uses self-hosted CodeBuild runner
- tla-validate-trace.sh: added -S flag (skip sort), fixed xargs quoting
tla-tools.sh uses bash features (pipefail, local, &>/dev/null) that fail when sourced from /bin/sh, which is make's default shell.
Move the proxy entrypoint from an antfly subcommand (cmd/antfly/cmd/proxy.go) to a standalone binary at pkg/proxy/cmd/antfly-proxy/, matching the termite-proxy pattern. Update Dockerfile, CI, and docs accordingly.
- Fix Dockerfile.proxy to WORKDIR into the proxy module so go.mod is found - Remove docker build from Go CI workflow (belongs in container workflow) - Clean up container workflow (no submodules/Docker Hub needed) - Trim Go CI path triggers to only pkg/proxy/** - Suppress gosec G101 false positives on env var and resource names
Restore pkg/operator/ and README.md to their main branch state. The serverless operator (AntflyServerlessProject CRD, controller, install bundle, examples) will land in a dedicated PR.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
pkg/proxy): New standalone Go submodule implementing a multi-tenant reverse proxy with bearer-token auth, row-level filtering via bleve query injection, dynamic route catalog, and table-first public API normalization. Licensed under Elastic License 2.0.pkg/operator): NewAntflyServerlessProjectCRD and controller that reconciles a full project stack (Antfly StatefulSet, Termite Deployment, Proxy Deployment, Services, ConfigMaps, Secrets). Includes validation webhook, generated install bundle, and kustomize examples.usermgrso API-key scopes restrict search results at the storage layer.tla_trace) that logs Raft transaction events and a TLA+ spec (specs/tla/TraceAntflyTransaction.tla) verified by TLC against recorded traces. CI workflows for spec checking and trace validation.Test plan
pkg/proxyunit tests pass (go test ./pkg/proxy/...)pkg/operatorcontroller and webhook tests pass (cd pkg/operator && go test ./...)src/usermgrrow-filter tests pass (go test ./src/usermgr/...)make tla-check)