[NA] [BE] perf: replace FINAL with LIMIT 1 BY in ExperimentItemDAO refs queries #6621
YarivHashaiComet wants to merge 3 commits into main from
Conversation
…fs queries

Production analysis showed `GET_EXPERIMENT_REFS_BY_SPAN_IDS` running ~2,196 times/hour for a single workspace, averaging ~12 s per call and reading ~1.26 TiB/hour. The three sibling queries (`GET_EXPERIMENT_REFS_BY_TRACE_IDS`, `_ITEM_IDS`, `_SPAN_IDS`) all share the same `experiment_items FINAL JOIN experiments FINAL` anti-pattern.

Rewrite each as CTEs that explicitly dedup with `ORDER BY last_updated_at DESC` + `LIMIT 1 BY <full sort key>`. This is semantically equivalent to FINAL on a ReplacingMergeTree (the version column is `last_updated_at` for all three tables) and is already used in TraceDAO.find.

Why FINAL is slow here: it materializes whole rows of both tables for the merge before WHERE/JOIN apply, even though the predicates are highly selective. The CTE form reads only projected columns and streams over the already-sorted primary key.

Validation (workspace 09d8a1ca-…, opik_prod, prod cluster):
- TRACE_IDS: 188 rows; FINAL vs rewrite — bit-for-bit identical
- ITEM_IDS: 20 rows; FINAL vs rewrite — bit-for-bit identical
- SPAN_IDS: 85 rows; FINAL vs rewrite — bit-for-bit identical

Timing on the SPAN_IDS slow path (1 warmup + 5 timed runs each):
- duration median: FINAL 1034 ms -> rewrite 101 ms (10x)
- read_bytes avg: FINAL 228 MiB -> rewrite 33.7 MiB (6.8x less)
- memory avg: FINAL 103 MiB -> rewrite 2.5 MiB (41x less)
- cpu_time avg: FINAL 823 ms -> rewrite 133 ms (6.2x less)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
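A minimal sketch of the rewrite shape this commit describes. The table, column, and sort-key names below are illustrative assumptions, not the verbatim Opik schema, and `:params` stand in for the DAO's bind parameters:

```sql
-- Before (assumed shape): FINAL merges whole rows of every matching part in
-- both tables before the WHERE/JOIN predicates can narrow the scan.
SELECT ei.id, ei.experiment_id, ei.trace_id
FROM experiment_items ei FINAL
JOIN experiments e FINAL ON e.id = ei.experiment_id
WHERE ei.workspace_id = :workspace_id
  AND ei.trace_id IN (:trace_ids);

-- After (assumed shape): dedup each table explicitly. Sorting by
-- last_updated_at DESC and keeping LIMIT 1 BY <full sort key> retains the
-- newest version of each row -- the same answer FINAL computes on a
-- ReplacingMergeTree versioned by last_updated_at.
WITH ei_dedup AS (
    SELECT workspace_id, experiment_id, dataset_item_id, id, trace_id
    FROM experiment_items
    WHERE workspace_id = :workspace_id
      AND trace_id IN (:trace_ids)
    ORDER BY last_updated_at DESC
    LIMIT 1 BY workspace_id, experiment_id, dataset_item_id, id  -- assumed full sort key
),
e_dedup AS (
    SELECT workspace_id, dataset_id, id
    FROM experiments
    WHERE workspace_id = :workspace_id
    ORDER BY last_updated_at DESC
    LIMIT 1 BY workspace_id, dataset_id, id  -- assumed full sort key
)
SELECT ei.id, ei.experiment_id, ei.trace_id
FROM ei_dedup ei
JOIN e_dedup e ON e.id = ei.experiment_id;
```

Because `LIMIT BY` runs on the stream after the global `ORDER BY last_updated_at DESC`, the first row seen for each sort-key value is its newest version, which is exactly what FINAL would have produced.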
📋 PR Linter Failed — ❌ Missing Section. The description is missing required sections.
baz review
…IMIT 1 BY

Same pattern as the three GET_EXPERIMENT_REFS_BY_* constants, on the same hot path (`ExperimentAggregateEventListener.triggerByExperimentIds`). Production sample (last 60 min): 651 calls/hour, avg 11 ms per call, total 2.43 GiB read and avg 5.83 MiB memory per call.

Empirical comparison on prod (workspace b116eddf-…, 1 warmup + 5 timed runs each, interleaved):
- duration: ~7.5 ms -> ~7.2 ms (no meaningful change; small table)
- read_bytes: 3.95 MiB -> 1.47 MiB (2.7x less)
- memory_usage: ~3.9 MiB -> ~33 KiB (~120x less)
- cpu_time: ~11.2k us -> ~6.4k us (~1.7x less)
- read_rows / marks / parts: identical (same physical scan)

Result equivalence verified (bit-for-bit identical via sorted-TSV diff).

Latency win is negligible because `experiments` is small, but the memory and bytes-read drops still pay back at 651 calls/hour, and consistency matters: this DAO already uses LIMIT 1 BY 14 times, so this brings the last FINAL outlier in line with the rest of the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
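A sketch of what this rewrite plausibly looks like for the status filter. `ea_dedup` is the CTE name the PR text refers to; the column names and the `experiments` sort key are assumptions:

```sql
-- Assumed shape of FILTER_EXPERIMENT_IDS_BY_STATUS after the rewrite:
-- dedup experiments to their latest version first, then apply the status
-- predicate (status is not part of the sort key, so it must stay above
-- the dedup -- see the Details section of this PR).
WITH ea_dedup AS (
    SELECT workspace_id, dataset_id, id, status
    FROM experiments
    WHERE workspace_id = :workspace_id
      AND id IN (:experiment_ids)
    ORDER BY last_updated_at DESC
    LIMIT 1 BY workspace_id, dataset_id, id  -- assumed full sort key
)
SELECT id
FROM ea_dedup
WHERE status = :status;
```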
The STREAM constant (line 115) had one remaining `FROM spans final` inside the LEFT JOIN that aggregates per-trace total_estimated_cost / usage. Same anti-pattern as the four constants already rewritten, so apply the same fix here for consistency and to land the perf wins on the export path.

Empirical comparison on prod (workspace 09d8a1ca-…, 200 trace_ids; 1 warmup + 5 timed runs each, interleaved):
- duration: ~192 ms -> ~63 ms (~3.0x faster)
- read_rows: ~410k -> ~196k (~2.1x less; FINAL re-reads dupes)
- read_bytes: ~136 MiB -> ~15.7 MiB (~8.7x less)
- memory_usage: ~79 MiB -> ~38 MiB (~2.1x less)
- cpu_time: ~500k us -> ~93k us (~5.4x less)
- marks/parts: identical (~1060 / ~135) — same physical scan width

Result equivalence verified — 200 rows, bit-for-bit identical via sorted-TSV diff.

Prod volume on this query is low (~2 calls / 6 hours by log_comment `get_experiment_items_stream`, used as an export endpoint), so this isn't a frequency hotspot — but per-call cost is dominant for the endpoint and improvements scale with experiment size. After this commit, no `FINAL` SQL operators remain in this DAO; the only remaining mention is the explanatory comment on the rewrite pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
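A sketch of the join shape this commit describes — the surrounding STREAM query is elided, and all names, including the `spans` sort key and the `usage` map type, are assumptions:

```sql
-- Assumed shape: the per-trace cost/usage aggregation that used to read
-- `FROM spans FINAL` now dedups spans explicitly before aggregating.
SELECT t.id, s.total_estimated_cost, s.usage_totals
FROM traces t
LEFT JOIN (
    SELECT trace_id,
           sum(total_estimated_cost) AS total_estimated_cost,
           -- usage assumed to be Map(String, Int64); aggregated per key
           sumMap(mapKeys(usage), mapValues(usage)) AS usage_totals
    FROM (
        -- latest version of each span: equivalent to FINAL on a
        -- ReplacingMergeTree versioned by last_updated_at
        SELECT workspace_id, project_id, trace_id, id,
               total_estimated_cost, usage
        FROM spans
        WHERE workspace_id = :workspace_id
          AND trace_id IN (:trace_ids)
        ORDER BY last_updated_at DESC
        LIMIT 1 BY workspace_id, project_id, trace_id, id  -- assumed sort key
    )
    GROUP BY trace_id
) s ON s.trace_id = t.id
WHERE t.workspace_id = :workspace_id
  AND t.id IN (:trace_ids);
```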
We shouldn't be blindly changing all those queries. We checked, and there seem to be some scenarios where FINAL performs better than `ORDER BY ... LIMIT 1 BY ...`. Here is an example: #6567 (comment). Worth testing on large workspaces. Not just this, but the …
thiagohora left a comment
Not all of the sort keys are correct. Also, please verify on larger workspaces with significant experiment volume.
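For context, the full sorting key that each `LIMIT 1 BY` must reproduce can be read straight from ClickHouse system tables — a sketch, scoped to the current database:

```sql
-- Pull the authoritative sorting key for the tables touched by this PR,
-- so the LIMIT 1 BY column lists can be checked against it.
SELECT name, sorting_key
FROM system.tables
WHERE database = currentDatabase()
  AND name IN ('experiments', 'experiment_items', 'spans');
```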
Details
`ExperimentItemDAO` had five places using `FROM <table> FINAL` against `ReplacingMergeTree` tables: four whole SQL constants (`GET_EXPERIMENT_REFS_BY_TRACE_IDS`, `GET_EXPERIMENT_REFS_BY_ITEM_IDS`, `GET_EXPERIMENT_REFS_BY_SPAN_IDS`, `FILTER_EXPERIMENT_IDS_BY_STATUS`), plus one `FROM spans final` aggregation inside the `STREAM` constant. Production query-log analysis showed this pattern dominates ClickHouse load — the SPAN_IDS variant alone runs ~2.2k calls/hour averaging ~12 s each, scanning ~1.26 TiB/hour and ~7B rows. The `FILTER_*` variant runs another 651 calls/hour from the same `ExperimentAggregateEventListener` hot path. The `STREAM` export path runs at lower frequency but has a large per-call footprint.

This PR rewrites all five sites to dedup explicitly via `ORDER BY last_updated_at DESC` + `LIMIT 1 BY <full sort key>` inside CTEs/subqueries. That is the documented FINAL-equivalent for `ReplacingMergeTree` (the version column is `last_updated_at` for `experiment_items`, `experiments`, and `spans`). After this PR, no `FINAL` SQL operators remain in this DAO — only the explanatory comment about the rewrite pattern.

- `FINAL` materializes whole rows for the per-part merge before WHERE/JOIN predicates can filter; the rewrite reads only the projected columns and streams over the already-sorted primary key.
- The `status` filter is correctly kept after `ea_dedup` because `status` is not part of the `experiments` sort key — pushing it before dedup would change semantics (see the sketch after this list).
- The method signatures (`getExperimentRefsByIds`, `filterExperimentIdsByStatus`, the streaming `find`) are untouched.
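A minimal illustration of that semantic point about the `status` predicate — `ea_dedup` is the CTE name from the PR, while the column and sort-key names are assumed:

```sql
-- Wrong: filtering before dedup can resurrect a stale version. If the latest
-- version of an experiment has status = 'cancelled' but an older version had
-- 'running', this discards the latest row, and LIMIT 1 BY then keeps the
-- stale 'running' one.
SELECT workspace_id, dataset_id, id
FROM experiments
WHERE status = 'running'
ORDER BY last_updated_at DESC
LIMIT 1 BY workspace_id, dataset_id, id;

-- Right: dedup to the latest version first (as ea_dedup does), then filter.
WITH ea_dedup AS (
    SELECT workspace_id, dataset_id, id, status
    FROM experiments
    ORDER BY last_updated_at DESC
    LIMIT 1 BY workspace_id, dataset_id, id
)
SELECT id FROM ea_dedup WHERE status = 'running';
```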
Precedent — same pattern is already standard in this codebase

`LIMIT 1 BY` + `ORDER BY ... last_updated_at DESC` is used 207 times across 16 DAO files; the file we are editing already uses this pattern 14 times — the five FINAL sites were the outliers, not the rewrite.
The closest precedent is `DatasetItemVersionDAO.java:421-431`, which dedups the exact same tables (`experiments`, `experiment_items`) with the exact same sort keys used here.

Change checklist
Issues
NA — no Jira ticket; performance fix derived from production ClickHouse query-log analysis.
AI-WATERMARK
AI-WATERMARK: yes
Testing
Result equivalence — validated against the `opik_prod` ClickHouse cluster using real sampled IDs from a high-volume workspace, comparing original FINAL vs rewrite at every site:
- `GET_EXPERIMENT_REFS_BY_TRACE_IDS`
- `GET_EXPERIMENT_REFS_BY_ITEM_IDS`
- `GET_EXPERIMENT_REFS_BY_SPAN_IDS`
- `FILTER_EXPERIMENT_IDS_BY_STATUS`
- `STREAM` spans aggregation

Bit-for-bit equivalence confirmed via sorted-TSV diff (`diff <(orig) <(rewrite)` empty for all five).
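The diff above was done outside the database; a symmetric `EXCEPT` is an alternative in-database check on recent ClickHouse versions. A sketch using the TRACE_IDS pair as an example, with the same assumed names as earlier:

```sql
-- Both counts must be 0 for the two queries to return the same row set.
WITH
final_refs AS (
    SELECT id, experiment_id, trace_id
    FROM experiment_items FINAL
    WHERE workspace_id = :workspace_id AND trace_id IN (:trace_ids)
),
rewrite_refs AS (
    SELECT id, experiment_id, trace_id
    FROM (
        SELECT workspace_id, experiment_id, dataset_item_id, id, trace_id
        FROM experiment_items
        WHERE workspace_id = :workspace_id AND trace_id IN (:trace_ids)
        ORDER BY last_updated_at DESC
        LIMIT 1 BY workspace_id, experiment_id, dataset_item_id, id
    )
)
SELECT
    (SELECT count() FROM (SELECT * FROM final_refs
                          EXCEPT SELECT * FROM rewrite_refs)) AS only_in_final,
    (SELECT count() FROM (SELECT * FROM rewrite_refs
                          EXCEPT SELECT * FROM final_refs))   AS only_in_rewrite;
```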
GET_EXPERIMENT_REFS_BY_SPAN_IDS(the prod-slow variant; 1 warmup + 5 timed runs each, interleaved)Timing —
FILTER_EXPERIMENT_IDS_BY_STATUS(small table; 1 warmup + 5 timed runs each)Timing —
STREAMspans aggregation (200 trace_ids; 1 warmup + 5 timed runs each)Memory drop is the headline number across all variants — it directly raises how many of these can run concurrently without cluster pressure.
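Metrics like these can be pulled from `system.query_log` — a sketch, assuming the `log_comment` tagging mentioned in the commits and a recent ClickHouse version where `ProfileEvents` is a Map column:

```sql
-- Per-run metrics for the tagged query over the last hour, newest first.
SELECT
    query_duration_ms,
    formatReadableSize(read_bytes)   AS read,
    read_rows,
    formatReadableSize(memory_usage) AS mem,
    ProfileEvents['OSCPUVirtualTimeMicroseconds'] AS cpu_us
FROM system.query_log
WHERE type = 'QueryFinish'
  AND log_comment = 'get_experiment_items_stream'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC;
```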
Local commands run:
- `cd apps/opik-backend && mvn -q spotless:check -pl .` — passes

CI to run automatically:
- `backend_formatting_check.yml` (spotless)
- `backend_tests.yml` — `ExperimentItemProcessorTest` and `ExperimentAggregateEventListenerTest` exercise these paths; output is unchanged, so existing tests pass without modification

Documentation
N/A — internal SQL refactor with no public API or behavior change.