[NA] [BE] perf: replace FINAL with LIMIT 1 BY in ExperimentItemDAO refs queries #6621
YarivHashaiComet wants to merge 3 commits into main from
Conversation
…fs queries

Production analysis showed `GET_EXPERIMENT_REFS_BY_SPAN_IDS` running ~2,196 times/hour for a single workspace, averaging ~12 s per call and reading ~1.26 TiB/hour. The three sibling queries (`GET_EXPERIMENT_REFS_BY_TRACE_IDS`, `_ITEM_IDS`, `_SPAN_IDS`) all share the same `experiment_items FINAL JOIN experiments FINAL` anti-pattern.

Rewrite each as CTEs that explicitly dedup with `ORDER BY last_updated_at DESC` + `LIMIT 1 BY <full sort key>`. This is semantically equivalent to FINAL on a ReplacingMergeTree (the version column is `last_updated_at` for all three tables) and is already used in TraceDAO.find.

Why FINAL is slow here: it materializes whole rows of both tables for the merge before WHERE/JOIN apply, even though the predicates are highly selective. The CTE form reads only projected columns and streams over the already-sorted primary key.

Validation (workspace 09d8a1ca-…, opik_prod, prod cluster):
- TRACE_IDS: 188 rows; FINAL vs rewrite — bit-for-bit identical
- ITEM_IDS: 20 rows; FINAL vs rewrite — bit-for-bit identical
- SPAN_IDS: 85 rows; FINAL vs rewrite — bit-for-bit identical

Timing on the SPAN_IDS slow path (1 warmup + 5 timed runs each):
- duration median: FINAL 1034 ms -> rewrite 101 ms (10x)
- read_bytes avg: FINAL 228 MiB -> rewrite 33.7 MiB (6.8x less)
- memory avg: FINAL 103 MiB -> rewrite 2.5 MiB (41x less)
- cpu_time avg: FINAL 823 ms -> rewrite 133 ms (6.2x less)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
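A minimal sketch of the rewrite shape this commit describes. The table, column, and sort-key names below are illustrative assumptions, not the verbatim Opik schema, and `:params` stand in for the DAO's bind parameters:

```sql
-- Before (assumed shape): FINAL merges whole rows of every matching part in
-- both tables before the WHERE/JOIN predicates can narrow the scan.
SELECT ei.id, ei.experiment_id, ei.trace_id
FROM experiment_items ei FINAL
JOIN experiments e FINAL ON e.id = ei.experiment_id
WHERE ei.workspace_id = :workspace_id
  AND ei.trace_id IN (:trace_ids);

-- After (assumed shape): dedup each table explicitly. Sorting by
-- last_updated_at DESC and keeping LIMIT 1 BY <full sort key> retains the
-- newest version of each row -- the same answer FINAL computes on a
-- ReplacingMergeTree versioned by last_updated_at.
WITH ei_dedup AS (
    SELECT workspace_id, experiment_id, dataset_item_id, id, trace_id
    FROM experiment_items
    WHERE workspace_id = :workspace_id
      AND trace_id IN (:trace_ids)
    ORDER BY last_updated_at DESC
    LIMIT 1 BY workspace_id, experiment_id, dataset_item_id, id  -- assumed full sort key
),
e_dedup AS (
    SELECT workspace_id, dataset_id, id
    FROM experiments
    WHERE workspace_id = :workspace_id
    ORDER BY last_updated_at DESC
    LIMIT 1 BY workspace_id, dataset_id, id  -- assumed full sort key
)
SELECT ei.id, ei.experiment_id, ei.trace_id
FROM ei_dedup ei
JOIN e_dedup e ON e.id = ei.experiment_id;
```

Because `LIMIT BY` runs on the stream after the global `ORDER BY last_updated_at DESC`, the first row seen for each sort-key value is its newest version, which is exactly what FINAL would have produced.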
📋 PR Linter Failed — ❌ Missing Section. The description is missing required sections.
baz review
…IMIT 1 BY

Same pattern as the three GET_EXPERIMENT_REFS_BY_* constants, on the same hot path (`ExperimentAggregateEventListener.triggerByExperimentIds`). Production sample (last 60 min): 651 calls/hour, avg 11 ms per call, total 2.43 GiB read and avg 5.83 MiB memory per call.

Empirical comparison on prod (workspace b116eddf-…, 1 warmup + 5 timed runs each, interleaved):
- duration: ~7.5 ms -> ~7.2 ms (no meaningful change; small table)
- read_bytes: 3.95 MiB -> 1.47 MiB (2.7x less)
- memory_usage: ~3.9 MiB -> ~33 KiB (~120x less)
- cpu_time: ~11.2k us -> ~6.4k us (~1.7x less)
- read_rows / marks / parts: identical (same physical scan)

Result equivalence verified (bit-for-bit identical via sorted-TSV diff).

Latency win is negligible because `experiments` is small, but the memory and bytes-read drops still pay back at 651 calls/hour, and consistency matters: this DAO already uses LIMIT 1 BY 14 times, so this brings the last FINAL outlier in line with the rest of the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
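A sketch of what this rewrite plausibly looks like for the status filter. `ea_dedup` is the CTE name the PR text refers to; the column names and the `experiments` sort key are assumptions:

```sql
-- Assumed shape of FILTER_EXPERIMENT_IDS_BY_STATUS after the rewrite:
-- dedup experiments to their latest version first, then apply the status
-- predicate (status is not part of the sort key, so it must stay above
-- the dedup -- see the Details section of this PR).
WITH ea_dedup AS (
    SELECT workspace_id, dataset_id, id, status
    FROM experiments
    WHERE workspace_id = :workspace_id
      AND id IN (:experiment_ids)
    ORDER BY last_updated_at DESC
    LIMIT 1 BY workspace_id, dataset_id, id  -- assumed full sort key
)
SELECT id
FROM ea_dedup
WHERE status = :status;
```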
The STREAM constant (line 115) had one remaining `FROM spans final` inside the LEFT JOIN that aggregates per-trace total_estimated_cost / usage. Same anti-pattern as the four constants already rewritten, so apply the same fix here for consistency and to land the perf wins on the export path.

Empirical comparison on prod (workspace 09d8a1ca-…, 200 trace_ids; 1 warmup + 5 timed runs each, interleaved):
- duration: ~192 ms -> ~63 ms (~3.0x faster)
- read_rows: ~410k -> ~196k (~2.1x less; FINAL re-reads dupes)
- read_bytes: ~136 MiB -> ~15.7 MiB (~8.7x less)
- memory_usage: ~79 MiB -> ~38 MiB (~2.1x less)
- cpu_time: ~500k us -> ~93k us (~5.4x less)
- marks/parts: identical (~1060 / ~135) — same physical scan width

Result equivalence verified — 200 rows, bit-for-bit identical via sorted-TSV diff.

Prod volume on this query is low (~2 calls / 6 hours by log_comment `get_experiment_items_stream`, used as an export endpoint), so this isn't a frequency hotspot — but per-call cost is dominant for the endpoint and improvements scale with experiment size. After this commit, no `FINAL` SQL operators remain in this DAO; the only remaining mention is the explanatory comment on the rewrite pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
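A sketch of the join shape this commit describes — the surrounding STREAM query is elided, and all names, including the `spans` sort key and the `usage` map type, are assumptions:

```sql
-- Assumed shape: the per-trace cost/usage aggregation that used to read
-- `FROM spans FINAL` now dedups spans explicitly before aggregating.
SELECT t.id, s.total_estimated_cost, s.usage_totals
FROM traces t
LEFT JOIN (
    SELECT trace_id,
           sum(total_estimated_cost) AS total_estimated_cost,
           -- usage assumed to be Map(String, Int64); aggregated per key
           sumMap(mapKeys(usage), mapValues(usage)) AS usage_totals
    FROM (
        -- latest version of each span: equivalent to FINAL on a
        -- ReplacingMergeTree versioned by last_updated_at
        SELECT workspace_id, project_id, trace_id, id,
               total_estimated_cost, usage
        FROM spans
        WHERE workspace_id = :workspace_id
          AND trace_id IN (:trace_ids)
        ORDER BY last_updated_at DESC
        LIMIT 1 BY workspace_id, project_id, trace_id, id  -- assumed sort key
    )
    GROUP BY trace_id
) s ON s.trace_id = t.id
WHERE t.workspace_id = :workspace_id
  AND t.id IN (:trace_ids);
```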
We shouldn't be blindly changing all those queries. We checked, and there seem to be some scenarios where FINAL performs better than `ORDER BY ... LIMIT 1 BY ...`. Here is an example: #6567 (comment). Worth testing on large workspaces. Not just this, but the …
thiagohora left a comment
Not all of the sort keys are correct. Also, please verify on larger workspaces with significant experiment volume.
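For context, the full sorting key that each `LIMIT 1 BY` must reproduce can be read straight from ClickHouse system tables — a sketch, scoped to the current database:

```sql
-- Pull the authoritative sorting key for the tables touched by this PR,
-- so the LIMIT 1 BY column lists can be checked against it.
SELECT name, sorting_key
FROM system.tables
WHERE database = currentDatabase()
  AND name IN ('experiments', 'experiment_items', 'spans');
```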
Details
`ExperimentItemDAO` had five places using `FROM <table> FINAL` against `ReplacingMergeTree` tables: four whole SQL constants (`GET_EXPERIMENT_REFS_BY_TRACE_IDS`, `GET_EXPERIMENT_REFS_BY_ITEM_IDS`, `GET_EXPERIMENT_REFS_BY_SPAN_IDS`, `FILTER_EXPERIMENT_IDS_BY_STATUS`), plus one `FROM spans final` aggregation inside the `STREAM` constant. Production query-log analysis showed this pattern dominates ClickHouse load — the SPAN_IDS variant alone runs ~2.2k calls/hour averaging ~12 s each, scanning ~1.26 TiB/hour and ~7B rows. The `FILTER_*` variant runs another 651 calls/hour from the same `ExperimentAggregateEventListener` hot path. The `STREAM` export path runs at lower frequency but has a large per-call footprint.

This PR rewrites all five sites to dedup explicitly via `ORDER BY last_updated_at DESC` + `LIMIT 1 BY <full sort key>` inside CTEs/subqueries. That is the documented FINAL-equivalent for `ReplacingMergeTree` (the version column is `last_updated_at` for `experiment_items`, `experiments`, and `spans`). After this PR, no `FINAL` SQL operators remain in this DAO — only the explanatory comment about the rewrite pattern.

- `FINAL` materializes whole rows for the per-part merge before WHERE/JOIN predicates can filter; the rewrite reads only the projected columns and streams over the already-sorted primary key.
- The `status` filter is correctly kept after `ea_dedup` because `status` is not part of the `experiments` sort key — pushing it before dedup would change semantics (see the sketch after this list).
- The method signatures (`getExperimentRefsByIds`, `filterExperimentIdsByStatus`, the streaming `find`) are untouched.
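A minimal illustration of that semantic point about the `status` predicate — `ea_dedup` is the CTE name from the PR, while the column and sort-key names are assumed:

```sql
-- Wrong: filtering before dedup can resurrect a stale version. If the latest
-- version of an experiment has status = 'cancelled' but an older version had
-- 'running', this discards the latest row, and LIMIT 1 BY then keeps the
-- stale 'running' one.
SELECT workspace_id, dataset_id, id
FROM experiments
WHERE status = 'running'
ORDER BY last_updated_at DESC
LIMIT 1 BY workspace_id, dataset_id, id;

-- Right: dedup to the latest version first (as ea_dedup does), then filter.
WITH ea_dedup AS (
    SELECT workspace_id, dataset_id, id, status
    FROM experiments
    ORDER BY last_updated_at DESC
    LIMIT 1 BY workspace_id, dataset_id, id
)
SELECT id FROM ea_dedup WHERE status = 'running';
```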
Precedent — same pattern is already standard in this codebase

`LIMIT 1 BY` + `ORDER BY ... last_updated_at DESC` is used 207 times across 16 DAO files; the file we are editing already uses this pattern 14 times — the five FINAL sites were the outliers, not the rewrite.
The closest precedent is `DatasetItemVersionDAO.java:421-431`, which dedups the exact same tables (`experiments`, `experiment_items`) with the exact same sort keys used here.

Change checklist
Issues
NA — no Jira ticket; performance fix derived from production ClickHouse query-log analysis.
AI-WATERMARK
AI-WATERMARK: yes
Testing
Result equivalence — validated against the `opik_prod` ClickHouse cluster using real sampled IDs from a high-volume workspace, comparing original FINAL vs rewrite at every site:
- `GET_EXPERIMENT_REFS_BY_TRACE_IDS`
- `GET_EXPERIMENT_REFS_BY_ITEM_IDS`
- `GET_EXPERIMENT_REFS_BY_SPAN_IDS`
- `FILTER_EXPERIMENT_IDS_BY_STATUS`
- `STREAM` spans aggregation

Bit-for-bit equivalence confirmed via sorted-TSV diff (`diff <(orig) <(rewrite)` empty for all five).
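The diff above was done outside the database; a symmetric `EXCEPT` is an alternative in-database check on recent ClickHouse versions. A sketch using the TRACE_IDS pair as an example, with the same assumed names as earlier:

```sql
-- Both counts must be 0 for the two queries to return the same row set.
WITH
final_refs AS (
    SELECT id, experiment_id, trace_id
    FROM experiment_items FINAL
    WHERE workspace_id = :workspace_id AND trace_id IN (:trace_ids)
),
rewrite_refs AS (
    SELECT id, experiment_id, trace_id
    FROM (
        SELECT workspace_id, experiment_id, dataset_item_id, id, trace_id
        FROM experiment_items
        WHERE workspace_id = :workspace_id AND trace_id IN (:trace_ids)
        ORDER BY last_updated_at DESC
        LIMIT 1 BY workspace_id, experiment_id, dataset_item_id, id
    )
)
SELECT
    (SELECT count() FROM (SELECT * FROM final_refs
                          EXCEPT SELECT * FROM rewrite_refs)) AS only_in_final,
    (SELECT count() FROM (SELECT * FROM rewrite_refs
                          EXCEPT SELECT * FROM final_refs))   AS only_in_rewrite;
```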
GET_EXPERIMENT_REFS_BY_SPAN_IDS(the prod-slow variant; 1 warmup + 5 timed runs each, interleaved)Timing —
FILTER_EXPERIMENT_IDS_BY_STATUS(small table; 1 warmup + 5 timed runs each)Timing —
STREAMspans aggregation (200 trace_ids; 1 warmup + 5 timed runs each)Memory drop is the headline number across all variants — it directly raises how many of these can run concurrently without cluster pressure.
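Metrics like these can be pulled from `system.query_log` — a sketch, assuming the `log_comment` tagging mentioned in the commits and a recent ClickHouse version where `ProfileEvents` is a Map column:

```sql
-- Per-run metrics for the tagged query over the last hour, newest first.
SELECT
    query_duration_ms,
    formatReadableSize(read_bytes)   AS read,
    read_rows,
    formatReadableSize(memory_usage) AS mem,
    ProfileEvents['OSCPUVirtualTimeMicroseconds'] AS cpu_us
FROM system.query_log
WHERE type = 'QueryFinish'
  AND log_comment = 'get_experiment_items_stream'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC;
```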
Local commands run:
- `cd apps/opik-backend && mvn -q spotless:check -pl .` — passes

CI to run automatically:
- `backend_formatting_check.yml` (spotless)
- `backend_tests.yml` — `ExperimentItemProcessorTest` and `ExperimentAggregateEventListenerTest` exercise these paths; output is unchanged, so existing tests pass without modification

Documentation
N/A — internal SQL refactor with no public API or behavior change.