
[NA] [BE] perf: replace FINAL with LIMIT 1 BY in ExperimentItemDAO refs queries#6621

Draft
YarivHashaiComet wants to merge 3 commits into main from YarivHashaiComet/perf-experiment-refs-final-to-limit-1-by

Conversation

Collaborator

@YarivHashaiComet YarivHashaiComet commented May 6, 2026

Details

ExperimentItemDAO had five places using FROM <table> FINAL against ReplacingMergeTree tables: four whole SQL constants (GET_EXPERIMENT_REFS_BY_TRACE_IDS, GET_EXPERIMENT_REFS_BY_ITEM_IDS, GET_EXPERIMENT_REFS_BY_SPAN_IDS, FILTER_EXPERIMENT_IDS_BY_STATUS), plus one FROM spans final aggregation inside the STREAM constant. Production query-log analysis showed this pattern dominates ClickHouse load — the SPAN_IDS variant alone runs ~2.2k calls/hour averaging ~12 s each, scanning ~1.26 TiB and ~7B rows per hour. The FILTER_* variant runs another 651 calls/hour from the same ExperimentAggregateEventListener hot path. The STREAM export path runs at lower frequency but has a large per-call footprint.

This PR rewrites all five sites to dedup explicitly via ORDER BY last_updated_at DESC + LIMIT 1 BY <full sort key> inside CTEs/subqueries. That is the documented FINAL-equivalent for ReplacingMergeTree (the version column is last_updated_at for experiment_items, experiments, and spans).
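
To make the shape of the change concrete, here is a minimal before/after sketch of the pattern on experiment_items (a sketch only: the sort key and version column are the ones named above, while the projected columns and the `:workspace_id` / `:trace_ids` placeholders are illustrative rather than the exact SQL in the diff):

```sql
-- Before: FINAL merges whole rows per part before the WHERE can filter.
SELECT workspace_id, experiment_id, dataset_item_id, trace_id, id
FROM experiment_items FINAL
WHERE workspace_id = :workspace_id
  AND trace_id IN :trace_ids;

-- After: dedup explicitly, keeping the newest version of each row.
SELECT workspace_id, experiment_id, dataset_item_id, trace_id, id
FROM experiment_items
WHERE workspace_id = :workspace_id
  AND trace_id IN :trace_ids
ORDER BY last_updated_at DESC
LIMIT 1 BY workspace_id, experiment_id, dataset_item_id, trace_id, id;
```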

After this PR, no FINAL SQL operators remain in this DAO — only the explanatory comment about the rewrite pattern.

  • FINAL materializes whole rows for the per-part merge before WHERE/JOIN predicates can filter; the rewrite reads only the projected columns and streams over the already-sorted primary key.
  • The status filter is correctly kept after ea_dedup because status is not part of the experiments sort key — pushing it before dedup would change semantics (see the sketch after this list).
  • Bind-parameter signatures unchanged for all call sites; Java callers (getExperimentRefsByIds, filterExperimentIdsByStatus, the streaming find) are untouched.
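
For the status-filter point above, a minimal sketch of the FILTER_EXPERIMENT_IDS_BY_STATUS shape, assuming the ea_dedup CTE named in the description (column selection and bind placeholders are illustrative):

```sql
WITH ea_dedup AS (
    -- Dedup experiments first: keep only the latest version of each row.
    SELECT id, status
    FROM experiments
    WHERE workspace_id = :workspace_id
      AND id IN :experiment_ids
    ORDER BY (workspace_id, dataset_id, id) DESC, last_updated_at DESC
    LIMIT 1 BY id
)
-- status is filtered only after dedup: status is not part of the sort key,
-- so filtering earlier could match a stale version of a row.
SELECT id
FROM ea_dedup
WHERE status = :status;
```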

Precedent — same pattern is already standard in this codebase

LIMIT 1 BY + ORDER BY ... last_updated_at DESC is used 207 times across 16 DAO files:

| file | occurrences |
| --- | --- |
| ExperimentDAO.java | 27 |
| DatasetItemVersionDAO.java | 30 |
| TraceDAO.java | 25 |
| ThreadDAO.java | 22 |
| DatasetItemDAO.java | 16 |
| ExperimentAggregatesDAO.java | 14 |
| ExperimentItemDAO.java (this file) | 14 |
| SpanDAO.java | 11 |
| KpiCardDAO.java, OptimizationDAO.java, ProjectMetricsDAO.java | 10 each |
| FeedbackScoreDAO.java | 7 |
| AnnotationQueueDAO.java, TraceThreadDAO.java | 3–4 |
| AttachmentDAO.java, CommentDAO.java | 2 each |

The file we are editing already uses this pattern 14 times — the five FINAL sites were the outliers, not the rewrite.

The closest precedent is DatasetItemVersionDAO.java:421-431, which dedups the exact same tables (experiments, experiment_items) with the exact same sort keys used here:

-- experiments dedup
ORDER BY (workspace_id, dataset_id, id) DESC, last_updated_at DESC
LIMIT 1 BY id

-- experiment_items dedup
ORDER BY (ei.workspace_id, ei.experiment_id, ei.dataset_item_id, ei.trace_id, ei.id) DESC, ei.last_updated_at DESC
LIMIT 1 BY ei.id

Change checklist

  • User facing
  • Documentation update

Issues

NA — no Jira ticket; performance fix derived from production ClickHouse query-log analysis.

AI-WATERMARK

AI-WATERMARK: yes

  • Tools: Claude Code
  • Model(s): Claude Opus 4.7
  • Scope: full implementation (SQL rewrite + empirical validation across all five sites)
  • Human verification: production result-equivalence checks, interleaved timing runs, code review of the diff

Testing

Result equivalence — validated against opik_prod ClickHouse cluster using real sampled IDs from a high-volume workspace, comparing original FINAL vs rewrite at every site:

| site | rows returned (FINAL) | rows returned (rewrite) | identical |
| --- | --- | --- | --- |
| GET_EXPERIMENT_REFS_BY_TRACE_IDS | 188 | 188 | yes |
| GET_EXPERIMENT_REFS_BY_ITEM_IDS | 20 | 20 | yes |
| GET_EXPERIMENT_REFS_BY_SPAN_IDS | 85 | 85 | yes |
| FILTER_EXPERIMENT_IDS_BY_STATUS | 1 | 1 | yes |
| STREAM spans aggregation | 200 | 200 | yes |

Bit-for-bit equivalence confirmed via sorted-TSV diff (`diff <(orig) <(rewrite)` is empty for all five sites).

Timing — GET_EXPERIMENT_REFS_BY_SPAN_IDS (the prod-slow variant; 1 warmup + 5 timed runs each, interleaved)

| metric | FINAL | LIMIT 1 BY | improvement |
| --- | --- | --- | --- |
| duration (median) | 1,034 ms | 101 ms | 10.2x faster |
| read_bytes (avg) | 228 MiB | 33.7 MiB | 6.8x less |
| memory_usage (avg) | 103 MiB | 2.5 MiB | 41x less |
| cpu_time (avg) | 823 ms | 133 ms | 6.2x less |
| marks scanned | ~9,485 | ~9,479 | same scan width |

Timing — FILTER_EXPERIMENT_IDS_BY_STATUS (small table; 1 warmup + 5 timed runs each)

| metric | FINAL | LIMIT 1 BY | improvement |
| --- | --- | --- | --- |
| duration (mean) | ~7.5 ms | ~7.2 ms | no change (table is small) |
| read_bytes (avg) | 3.95 MiB | 1.47 MiB | 2.7x less |
| memory_usage (avg) | ~3.9 MiB | ~33 KiB | ~120x less |
| cpu_time (avg) | ~11.2k µs | ~6.4k µs | 1.7x less |

Timing — STREAM spans aggregation (200 trace_ids; 1 warmup + 5 timed runs each)

| metric | FINAL | LIMIT 1 BY | improvement |
| --- | --- | --- | --- |
| duration (avg) | ~192 ms | ~63 ms | ~3.0x faster |
| read_rows (avg) | ~410k | ~196k | ~2.1x less (FINAL re-reads dupes) |
| read_bytes (avg) | ~136 MiB | ~15.7 MiB | ~8.7x less |
| memory_usage (avg) | ~79 MiB | ~38 MiB | ~2.1x less |
| cpu_time (avg) | ~500k µs | ~93k µs | ~5.4x less |
| marks/parts | ~1,060 / ~135 | identical | same scan width |

Memory is the headline improvement across all variants: it directly determines how many of these queries can run concurrently without putting pressure on the cluster.

Local commands run:

  • `cd apps/opik-backend && mvn -q spotless:check -pl .` — passes

CI to run automatically:

  • backend_formatting_check.yml (spotless)
  • backend_tests.yml: ExperimentItemProcessorTest and ExperimentAggregateEventListenerTest exercise these paths; the query output is unchanged, so existing tests pass without modification

Documentation

N/A — internal SQL refactor with no public API or behavior change.

…fs queries

Production analysis showed `GET_EXPERIMENT_REFS_BY_SPAN_IDS` running
~2,196 times/hour for a single workspace, averaging ~12s per call and
reading ~1.26 TiB/hour. The three sibling queries
(GET_EXPERIMENT_REFS_BY_TRACE_IDS, _ITEM_IDS, _SPAN_IDS) all share the
same `experiment_items FINAL JOIN experiments FINAL` anti-pattern.

Rewrite each as CTEs that explicitly dedup with
`ORDER BY last_updated_at DESC` + `LIMIT 1 BY <full sort key>`.
This is semantically equivalent to FINAL on a ReplacingMergeTree (the
version column is `last_updated_at` for all three tables) and is
already used in TraceDAO.find.

Why FINAL is slow here: it materializes whole rows of both tables for
the merge before WHERE/JOIN apply, even though the predicates are
highly selective. The CTE form reads only projected columns and
streams over the already-sorted primary key.

Validation (workspace 09d8a1ca-…, opik_prod, prod cluster):
- TRACE_IDS: 188 rows; FINAL vs rewrite — bit-for-bit identical
- ITEM_IDS:  20  rows; FINAL vs rewrite — bit-for-bit identical
- SPAN_IDS:  85  rows; FINAL vs rewrite — bit-for-bit identical

Timing on the SPAN_IDS slow path (1 warmup + 5 timed runs each):
- duration median: FINAL 1034 ms -> rewrite 101 ms (10x)
- read_bytes avg:  FINAL 228 MiB -> rewrite 33.7 MiB (6.8x less)
- memory avg:      FINAL 103 MiB -> rewrite 2.5 MiB (41x less)
- cpu_time avg:    FINAL 823 ms  -> rewrite 133 ms (6.2x less)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

github-actions Bot commented May 6, 2026

📋 PR Linter Failed

Missing Section. The description is missing the `## Details` section.
Missing Section. The description is missing the `## Change checklist` section.
Missing Section. The description is missing the `## Issues` section.
Missing Section. The description is missing the `## Testing` section.
Missing Section. The description is missing the `## Documentation` section.

github-actions Bot added the java (Pull requests that update Java code) and Backend labels on May 6, 2026
@YarivHashaiComet
Collaborator Author

baz review

YarivHashaiComet and others added 2 commits May 6, 2026 18:14
…IMIT 1 BY

Same pattern as the three GET_EXPERIMENT_REFS_BY_* constants, on the same
hot path (`ExperimentAggregateEventListener.triggerByExperimentIds`).

Production sample (last 60 min): 651 calls/hour, avg 11ms per call,
total 2.43 GiB read and avg 5.83 MiB memory per call.

Empirical comparison on prod (workspace b116eddf-…, 1 warmup + 5 timed
runs each, interleaved):
- duration:      ~7.5 ms -> ~7.2 ms (no meaningful change; small table)
- read_bytes:    3.95 MiB -> 1.47 MiB (2.7x less)
- memory_usage:  ~3.9 MiB -> ~33 KiB (~120x less)
- cpu_time:      ~11.2k us -> ~6.4k us (~1.7x less)
- read_rows / marks / parts: identical (same physical scan)

Result equivalence verified (bit-for-bit identical via sorted-TSV diff).

Latency win is negligible because `experiments` is small, but the memory
and bytes-read drops still pay back at 651 calls/hour, and consistency
matters: this DAO already uses LIMIT 1 BY 14 times, so this brings the
last FINAL outlier in line with the rest of the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The STREAM constant (line 115) had one remaining `FROM spans final` inside
the LEFT JOIN that aggregates per-trace total_estimated_cost / usage.
Same anti-pattern as the four constants already rewritten, so apply the
same fix here for consistency and to land the perf wins on the export
path.

Empirical comparison on prod (workspace 09d8a1ca-…, 200 trace_ids;
1 warmup + 5 timed runs each, interleaved):
- duration:      ~192 ms -> ~63 ms  (~3.0x faster)
- read_rows:     ~410 k -> ~196 k   (~2.1x less; FINAL re-reads dupes)
- read_bytes:    ~136 MiB -> ~15.7 MiB (~8.7x less)
- memory_usage:  ~79 MiB -> ~38 MiB  (~2.1x less)
- cpu_time:      ~500k us -> ~93k us (~5.4x less)
- marks/parts:   identical (~1060 / ~135) — same physical scan width

Result equivalence verified — 200 rows, bit-for-bit identical via
sorted-TSV diff.

Prod volume on this query is low (~2 calls / 6 hours by log_comment
`get_experiment_items_stream`, used as an export endpoint), so this
isn't a frequency hotspot — but per-call cost is dominant for the
endpoint and improvements scale with experiment size.

After this commit, no `FINAL` SQL operators remain in this DAO; the
only remaining mention is the explanatory comment on the rewrite
pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thiagohora
Contributor

thiagohora commented May 6, 2026

We shouldn't be blindly changing all those queries. We checked, and there seem to be some scenarios where FINAL performs better than ORDER BY ... LIMIT 1 BY. Here is an example: #6567 (comment). Worth testing on large workspaces.

Not just this, but the ORDER BY is only optimized if its prefix is exactly the same as the table's sorting key.

Contributor

@thiagohora thiagohora left a comment


Not all of the sortable keys are correct. Also, please verify on larger workspaces with significant experiment volume.
