
[OPIK-5694] [BE] fix: eliminate FORMAT Values fast-path noise in batch inserts#6572

Open
ldaugusto wants to merge 7 commits into main from daniela/OPIK-5694-eliminate-format-values-fastpath-noise

Conversation

@ldaugusto
Contributor

@ldaugusto ldaugusto commented Apr 30, 2026

Details

"All involve parseDateTime64BestEffort(...) and now64(...) being literal-substituted into the SQL string by the query builder, rather than bound as parameters. […] Datetime fields are textually substituted into the SQL string before the row is sent."
(Liya Katz nailed the root cause on the ticket)

This PR is the empirical realisation of that diagnosis. After the previous code-70 fix landed, production ClickHouse was still logging a sustained stream of CANNOT_PARSE_INPUT_ASSERTION_FAILED (code 27), ILLEGAL_TYPE_OF_ARGUMENT (code 43), and CANNOT_PARSE_QUOTED_STRING (code 26) errors from the trace/span/thread batch ingestion paths. The inserts succeed, but every batch INSERT statement increments error counters in system.errors and writes noise to pod stderr.

We empirically isolated this against ClickHouse 25.3.6.56 (test container) and 25.3.8 (production version): every column whose value in the per-row tuple is a function call instead of a recognised literal trips the Values fast-path and counts an assertion in system.errors. The Values fast path only recognises canonical literals — quoted strings, plain numbers, NULL, array literals [...], map literals {key : value}. Anything else (parseDateTime64BestEffort(...), now64(...), if(... IS NULL, ...), toDecimal128(...), mapFromArrays(...), '...'::Type casts) falls back to the slow path and increments the counter once per offending cell per statement.

Fix

For every datetime/decimal/map cell in the three FORMAT Values templates affected, build the canonical literal in Java and bind it as a plain value, dropping every wrapping function from the SQL.

| Template | Cell | Before | After |
|---|---|---|---|
| TraceDAO.BATCH_INSERT | start_time | parseDateTime64BestEffort(:start_time, 9) | :start_time (canonical text via ClickHouseDateTimeFormat.formatNanos) |
| TraceDAO.BATCH_INSERT | end_time | if(:end_time IS NULL, NULL, parseDateTime64BestEffort(...)) | :end_time (canonical text or bindNull → literal NULL) |
| TraceDAO.BATCH_INSERT | last_updated_at | if(:last_updated_at IS NULL, now64(6), parseDateTime64BestEffort(...)) | :last_updated_at (canonical text; null → Instant.now() formatted) |
| TraceDAO.BATCH_INSERT | visibility_mode | if(:visibility_mode IS NULL, 'default', :visibility_mode) | :visibility_mode (binds 'default' literal when null) |
| SpanDAO.BULK_INSERT | start_time / end_time / last_updated_at | same as above | same as above |
| SpanDAO.BULK_INSERT | total_estimated_cost | toDecimal128(:total_estimated_cost, 12) | :total_estimated_cost (bind BigDecimal directly → driver emits plain numeric) |
| SpanDAO.BULK_INSERT | usage | mapFromArrays(:usage_keys, :usage_values) | :usage (bind Map<String, Integer> directly → driver emits {'k' : v} literal) |
| TraceThreadDAO.INSERT_THREADS_SQL | created_at / last_updated_at | parseDateTime64BestEffort(...) | canonical text in Java |
| TraceThreadDAO.INSERT_THREADS_SQL | sampling_per_rule | mapFromArrays(:rule_ids, :sampling) | :sampling_per_rule (bind Map<UUID, Boolean> directly) |

Other templates in the same files (single-row INSERT … SELECT, UPDATE_SPAN_SQL, UPDATE_THREAD_SQL) were audited and left alone — they don't use FORMAT Values, so the fast-path issue doesn't apply.

New util ClickHouseDateTimeFormat exposes formatNanos / formatMicros (UTC) for the two precision flavours used by the schema.
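A minimal sketch of what such a util could look like, assuming the canonical patterns described in the commit message below (the actual class in the PR may differ in details):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of the ClickHouseDateTimeFormat util, not the PR's
// exact code. It formats Instants into the canonical DateTime64 text forms
// that ClickHouse's Values fast path recognises as plain literals.
public final class ClickHouseDateTimeFormat {

    // Precision 9 (nanoseconds), e.g. 2026-04-30 12:34:56.123456789
    private static final DateTimeFormatter NANOS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS").withZone(ZoneOffset.UTC);

    // Precision 6 (microseconds), e.g. 2026-04-30 12:34:56.123456
    private static final DateTimeFormatter MICROS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS").withZone(ZoneOffset.UTC);

    private ClickHouseDateTimeFormat() {
    }

    public static String formatNanos(Instant instant) {
        return NANOS.format(instant);
    }

    public static String formatMicros(Instant instant) {
        return MICROS.format(instant);
    }
}
```

Binding the resulting String lets the driver emit a quoted literal like '2026-04-30 12:34:56.123456789', which the fast-path parser accepts directly.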

The ClickHouse skill (.agents/skills/opik-backend/clickhouse.md) gets a short section codifying the rule for future contributors: in FORMAT Values templates every cell must be a plain :placeholder bound to a value the driver serialises as a literal — function calls, casts, and if(...) wrappers in cells flood system.errors even when the insert succeeds.
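The rule can be illustrated with a before/after pair of template shapes (table and column names here are placeholders for the example, not the actual DAO constants):

```java
// Illustration of the skill rule above. Template text is paraphrased, not
// copied from the real DAOs.
public final class FastPathRuleExample {

    // BAD: a function call inside a Values tuple cell trips the fast-path
    // parser, increments system.errors, and forces the slow parser.
    static final String BAD = """
            INSERT INTO traces (id, start_time) VALUES
            (:id, parseDateTime64BestEffort(:start_time, 9))
            """;

    // GOOD: every cell is a plain placeholder bound to a value the driver
    // serialises as a canonical literal, so the fast path handles the row.
    static final String GOOD = """
            INSERT INTO traces (id, start_time) VALUES
            (:id, :start_time)
            """;
}
```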

Behaviour visible to API consumers is unchanged: lastUpdatedAt still defaults to "now" when omitted (the column DEFAULT was now64(6), we now compute Instant.now() in Java and format it the same way), visibilityMode still defaults to 'default', costs and usage are still recorded with the same precision and shape.

The previous code-70 fix on this ticket (if(... IS NULL, now64(6), parseDateTime64BestEffort(...))) is now superseded — the canonical-literal approach eliminates the same NULL→non-nullable conversion failure from a different angle, plus codes 26/27/43.

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-5694

AI-WATERMARK: yes

  • Tools: Claude Code
  • Model(s): Claude Opus 4.6 / 4.7
  • Scope: assisted (empirical isolation against two ClickHouse builds via raw HTTP, fix design, regression tests, PR write-up). Liya Katz authored the root-cause analysis on the ticket; this PR realises it.
  • Human verification: code review + side-by-side comparison of system.errors counters before/after the fix; targeted run of all 31 BatchInsert tests across TracesResourceTest and SpansResourceTest

Testing

Commands run:

mvn -B compile -DskipTests
mvn -B spotless:check
mvn -B test -Dtest='TracesResourceTest$BatchInsert,SpansResourceTest$BatchInsert'
mvn -B test -Dtest='FindTraceThreadsResourceTest,TraceThreadListenerTest,TraceThreadsClosingJobTest'

Results:

  • TracesResourceTest$BatchInsert: 19 tests pass.
  • SpansResourceTest$BatchInsert: 12 tests pass.
  • FindTraceThreadsResourceTest + TraceThreadListenerTest + TraceThreadsClosingJobTest: 407 tests pass (covers the modified TraceThreadDAO.INSERT_THREADS_SQL path).

Scenarios validated:

  • Pre-fix tests batch__whenTracesAreInserted__thenNoFastPathErrorsEmitted and batch__whenSpansAreInserted__thenNoFastPathErrorsEmitted measure system.errors deltas for codes 26 / 27 / 43 / 70 across batchCreateTraces / batchCreateSpans. With the fix, all four counters' deltas are zero. Without the fix the same tests would fail on multiple counters at once.
  • Persistence verified: each test calls getById for every inserted row to confirm rows are still present and the materialised duration column still computes correctly.
  • Bug also empirically reproduced via direct curl against ClickHouse 25.3.6.56-alpine (test container) and 25.3.8-alpine (production version) to confirm the result is not version-specific. Fixed-shape SQL produces zero increments on both.

Not run: full backend regression suite (out of scope for this fix).

Documentation

N/A — no behaviour change visible to API consumers.

…span/thread batch INSERTs

Per Liya Katz's investigation on OPIK-5694, the silent error counter
pollution in production ClickHouse (~system.errors codes 26 / 27 / 43 /
70) traces back to datetime function expressions being substituted into
the per-row tuple in `FORMAT Values`:

  parseDateTime64BestEffort(:start_time, 9),
  if(:end_time IS NULL, NULL, parseDateTime64BestEffort(:end_time, 9)),
  if(:last_updated_at IS NULL, now64(6), parseDateTime64BestEffort(...)),
  ...

Empirical isolation against ClickHouse 25.3.6.56 (test container) and
25.3.8 (production version) confirmed: every datetime cell expressed as
a function call adds +1 to code 27 per statement; NULL substituted into
a non-nullable column adds +1 to code 70. The Values fast-path parser
only recognises canonical text literals for DateTime64 cells; any
function or `::` cast falls back to the slow path and counts the
assertion in system.errors.

Fix: format datetime values to canonical ClickHouse text in Java
(`yyyy-MM-dd HH:mm:ss.SSSSSSSSS` for precision 9, `.SSSSSS` for
precision 6) and bind them as plain strings, dropping every parse /
if-wrap from the per-row tuple. The clickhouse-r2dbc driver substitutes
String values as `'<canonical>'` literals which the Values fast-path
recognises directly.

Same canonical-literal idea applied to:
- TraceDAO.BATCH_INSERT: start_time, end_time, last_updated_at,
  visibility_mode (was an `if(... IS NULL, 'default', ...)` wrap;
  now bound as 'default' literal when null)
- SpanDAO.BULK_INSERT: start_time, end_time, last_updated_at
  (toDecimal128 / mapFromArrays for cost / usage left for a follow-up
  per scope discussion)
- TraceThreadDAO.INSERT_THREADS_SQL: created_at, last_updated_at, and
  sampling_per_rule (the existing mapFromArrays(:rule_ids, :sampling)
  was replaced by binding a Java Map directly — the clickhouse-r2dbc
  driver natively serialises Map<UUID, Boolean> as `{'uuid': true}` map
  literal, no SQL function needed)

New util: ClickHouseDateTimeFormat with NANOS / MICROS formatters in UTC.

Tests: regression tests in TracesResourceTest$BatchInsert and
SpansResourceTest$BatchInsert assert codes 26 / 27 / 43 / 70 deltas
against system.errors before/after batchCreateTraces /
batchCreateSpans:
- Traces: all four codes assert isZero (covers BATCH_INSERT and the
  side-effect trace_threads INSERT_THREADS_SQL that ingestion triggers)
- Spans: code 70 asserts isZero. Codes 26 / 27 / 43 are NOT asserted on
  spans because BULK_INSERT still has toDecimal128 / mapFromArrays in
  non-datetime cells; eliminating those is out of scope for this PR.

Behaviour for clients is unchanged: `lastUpdatedAt` still defaults to
"now" when omitted; `visibilityMode` still defaults to 'default'.
…omArrays from BULK_INSERT

Same FORMAT Values fast-path issue as the rest of OPIK-5694: function
expressions in tuple cells trip ClickHouse's Values fast-path parser
and bump system.errors codes 26 / 27 / 43.

In SpanDAO.BULK_INSERT the remaining offenders were:
- toDecimal128(:total_estimated_cost, 12)  -- Decimal128 column
- mapFromArrays(:usage_keys, :usage_values) -- Map(String, Int32) column

The clickhouse-r2dbc driver natively serialises both types as canonical
literals: BigDecimal -> plain numeric (e.g. 0.000123), Map -> {key : value}
literal. So we just bind the values directly:
- bindCost: pass BigDecimal (not .toString()) so SQL gets `0.000123`
  instead of `'0.000123'` (the quoted form also trips the fast-path,
  empirically verified)
- usage: assemble a Map<String, Integer> in Java and bind it; the driver
  emits `{'in' : 2, 'out' : 5}` map literal
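The target shape the driver is described as emitting can be sketched in plain Java (illustrative only; this is not the clickhouse-r2dbc driver's actual serialisation code):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the {'key' : value} map-literal shape described above, which the
// Values fast path recognises as a canonical literal.
public final class MapLiteralSketch {

    public static String toMapLiteral(Map<String, Integer> usage) {
        return usage.entrySet().stream()
                .map(e -> "'" + e.getKey() + "' : " + e.getValue())
                .collect(Collectors.joining(", ", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, Integer> usage = new LinkedHashMap<>();
        usage.put("in", 2);
        usage.put("out", 5);
        System.out.println(toMapLiteral(usage)); // {'in' : 2, 'out' : 5}
    }
}
```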

Other SpanDAO templates (single-row INSERT, UPDATE_SPAN_SQL) still use
toDecimal128/mapFromArrays and are intentionally untouched -- those are
INSERT...SELECT statements, not FORMAT Values, so they don't hit the
Values fast-path.

Tests: SpansResourceTest now asserts all four codes (26/27/43/70) stay
at zero after batchCreateSpans, matching the traces test.
@github-actions github-actions Bot added the java, Backend, and tests labels Apr 30, 2026
@github-actions
Contributor

github-actions Bot commented Apr 30, 2026

Backend Tests - Integration Group 9

328 tests   326 ✅  4m 58s ⏱️
 30 suites    2 💤
 30 files      0 ❌

Results for commit 80e9687.

♻️ This comment has been updated with latest results.

Comment thread on apps/opik-backend/src/main/java/com/comet/opik/domain/SpanDAO.java (outdated)
…h insert tests

Audit of the regression tests revealed a gap: the new batch-insert tests
only exercised some null/non-null permutations of the fields this PR
touches. Beef up both tests so every branch the SQL/binding cares about
is hit:

Traces test (3 rows in one batch):
- Row A: endTime=null, lastUpdatedAt=null, visibilityMode=null
- Row B: endTime!=null, lastUpdatedAt!=null, visibilityMode=DEFAULT
- Row C: endTime!=null, lastUpdatedAt!=null, visibilityMode=HIDDEN

Spans test (2 rows in one batch):
- Row A: endTime=null, lastUpdatedAt=null, usage=null, totalEstimatedCost=null
- Row B: endTime!=null, lastUpdatedAt!=null, usage={prompt:12, completion:7},
  totalEstimatedCost=BigDecimal("0.000123456789")

Both tests still strictly assert codes 26 / 27 / 43 / 70 deltas == 0
against system.errors.

startTime is @NotNull at the API layer for both Trace and Span, so the
null branch is unreachable through the public batch endpoint and not
worth testing here.
Comment thread on apps/opik-backend/src/main/java/com/comet/opik/domain/TraceDAO.java (outdated)
@github-actions
Contributor

github-actions Bot commented Apr 30, 2026

Backend Tests - Integration Group 8

 35 files   35 suites   7m 25s ⏱️
458 tests 457 ✅ 1 💤 0 ❌
442 runs  441 ✅ 1 💤 0 ❌

Results for commit 80e9687.

♻️ This comment has been updated with latest results.

CI surfaced a precision regression in SpansResourceTest$CreateSpan#createAndGetCost
and $UpdateSpan#update__whenCostIsChanged after the previous commit changed
bindCost to bind BigDecimal directly.

Root cause: bindCost is shared across SQL templates with different shapes.
- BULK_INSERT now writes the cell as a bare :placeholder (Decimal128(12)
  column). It needs a BigDecimal binding so the driver emits an unquoted
  numeric — CH parses straight into Decimal128 and is lossless.
- The single-row INSERT and UPDATE templates still wrap the cell as
  toDecimal128(:cost, 12). Those need a String binding so the driver wraps
  with single quotes and CH parses via the Decimal-aware string path. With
  a BigDecimal binding the unquoted literal toDecimal128(123.45, 12) routes
  through Float64 and drops digits beyond ~17 significant figures (the test
  failure mode).

Fix:
- Restore bindCost to the original String binding (.toString()) — keeps
  INSERT/UPDATE precision intact.
- Add bindCostForBulkInsert that binds BigDecimal directly. Used only in
  the BULK_INSERT loop. Keeps that cell at +0 system.errors increments.

Also added a brief section to the ClickHouse skill documenting the rule:
in FORMAT Values templates, every cell must be a plain placeholder bound
to a value the driver serialises as a canonical literal — function calls
in cells flood system.errors / pod stderr even when the insert itself
succeeds.
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 1, 2026
@ldaugusto ldaugusto marked this pull request as ready for review May 1, 2026 09:37
@ldaugusto ldaugusto requested review from a team as code owners May 1, 2026 09:37
- Inline the BigDecimal-direct cost bind at the BULK_INSERT call site (only
  call site for that variant), drop the bindCostForBulkInsert helper and
  the docstring on the existing bindCost — both unnecessary for one
  call site.
- Tighten the ClickHouse skill addition to drop the redundant single-row
  INSERT/UPDATE caveat (the rule scope is already in the heading) and
  dedupe overlapping language.
Comment thread on apps/opik-backend/src/main/java/com/comet/opik/domain/SpanDAO.java
@github-actions
Contributor

github-actions Bot commented May 1, 2026

Backend Tests - Integration Group 11

204 tests   204 ✅  4m 7s ⏱️
 30 suites    0 💤
 30 files      0 ❌

Results for commit 1857bc0.

♻️ This comment has been updated with latest results.

CI surfaced ProjectsResourceTest$FindProject#getProjects__whenProjectsHasTracesBatch__thenReturnProjectWithLastUpdatedTraceAt
failing with a sub-millisecond timestamp diff between getById of the last
trace and the project's MAX(last_updated_at). Reproduces under CI load
but not deterministically in isolation.

Root cause: the prior server-side wrap was if(:last_updated_at IS NULL,
now64(6), parseDateTime64BestEffort(...)). ClickHouse evaluates now64() once
per query — every batch row with null lastUpdatedAt got the SAME timestamp,
making `getById(any).lastUpdatedAt == MAX(...)` trivially true. After the
canonical-literal refactor we were calling Instant.now() per row in the bind
loop, so each null-lastUpdatedAt row got a distinct sequential timestamp.
While `getById(traces.getLast())` should still equal MAX in the steady case,
JVM scheduling jitter under CI load made the assertion fragile.

Fix: capture Instant.now() once per batch (outside the bind loop) and reuse
it for every row whose client did not provide lastUpdatedAt. Restores the
prior single-timestamp-per-batch invariant the test relies on.

Same change in SpanDAO. TraceThreadDAO doesn't need it — that DAO never
falls back to Instant.now(); it always uses item.lastUpdatedAt() / item.createdAt().
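The single-timestamp-per-batch pattern described in this commit can be sketched as follows (names are illustrative, not the PR's actual code):

```java
import java.time.Instant;
import java.util.List;

// Hypothetical row type standing in for a trace/span batch item.
record Row(Instant lastUpdatedAt) {}

final class BatchTimestampSketch {

    // Capture Instant.now() once per batch, outside the bind loop, so every
    // row whose client omitted lastUpdatedAt shares the same timestamp --
    // restoring the invariant the server-side now64(6) DEFAULT provided.
    static Instant[] resolveLastUpdatedAt(List<Row> rows) {
        Instant batchNow = Instant.now(); // once per batch, not per row
        return rows.stream()
                .map(r -> r.lastUpdatedAt() != null ? r.lastUpdatedAt() : batchNow)
                .toArray(Instant[]::new);
    }
}
```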
Comment thread on apps/opik-backend/src/main/java/com/comet/opik/domain/SpanDAO.java
Review feedback on PR #6572:

The scored_at cell in INSERT_THREADS_SQL is a real fast-path concern that
slipped through the original audit. INSERT_THREADS_SQL is a FORMAT Values
template and binds scored_at as a bare :placeholder (no parseDateTime64BestEffort
wrap). Until now we passed Instant.toString() (ISO-8601 with T/Z), which the
new ClickHouseDateTimeFormat util explicitly warns against — when scoredAt is
non-null (re-insert after scoring), CH would not recognise the literal and
would trip codes 26/27/43 the same way the rest of the PR addresses.

Fix line 335 to use ClickHouseDateTimeFormat.formatNanos(item.scoredAt()).

Also applied the same canonical-text format at the two other scored_at bind
sites (UPDATE_THREAD_SQL line 516 and UPDATE_THREAD_SCORED_AT line 566) — those
are INSERT...SELECT / UPDATE statements wrapped in parseDateTime64BestEffort
so they were not affected by the fast-path issue, but consistent canonical
binding is zero-cost and avoids someone copy-pasting the ISO-8601 form into
a future FORMAT Values template.

Aligned bindNull("scored_at", Instant.class) → String.class with the rest of
the nullable-DateTime cells in this PR.
Comment on lines 334 to 338

 if (item.scoredAt() != null) {
-    statement.bind("scored_at" + i, item.scoredAt().toString());
+    statement.bind("scored_at" + i, ClickHouseDateTimeFormat.formatNanos(item.scoredAt()));
 } else {
-    statement.bindNull("scored_at" + i, Instant.class);
+    statement.bindNull("scored_at" + i, String.class);
 }
Contributor

mapAndInsert and updateThreadSampledValues both duplicate the same scoredAt bind/null block, so should we extract a shared helper like bindScoredAt(statement, prefix, scoredAt) to keep serialization in one place?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low



Member

@andrescrz andrescrz left a comment


Hi @ldaugusto — I saw that a similar PR was recently merged, but before we move forward with reviewing this one, have we explored the possibility of addressing these errors at the source (e.g., via configuration)?

From what I can see, the functions being used here are valid, tests are passing, and the queries/DAOs behave as expected. The entries in ClickHouse system.errors seem to be more of an unintended side effect rather than an indication of incorrect behavior.

While we can patch individual usages, this approach may not scale well—engineers could legitimately use similar (or new) functions in the future, and we might end up reintroducing the same kind of noise.

It might be worth investigating whether there’s a configuration-based solution to suppress or handle these cases more appropriately, rather than modifying code paths that are otherwise correct and functioning as intended.

@ldaugusto
Copy link
Copy Markdown
Contributor Author

@andrescrz during this task I've learned that when there's an INSERT ... VALUES, ClickHouse first tries a stream parser to handle the values, so it doesn't have to load everything into memory (which is great for our trace/span inserts). This stream parser is faster, but it only supports literals. Once it sees an expression, the parser fails, increments the error count, logs the error, and then falls back to the full SQL parser, which is slower and uses more memory but can handle expressions, so the statement completes correctly (and the DAOs work).

I guess we could add a config to discard this logging, but then we would also be dropping logs for new, genuine problems, so I don't believe it's a good idea.

With this approach our queries run without triggering these logs, using only the fast path (stream parser). I've updated the ClickHouse skill to keep this problem from spreading.

Member

@andrescrz andrescrz left a comment


Only minor suggestions, but LGTM. Thanks for the investigation, the fix and the improvements!

Comment on lines +1565 to +1573

// BULK_INSERT writes the cost cell directly into Decimal128(12) (no toDecimal128
// wrap), so the driver must emit an unquoted numeric literal — bind the
// BigDecimal itself rather than its String form. See OPIK-5694.
BigDecimal cost = span.totalEstimatedCost() != null ? span.totalEstimatedCost() : calculateCost(span);
statement.bind("total_estimated_cost" + i, cost);
statement.bind("total_estimated_cost_version" + i,
        span.totalEstimatedCost() == null && cost.compareTo(BigDecimal.ZERO) > 0
                ? ESTIMATED_COST_VERSION
                : "");
Member


Minor: DRY and encapsulate, aiming to reuse the previous bindCost method. Refactor that one if you need to.

Comment on lines +34 to +38

public static String formatNanos(Instant instant) {
    return NANOS.format(instant);
}

public static String formatMicros(Instant instant) {
Member


Minor: if non-null isn't handled (I believe so), annotate the instant arg with Lombok's @NonNull.

Comment on lines +2097 to +2098
@DisplayName("when batch spans are inserted, no FORMAT Values fast-path errors are emitted (OPIK-5694)")
void batch__whenSpansAreInserted__thenNoFastPathErrorsEmitted(
Member

@andrescrz andrescrz May 6, 2026


Minor: you created these tests in the previous fix. Honestly, feel free to clean them up completely. Reasons:

  1. This is more of an implementation detail at the ClickHouse level that we shouldn't test directly. It should be caught at the operational level, like it was. Your manual test is enough; I believe we don't need a regression test.
  2. It reads ClickHouse directly, which breaks the principle of this test class (and of the service in general) of being black-box tests.

Anyhow, keep them if you prefer.

 @Test
-@DisplayName("when batch contains a trace with null lastUpdatedAt, then no CANNOT_CONVERT_TYPE errors are emitted (OPIK-5694)")
-void batch__whenTraceHasNullLastUpdatedAt__thenNoCannotConvertTypeErrors(
+@DisplayName("when batch traces are inserted, no FORMAT Values fast-path errors are emitted (OPIK-5694)")
Member

@andrescrz andrescrz May 6, 2026


Minor: same about this other test. I recommend deleting it completely.

@andrescrz
Member

> @andrescrz during this task I've learned that when there's an INSERT ... VALUES, ClickHouse first tries a stream parser to handle the values, so it doesn't have to load everything into memory (which is great for our trace/span inserts). This stream parser is faster, but it only supports literals. Once it sees an expression, the parser fails, increments the error count, logs the error, and then falls back to the full SQL parser, which is slower and uses more memory but can handle expressions, so the statement completes correctly (and the DAOs work).
>
> I guess we could add a config to discard this logging, but then we would also be dropping logs for new, genuine problems, so I don't believe it's a good idea.
>
> With this approach our queries run without triggering these logs, using only the fast path (stream parser). I've updated the ClickHouse skill to keep this problem from spreading.

@ldaugusto discussed offline. Thanks for the explanation, that makes sense. The key learning is that it's not only noisy error logs: the fast parser isn't used, which has side effects on memory usage and speed, so it's desirable to tackle. In addition, I'd rather have an architecture of cleaner queries without business logic in the form of SQL functions, so I like the improvement of making those more neutral and just passing plain values.


Labels

Backend · documentation · java · tests
