[OPIK-6218] [BE] fix: compare endpoint missing version-level evaluators#6487

Open
itamargolan wants to merge 4 commits into main from itamar/OPIK-6218-compare-endpoint-missing-version-evaluators

Conversation

Contributor

@itamargolan itamargolan commented Apr 24, 2026

Details

The compare endpoint (GET /api/v1/datasets/{id}/items/experiments/items) only returned item-level evaluators from ClickHouse dataset_item_versions, never version-level evaluators from MySQL. This caused dataset items with no item-level assertions to show "Skipped - No assertions defined" in the UI, even when the dataset version had version-level assertions that applied to all items.

The fix builds a per-version evaluator map by looking up each experiment's dataset_version_id, fetching the corresponding version evaluators from MySQL, and merging them into each item — version-level first, then item-level — matching TestSuiteAssertionSampler behavior. Items from different dataset versions each get their own version's evaluators.

  • Added ExperimentDAO dependency to DatasetItemServiceImpl to resolve experiment → dataset version mapping
  • Flowed dataset_version_id through ClickHouse query, result mapper, and DatasetItem DTO (as @JsonIgnore) so each item knows which version it belongs to
  • Renamed versionId → legacyFallbackVersionId in the DAO interface to clarify its purpose (only used for pre-versioning experiments)
  • Demoted DAO-level log from INFO to DEBUG (service already logs at INFO)
  • Wrapped blocking MySQL call in Mono.fromCallable + subscribeOn(Schedulers.boundedElastic())
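The per-version merge described above can be sketched as follows. This is an illustrative stand-in, not the real service code: evaluators are modeled as plain strings and the method names are hypothetical; only the merge ordering (version-level first, then item-level) and the per-item routing by dataset_version_id reflect the PR's behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Minimal sketch of the per-version evaluator merge; types are
// simplified stand-ins for the real DatasetItem/EvaluatorItem types.
class EvaluatorMergeSketch {

    // Merge version-level evaluators first, then item-level ones,
    // mirroring the ordering used by TestSuiteAssertionSampler.
    static List<String> merge(List<String> versionLevel, List<String> itemLevel) {
        var merged = new ArrayList<String>();
        merged.addAll(versionLevel);
        merged.addAll(itemLevel);
        return merged;
    }

    // Each item is routed to its own version's evaluators via the
    // dataset_version_id it carries; unknown versions fall back to
    // item-level evaluators only.
    static List<String> evaluatorsFor(UUID itemVersionId,
            Map<UUID, List<String>> byVersion, List<String> itemLevel) {
        var versionLevel = byVersion.getOrDefault(itemVersionId, List.of());
        return merge(versionLevel, itemLevel);
    }
}
```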

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-6218

AI-WATERMARK

AI-WATERMARK: yes

  • Tools: Claude Code (CLI)
  • Model(s): Claude Opus 4.6
  • Scope: implementation, tests
  • Human verification: code reviewed, tests verified, UI tested locally

Testing

  • mvn test -Dtest="DatasetItemServiceApplyVersionEvaluatorsTest" — unit tests covering both applyVersionEvaluators (parameterized empty/null/empty-list variants, merge ordering, multiple version evaluators, field preservation) and applyVersionEvaluatorsToPage (per-version routing, unknown version skipping, null version skipping, empty map passthrough)
  • mvn test -Dtest="DatasetItemServiceFilterDataTest" — 8 existing tests pass with updated constructor
  • mvn spotless:check — formatting clean
  • Local UI testing via scripts/dev-runner.sh --restart:
    • Playground: run experiment on test suite v3 with version-level assertions → all items show evaluator results
    • Compare experiments page: items display correct evaluators

Documentation

No documentation changes needed — this is a backend bug fix with no API contract changes.

The compare endpoint only returned item-level evaluators from ClickHouse,
never version-level evaluators from MySQL. This caused items with no
item-level assertions to show "Skipped" even when the version had assertions.

Now fetches the experiment's actual dataset version evaluators and merges
them into each item (version-level first, then item-level), matching the
experiment runner behavior in TestSuiteAssertionSampler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@itamargolan itamargolan requested a review from a team as a code owner April 24, 2026 18:45
@github-actions bot added the java, Backend, and tests labels Apr 24, 2026
Comment thread apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemService.java Outdated
Comment thread apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemService.java Outdated
itamargolan and others added 2 commits April 24, 2026 16:00
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Build per-version evaluator map instead of picking arbitrary experiment
- Each item gets its own version's evaluators merged with item-level ones
- Flow dataset_version_id through ClickHouse query → mapper → DTO
- Demote DAO INFO log to DEBUG (service already logs at INFO)
- Collapse 3 empty-evaluator tests into @ParameterizedTest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Apr 27, 2026

Backend Tests - Integration Group 1

414 tests   414 ✅  15m 57s ⏱️
 23 suites    0 💤
 23 files      0 ❌

Results for commit 8213f37.


…apper

ClickHouse FixedString(36) returns 36 NUL bytes on a LEFT JOIN miss. NUL is
not whitespace, so the string passes the isBlank() guard and then fails in
UUID.fromString(). Use trim() + isEmpty() instead, since trim() strips all
chars <= '\u0020' (space), including '\0'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
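The FixedString pitfall this commit fixes is easy to reproduce with plain Java string semantics, no ClickHouse required. The class below is purely illustrative:

```java
// A ClickHouse FixedString(36) LEFT JOIN miss yields 36 NUL bytes.
// '\0' is not whitespace, so isBlank() returns false and the value
// would reach UUID.fromString() and throw. trim() removes all chars
// <= '\u0020' (space), including NUL, so trim().isEmpty() catches it.
class NulByteCheck {
    static boolean passesIsBlankGuard(String s) {
        return !s.isBlank(); // true means the guard lets the value through
    }

    static boolean caughtByTrimIsEmpty(String s) {
        return s.trim().isEmpty();
    }
}
```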

@thiagohora thiagohora left a comment


Overall direction looks right — merging version-level evaluators in the service layer after the ClickHouse query (rather than cross-store joining) is the sensible call, and the merge order matches TestSuiteAssertionSampler (verified at lines 213–217).

Comment on lines +1350 to +1352
for (UUID versionId : versionIds) {
try {
DatasetVersion version = versionService.getVersionById(workspaceId,

N+1 query — biggest concern in this PR.

This loop calls versionService.getVersionById(...) once per version, and each call opens its own inTransaction(READ_ONLY, ...). On the compare endpoint, experimentIds is user-driven (selecting many experiments to compare), so we can easily issue 5–10+ sequential MySQL round-trips per page request.

DatasetVersionService already exposes a batch API designed exactly for this:

List<DatasetVersion> findByIds(Collection<UUID> versionIds, String workspaceId); // line 160

Suggested replacement:

return Mono.fromCallable(() -> versionService.findByIds(versionIds, workspaceId))
        .subscribeOn(Schedulers.boundedElastic())
        .map(versions -> versions.stream()
                .filter(v -> CollectionUtils.isNotEmpty(v.evaluators()))
                .collect(Collectors.toMap(DatasetVersion::id, DatasetVersion::evaluators)));

Versions absent from the response are simply skipped — same behavior as the current NotFoundException catch, no warning needed because missing-after-delete is expected here.
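The shape of the suggested batch lookup can be demonstrated with the standard library alone. Here Version is a mocked stand-in for DatasetVersion and the fetched list stands in for the findByIds result; the point is that empty-evaluator versions are filtered out and absent versions simply never enter the map:

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

// Sketch of the batch approach: one fetch for all versionIds, then a
// single stream collect. Versions with no evaluators are filtered out;
// versions missing from the response are skipped implicitly.
class BatchLookupSketch {

    record Version(UUID id, List<String> evaluators) {}

    static Map<UUID, List<String>> evaluatorsByVersion(List<Version> fetched) {
        return fetched.stream()
                .filter(v -> v.evaluators() != null && !v.evaluators().isEmpty())
                .collect(Collectors.toMap(Version::id, Version::evaluators));
    }
}
```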

return versionDao.getItemsWithExperimentItems(criteria, page, size, resolvedFallbackVersionId)
.defaultIfEmpty(DatasetItemPage.empty(page, sortingFactory.getSortableFields()));
// Build a per-version evaluator map so each item gets its own version's evaluators.
Mono<Map<UUID, List<EvaluatorItem>>> evaluatorsByVersionMono = experimentDAO.getByIds(criteria.experimentIds())

ExperimentDAO.getByIds is heavier than what's needed here.

getByIds runs the full experiment aggregation pipeline (getAggregationBranchCounts, mapToDto, all the joined data) just to read each experiment's datasetVersionId. Since the compare endpoint is interactive, the extra cost shows up on every paginated read.

There's already a near-perfect lighter alternative — ExperimentDAO.getExecutionPoliciesByIds (line 2270) returns ExperimentPolicyInfo(experimentId, policy, datasetVersionId). Two options:

  1. Cheapest: reuse getExecutionPoliciesByIds and ignore the policy field — zero new code.
  2. Add a tiny dedicated method getDatasetVersionIdsByIds(Set<UUID>) if reusing the policy method feels semantically wrong.

Either way, the win is avoiding the aggregation query just to extract one column.

versionId, criteria.datasetId());
}
}
return Map.copyOf(map);

Nit: Map.copyOf(map) is unnecessary here. The map is local to the callable, isn't shared after return, and nothing mutates it later — returning map directly is fine. (If you switch to the findByIds batch approach above, this disappears anyway.)

@JsonView({
DatasetItem.View.Public.class}) @Schema(accessMode = Schema.AccessMode.READ_ONLY) String lastUpdatedBy){
DatasetItem.View.Public.class}) @Schema(accessMode = Schema.AccessMode.READ_ONLY) String lastUpdatedBy,
@JsonIgnore UUID datasetVersionId){

Internal control field on a public API record.

Adding datasetVersionId to the DatasetItem record (even with @JsonIgnore) has two side effects worth being deliberate about:

  1. Records auto-generate equals/hashCode/toString from all components, so any test or dedup logic that compares DatasetItem instances now silently uses an additional discriminator. Worth grepping callers to confirm none rely on equality.
  2. It leaks an internal carrier field onto the public DTO. The OpenAPI/Swagger generator may still pick this up — consider also adding @Schema(hidden = true) to suppress it from the spec.

Per the project's "comment the why, not the what" rule, a one-line comment on the field stating why it lives here (carrier for service-side per-version evaluator merge) would help future readers. Cleaner alternatives (separate internal carrier type, an attribute map on a wrapper) are likely overkill — but please call the choice out explicitly.
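The equals/hashCode point is standard record behavior and easy to see in isolation. MiniItem below is a hypothetical two-field stand-in for DatasetItem, not the real type:

```java
import java.util.UUID;

// Records derive equals/hashCode from ALL components, so adding
// datasetVersionId makes it a discriminator: two items that were
// previously equal stop being equal once their version ids differ.
class RecordEqualityDemo {
    record MiniItem(UUID id, UUID datasetVersionId) {}
}
```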


@thiagohora thiagohora left a comment


Follow-up: posting one more comment on the broader anti-pattern in the Mono.fromCallable block — the issues there compound (sequential blocking, exception-as-control-flow, mutable accumulator, defensive Map.copyOf) and are best fixed as a single cleanup rather than separately.

Comment on lines +1348 to +1363
return Mono.fromCallable(() -> {
var map = new HashMap<UUID, List<EvaluatorItem>>();
for (UUID versionId : versionIds) {
try {
DatasetVersion version = versionService.getVersionById(workspaceId,
criteria.datasetId(), versionId);
if (CollectionUtils.isNotEmpty(version.evaluators())) {
map.put(versionId, version.evaluators());
}
} catch (NotFoundException e) {
log.warn("Version '{}' not found for dataset '{}'",
versionId, criteria.datasetId());
}
}
return Map.copyOf(map);
}).subscribeOn(Schedulers.boundedElastic());

Stepping back — this whole Mono.fromCallable block is doing several things that, taken together, count as a reactive anti-pattern:

  1. Sequential blocking inside fromCallable. Each getVersionById is a blocking JDBI round-trip, and we run them in a for loop on a single boundedElastic thread. That thread is parked for the entire chain of N queries — exactly what boundedElastic is meant to prevent. The N+1 cost compounds with thread-occupancy cost.
  2. Exceptions used for control flow. NotFoundException is the expected outcome when a version was deleted (test suite removed). Catching + warn-logging per iteration is noisy and wasteful; in a batch result, "missing" is just "not in the returned list."
  3. Mutable HashMap accumulator in a reactive lambda. Works because it's local, but indicates we're imperatively bolting a loop into the reactive pipeline rather than expressing it as data flow.
  4. Map.copyOf on a local map — already flagged separately, but it's of a piece with the broader pattern: ceremony around a structure that shouldn't exist in the first place.

All four collapse if we replace the loop with versionService.findByIds(versionIds, workspaceId) (or, even better, a Flux<DatasetVersion> reactive variant if you want to stay end-to-end reactive). The whole block becomes:

return Mono.fromCallable(() -> versionService.findByIds(versionIds, workspaceId))
        .subscribeOn(Schedulers.boundedElastic())
        .map(versions -> versions.stream()
                .filter(v -> CollectionUtils.isNotEmpty(v.evaluators()))
                .collect(Collectors.toMap(DatasetVersion::id, DatasetVersion::evaluators)));

— one blocking unit, no try/catch, no mutable accumulator, no Map.copyOf. Worth doing as one cleanup rather than four separate fixes.
