[OPIK-6218] [BE] fix: compare endpoint missing version-level evaluators#6487

Open
itamargolan wants to merge 4 commits into main from itamar/OPIK-6218-compare-endpoint-missing-version-evaluators

Conversation

Contributor

@itamargolan itamargolan commented Apr 24, 2026

Details

The compare endpoint (GET /api/v1/datasets/{id}/items/experiments/items) only returned item-level evaluators from ClickHouse dataset_item_versions, never version-level evaluators from MySQL. This caused dataset items with no item-level assertions to show "Skipped - No assertions defined" in the UI, even when the dataset version had version-level assertions that applied to all items.

The fix builds a per-version evaluator map by looking up each experiment's dataset_version_id, fetching the corresponding version evaluators from MySQL, and merging them into each item — version-level first, then item-level — matching TestSuiteAssertionSampler behavior. Items from different dataset versions each get their own version's evaluators.

  • Added ExperimentDAO dependency to DatasetItemServiceImpl to resolve experiment → dataset version mapping
  • Flowed dataset_version_id through ClickHouse query, result mapper, and DatasetItem DTO (as @JsonIgnore) so each item knows which version it belongs to
  • Renamed versionId → legacyFallbackVersionId in the DAO interface to clarify its purpose (only used for pre-versioning experiments)
  • Demoted DAO-level log from INFO to DEBUG (service already logs at INFO)
  • Wrapped blocking MySQL call in Mono.fromCallable + subscribeOn(Schedulers.boundedElastic())
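The per-version merge described above can be sketched as follows. This is an illustrative stand-in, not the real service code: evaluators are modeled as plain strings and the method names are hypothetical; only the merge ordering (version-level first, then item-level) and the per-item routing by dataset_version_id reflect the PR's behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Minimal sketch of the per-version evaluator merge; types are
// simplified stand-ins for the real DatasetItem/EvaluatorItem types.
class EvaluatorMergeSketch {

    // Merge version-level evaluators first, then item-level ones,
    // mirroring the ordering used by TestSuiteAssertionSampler.
    static List<String> merge(List<String> versionLevel, List<String> itemLevel) {
        var merged = new ArrayList<String>();
        merged.addAll(versionLevel);
        merged.addAll(itemLevel);
        return merged;
    }

    // Each item is routed to its own version's evaluators via the
    // dataset_version_id it carries; unknown versions fall back to
    // item-level evaluators only.
    static List<String> evaluatorsFor(UUID itemVersionId,
            Map<UUID, List<String>> byVersion, List<String> itemLevel) {
        var versionLevel = byVersion.getOrDefault(itemVersionId, List.of());
        return merge(versionLevel, itemLevel);
    }
}
```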

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-6218

AI-WATERMARK

AI-WATERMARK: yes

  • Tools: Claude Code (CLI)
  • Model(s): Claude Opus 4.6
  • Scope: implementation, tests
  • Human verification: code reviewed, tests verified, UI tested locally

Testing

  • mvn test -Dtest="DatasetItemServiceApplyVersionEvaluatorsTest" — unit tests covering both applyVersionEvaluators (parameterized empty/null/empty-list variants, merge ordering, multiple version evaluators, field preservation) and applyVersionEvaluatorsToPage (per-version routing, unknown version skipping, null version skipping, empty map passthrough)
  • mvn test -Dtest="DatasetItemServiceFilterDataTest" — 8 existing tests pass with updated constructor
  • mvn spotless:check — formatting clean
  • Local UI testing via scripts/dev-runner.sh --restart:
    • Playground: run experiment on test suite v3 with version-level assertions → all items show evaluator results
    • Compare experiments page: items display correct evaluators

Documentation

No documentation changes needed — this is a backend bug fix with no API contract changes.

The compare endpoint only returned item-level evaluators from ClickHouse,
never version-level evaluators from MySQL. This caused items with no
item-level assertions to show "Skipped" even when the version had assertions.

Now fetches the experiment's actual dataset version evaluators and merges
them into each item (version-level first, then item-level), matching the
experiment runner behavior in TestSuiteAssertionSampler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@itamargolan itamargolan requested a review from a team as a code owner April 24, 2026 18:45
@github-actions bot added the java, Backend, and tests labels Apr 24, 2026
Comment thread apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemService.java Outdated
Comment thread apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemService.java Outdated
itamargolan and others added 2 commits April 24, 2026 16:00
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Build per-version evaluator map instead of picking arbitrary experiment
- Each item gets its own version's evaluators merged with item-level ones
- Flow dataset_version_id through ClickHouse query → mapper → DTO
- Demote DAO INFO log to DEBUG (service already logs at INFO)
- Collapse 3 empty-evaluator tests into @ParameterizedTest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Apr 27, 2026

Backend Tests - Integration Group 1

414 tests   414 ✅  15m 57s ⏱️
 23 suites    0 💤
 23 files      0 ❌

Results for commit 8213f37.


…apper

ClickHouse FixedString(36) returns 36 NUL bytes on a LEFT JOIN miss. NUL is
not whitespace, so the string passes the isBlank() guard and then fails in
UUID.fromString(). Use trim() + isEmpty() instead, since trim() strips all
chars <= '\u0020' (space), including '\0'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
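The FixedString pitfall this commit fixes is easy to reproduce with plain Java string semantics, no ClickHouse required. The class below is purely illustrative:

```java
// A ClickHouse FixedString(36) LEFT JOIN miss yields 36 NUL bytes.
// '\0' is not whitespace, so isBlank() returns false and the value
// would reach UUID.fromString() and throw. trim() removes all chars
// <= '\u0020' (space), including NUL, so trim().isEmpty() catches it.
class NulByteCheck {
    static boolean passesIsBlankGuard(String s) {
        return !s.isBlank(); // true means the guard lets the value through
    }

    static boolean caughtByTrimIsEmpty(String s) {
        return s.trim().isEmpty();
    }
}
```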

@thiagohora thiagohora left a comment


Overall direction looks right — merging version-level evaluators in the service layer after the ClickHouse query (rather than cross-store joining) is the sensible call, and the merge order matches TestSuiteAssertionSampler (verified at lines 213–217).

Comment on lines +1350 to +1352
for (UUID versionId : versionIds) {
try {
DatasetVersion version = versionService.getVersionById(workspaceId,

N+1 query — biggest concern in this PR.

This loop calls versionService.getVersionById(...) once per version, and each call opens its own inTransaction(READ_ONLY, ...). On the compare endpoint, experimentIds is user-driven (selecting many experiments to compare), so we can easily issue 5–10+ sequential MySQL round-trips per page request.

DatasetVersionService already exposes a batch API designed exactly for this:

List<DatasetVersion> findByIds(Collection<UUID> versionIds, String workspaceId); // line 160

Suggested replacement:

return Mono.fromCallable(() -> versionService.findByIds(versionIds, workspaceId))
        .subscribeOn(Schedulers.boundedElastic())
        .map(versions -> versions.stream()
                .filter(v -> CollectionUtils.isNotEmpty(v.evaluators()))
                .collect(Collectors.toMap(DatasetVersion::id, DatasetVersion::evaluators)));

Versions absent from the response are simply skipped — same behavior as the current NotFoundException catch, no warning needed because missing-after-delete is expected here.
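The shape of the suggested batch lookup can be demonstrated with the standard library alone. Here Version is a mocked stand-in for DatasetVersion and the fetched list stands in for the findByIds result; the point is that empty-evaluator versions are filtered out and absent versions simply never enter the map:

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

// Sketch of the batch approach: one fetch for all versionIds, then a
// single stream collect. Versions with no evaluators are filtered out;
// versions missing from the response are skipped implicitly.
class BatchLookupSketch {

    record Version(UUID id, List<String> evaluators) {}

    static Map<UUID, List<String>> evaluatorsByVersion(List<Version> fetched) {
        return fetched.stream()
                .filter(v -> v.evaluators() != null && !v.evaluators().isEmpty())
                .collect(Collectors.toMap(Version::id, Version::evaluators));
    }
}
```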

return versionDao.getItemsWithExperimentItems(criteria, page, size, resolvedFallbackVersionId)
.defaultIfEmpty(DatasetItemPage.empty(page, sortingFactory.getSortableFields()));
// Build a per-version evaluator map so each item gets its own version's evaluators.
Mono<Map<UUID, List<EvaluatorItem>>> evaluatorsByVersionMono = experimentDAO.getByIds(criteria.experimentIds())

ExperimentDAO.getByIds is heavier than what's needed here.

getByIds runs the full experiment aggregation pipeline (getAggregationBranchCounts, mapToDto, all the joined data) just to read each experiment's datasetVersionId. Since the compare endpoint is interactive, the extra cost shows up on every paginated read.

There's already a near-perfect lighter alternative — ExperimentDAO.getExecutionPoliciesByIds (line 2270) returns ExperimentPolicyInfo(experimentId, policy, datasetVersionId). Two options:

  1. Cheapest: reuse getExecutionPoliciesByIds and ignore the policy field — zero new code.
  2. Add a tiny dedicated method getDatasetVersionIdsByIds(Set<UUID>) if reusing the policy method feels semantically wrong.

Either way, the win is avoiding the aggregation query just to extract one column.

versionId, criteria.datasetId());
}
}
return Map.copyOf(map);

Nit: Map.copyOf(map) is unnecessary here. The map is local to the callable, isn't shared after return, and nothing mutates it later — returning map directly is fine. (If you switch to the findByIds batch approach above, this disappears anyway.)

@JsonView({
DatasetItem.View.Public.class}) @Schema(accessMode = Schema.AccessMode.READ_ONLY) String lastUpdatedBy){
DatasetItem.View.Public.class}) @Schema(accessMode = Schema.AccessMode.READ_ONLY) String lastUpdatedBy,
@JsonIgnore UUID datasetVersionId){

Internal control field on a public API record.

Adding datasetVersionId to the DatasetItem record (even with @JsonIgnore) has two side effects worth being deliberate about:

  1. Records auto-generate equals/hashCode/toString from all components, so any test or dedup logic that compares DatasetItem instances now silently uses an additional discriminator. Worth grepping callers to confirm none rely on equality.
  2. It leaks an internal carrier field onto the public DTO. The OpenAPI/Swagger generator may still pick this up — consider also adding @Schema(hidden = true) to suppress it from the spec.

Per the project's "comment the why, not the what" rule, a one-line comment on the field stating why it lives here (carrier for service-side per-version evaluator merge) would help future readers. Cleaner alternatives (separate internal carrier type, an attribute map on a wrapper) are likely overkill — but please call the choice out explicitly.
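The equals/hashCode point is standard record behavior and easy to see in isolation. MiniItem below is a hypothetical two-field stand-in for DatasetItem, not the real type:

```java
import java.util.UUID;

// Records derive equals/hashCode from ALL components, so adding
// datasetVersionId makes it a discriminator: two items that were
// previously equal stop being equal once their version ids differ.
class RecordEqualityDemo {
    record MiniItem(UUID id, UUID datasetVersionId) {}
}
```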


@thiagohora thiagohora left a comment


Follow-up: posting one more comment on the broader anti-pattern in the Mono.fromCallable block — the issues there compound (sequential blocking, exception-as-control-flow, mutable accumulator, defensive Map.copyOf) and are best fixed as a single cleanup rather than separately.

Comment on lines +1348 to +1363
return Mono.fromCallable(() -> {
var map = new HashMap<UUID, List<EvaluatorItem>>();
for (UUID versionId : versionIds) {
try {
DatasetVersion version = versionService.getVersionById(workspaceId,
criteria.datasetId(), versionId);
if (CollectionUtils.isNotEmpty(version.evaluators())) {
map.put(versionId, version.evaluators());
}
} catch (NotFoundException e) {
log.warn("Version '{}' not found for dataset '{}'",
versionId, criteria.datasetId());
}
}
return Map.copyOf(map);
}).subscribeOn(Schedulers.boundedElastic());

Stepping back — this whole Mono.fromCallable block is doing several things that, taken together, count as a reactive anti-pattern:

  1. Sequential blocking inside fromCallable. Each getVersionById is a blocking JDBI round-trip, and we run them in a for loop on a single boundedElastic thread. That thread is parked for the entire chain of N queries — exactly what boundedElastic is meant to prevent. The N+1 cost compounds with thread-occupancy cost.
  2. Exceptions used for control flow. NotFoundException is the expected outcome when a version was deleted (test suite removed). Catching + warn-logging per iteration is noisy and wasteful; in a batch result, "missing" is just "not in the returned list."
  3. Mutable HashMap accumulator in a reactive lambda. Works because it's local, but indicates we're imperatively bolting a loop into the reactive pipeline rather than expressing it as data flow.
  4. Map.copyOf on a local map — already flagged separately, but it's of a piece with the broader pattern: ceremony around a structure that shouldn't exist in the first place.

All four collapse if we replace the loop with versionService.findByIds(versionIds, workspaceId) (or, even better, a Flux<DatasetVersion> reactive variant if you want to stay end-to-end reactive). The whole block becomes:

return Mono.fromCallable(() -> versionService.findByIds(versionIds, workspaceId))
        .subscribeOn(Schedulers.boundedElastic())
        .map(versions -> versions.stream()
                .filter(v -> CollectionUtils.isNotEmpty(v.evaluators()))
                .collect(Collectors.toMap(DatasetVersion::id, DatasetVersion::evaluators)));

— one blocking unit, no try/catch, no mutable accumulator, no Map.copyOf. Worth doing as one cleanup rather than four separate fixes.
