[OPIK-6218] [BE] fix: compare endpoint missing version-level evaluators #6487
itamargolan wants to merge 4 commits into main from
Conversation
The compare endpoint only returned item-level evaluators from ClickHouse, never version-level evaluators from MySQL. This caused items with no item-level assertions to show "Skipped" even when the version had assertions. Now fetches the experiment's actual dataset version evaluators and merges them into each item (version-level first, then item-level), matching the experiment runner behavior in TestSuiteAssertionSampler. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Build per-version evaluator map instead of picking an arbitrary experiment
- Each item gets its own version's evaluators merged with item-level ones
- Flow dataset_version_id through ClickHouse query → mapper → DTO
- Demote DAO INFO log to DEBUG (service already logs at INFO)
- Collapse 3 empty-evaluator tests into @ParameterizedTest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend Tests - Integration Group 1: 414 tests, 414 ✅, 15m 57s ⏱️ Results for commit 8213f37. ♻️ This comment has been updated with latest results.
…apper

ClickHouse FixedString(36) returns 36 null bytes on a LEFT JOIN miss, which slips past an isBlank() guard ('\0' is not whitespace) but then fails UUID.fromString(). Use trim() + isEmpty() instead, since trim() strips all chars <= ' ' (U+0020), including '\0'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
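The FixedString(36) behavior from this commit message can be checked with a few lines of plain Java (the NUL padding is simulated here; no ClickHouse involved):

```java
public class NullByteUuidDemo {
    public static void main(String[] args) {
        // Simulate what ClickHouse FixedString(36) yields on a LEFT JOIN miss:
        // 36 NUL bytes instead of an empty string.
        String miss = "\0".repeat(36);

        // '\0' is not whitespace, so an isBlank() guard lets the value through...
        System.out.println(miss.isBlank());        // false
        // ...and UUID.fromString(miss) would then throw IllegalArgumentException.

        // trim() strips every char <= ' ' (U+0020) from both ends, including '\0',
        // so trim() + isEmpty() correctly detects the join miss.
        System.out.println(miss.trim().isEmpty()); // true
    }
}
```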
```java
for (UUID versionId : versionIds) {
    try {
        DatasetVersion version = versionService.getVersionById(workspaceId,
```
N+1 query — biggest concern in this PR.
This loop calls versionService.getVersionById(...) once per version, and each call opens its own inTransaction(READ_ONLY, ...). On the compare endpoint, experimentIds is user-driven (selecting many experiments to compare), so we can easily issue 5–10+ sequential MySQL round-trips per page request.
DatasetVersionService already exposes a batch API designed exactly for this:
```java
List<DatasetVersion> findByIds(Collection<UUID> versionIds, String workspaceId); // line 160
```

Suggested replacement:
```java
return Mono.fromCallable(() -> versionService.findByIds(versionIds, workspaceId))
        .subscribeOn(Schedulers.boundedElastic())
        .map(versions -> versions.stream()
                .filter(v -> CollectionUtils.isNotEmpty(v.evaluators()))
                .collect(Collectors.toMap(DatasetVersion::id, DatasetVersion::evaluators)));
```

Versions absent from the response are simply skipped — same behavior as the current NotFoundException catch; no warning is needed because missing-after-delete is expected here.
```java
return versionDao.getItemsWithExperimentItems(criteria, page, size, resolvedFallbackVersionId)
        .defaultIfEmpty(DatasetItemPage.empty(page, sortingFactory.getSortableFields()));
// Build a per-version evaluator map so each item gets its own version's evaluators.
Mono<Map<UUID, List<EvaluatorItem>>> evaluatorsByVersionMono = experimentDAO.getByIds(criteria.experimentIds())
```
ExperimentDAO.getByIds is heavier than what's needed here.
getByIds runs the full experiment aggregation pipeline (getAggregationBranchCounts, mapToDto, all the joined data) just to read each experiment's datasetVersionId. Since the compare endpoint is interactive, the extra cost shows up on every paginated read.
There's already a near-perfect lighter alternative — ExperimentDAO.getExecutionPoliciesByIds (line 2270) returns ExperimentPolicyInfo(experimentId, policy, datasetVersionId). Two options:
- Cheapest: reuse getExecutionPoliciesByIds and ignore the policy field — zero new code.
- Add a tiny dedicated method getDatasetVersionIdsByIds(Set<UUID>) if reusing the policy method feels semantically wrong.
Either way, the win is avoiding the aggregation query just to extract one column.
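A minimal sketch of the cheaper option, using a hypothetical record that mirrors the ExperimentPolicyInfo shape mentioned above (the names and fields here are assumptions for illustration, not the project's actual types):

```java
import java.util.*;
import java.util.stream.*;

public class VersionIdLookupSketch {
    // Hypothetical stand-in for ExperimentDAO's ExperimentPolicyInfo row.
    record ExperimentPolicyInfo(UUID experimentId, String policy, UUID datasetVersionId) {}

    // Reuse the lighter policy query and ignore the policy field entirely;
    // experiments that predate versioning (null datasetVersionId) are dropped.
    static Map<UUID, UUID> datasetVersionIdsByExperiment(List<ExperimentPolicyInfo> rows) {
        return rows.stream()
                .filter(r -> r.datasetVersionId() != null)
                .collect(Collectors.toMap(ExperimentPolicyInfo::experimentId,
                        ExperimentPolicyInfo::datasetVersionId));
    }

    public static void main(String[] args) {
        UUID exp = new UUID(1, 1);
        UUID ver = new UUID(2, 2);
        var rows = List.of(
                new ExperimentPolicyInfo(exp, "ALL", ver),
                new ExperimentPolicyInfo(new UUID(3, 3), "ALL", null)); // pre-versioning
        System.out.println(datasetVersionIdsByExperiment(rows).size()); // 1
    }
}
```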
```java
                versionId, criteria.datasetId());
        }
    }
    return Map.copyOf(map);
```
Nit: Map.copyOf(map) is unnecessary here. The map is local to the callable, isn't shared after return, and nothing mutates it later — returning map directly is fine. (If you switch to the findByIds batch approach above, this disappears anyway.)
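For context on the nit, a tiny sketch of what Map.copyOf actually buys (an unmodifiable defensive copy), which only matters when the map escapes to code that might mutate it:

```java
import java.util.*;

public class CopyOfDemo {
    public static void main(String[] args) {
        var local = new HashMap<String, Integer>();
        local.put("a", 1);

        var copy = Map.copyOf(local); // extra allocation + unmodifiable view
        System.out.println(copy.equals(local)); // true: same entries either way

        // The copy rejects mutation; the local map does not. When the map never
        // escapes the callable, that protection buys nothing.
        try {
            copy.put("b", 2);
        } catch (UnsupportedOperationException e) {
            System.out.println("immutable");
        }
    }
}
```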
```diff
 @JsonView({
-        DatasetItem.View.Public.class}) @Schema(accessMode = Schema.AccessMode.READ_ONLY) String lastUpdatedBy){
+        DatasetItem.View.Public.class}) @Schema(accessMode = Schema.AccessMode.READ_ONLY) String lastUpdatedBy,
+        @JsonIgnore UUID datasetVersionId){
```
Internal control field on a public API record.
Adding datasetVersionId to the DatasetItem record (even with @JsonIgnore) has two side effects worth being deliberate about:
- Records auto-generate equals/hashCode/toString from all components, so any test or dedup logic that compares DatasetItem instances now silently uses an additional discriminator. Worth grepping callers to confirm none rely on equality.
- It leaks an internal carrier field onto the public DTO. The OpenAPI/Swagger generator may still pick this up — consider also adding @Schema(hidden = true) to suppress it from the spec.
Per the project's "comment the why, not the what" rule, a one-line comment on the field stating why it lives here (carrier for service-side per-version evaluator merge) would help future readers. Cleaner alternatives (separate internal carrier type, an attribute map on a wrapper) are likely overkill — but please call the choice out explicitly.
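The equals/hashCode concern can be demonstrated with hypothetical stand-in records (not the actual DatasetItem, which has many more components):

```java
public class RecordEqualityDemo {
    // Before: only public data participates in equality.
    record ItemBefore(String id, String data) {}
    // After: the internal carrier field silently joins equals/hashCode/toString.
    record ItemAfter(String id, String data, java.util.UUID datasetVersionId) {}

    public static void main(String[] args) {
        var v1 = new java.util.UUID(0, 1);
        var v2 = new java.util.UUID(0, 2);

        System.out.println(new ItemBefore("i-1", "x").equals(new ItemBefore("i-1", "x"))); // true

        // Identical public data, different carrier value -> no longer equal.
        System.out.println(new ItemAfter("i-1", "x", v1).equals(new ItemAfter("i-1", "x", v2))); // false
    }
}
```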
thiagohora left a comment
Follow-up: posting one more comment on the broader anti-pattern in the Mono.fromCallable block — the issues there compound (sequential blocking, exception-as-control-flow, mutable accumulator, defensive Map.copyOf) and are best fixed as a single cleanup rather than separately.
```java
return Mono.fromCallable(() -> {
    var map = new HashMap<UUID, List<EvaluatorItem>>();
    for (UUID versionId : versionIds) {
        try {
            DatasetVersion version = versionService.getVersionById(workspaceId,
                    criteria.datasetId(), versionId);
            if (CollectionUtils.isNotEmpty(version.evaluators())) {
                map.put(versionId, version.evaluators());
            }
        } catch (NotFoundException e) {
            log.warn("Version '{}' not found for dataset '{}'",
                    versionId, criteria.datasetId());
        }
    }
    return Map.copyOf(map);
}).subscribeOn(Schedulers.boundedElastic());
```
Stepping back — this whole Mono.fromCallable block is doing several things that, taken together, count as a reactive anti-pattern:
- Sequential blocking inside fromCallable. Each getVersionById is a blocking JDBI round-trip, and we run them in a for loop on a single boundedElastic thread. That thread is parked for the entire chain of N queries — exactly what boundedElastic is meant to prevent. The N+1 cost compounds with thread-occupancy cost.
- Exceptions used for control flow. NotFoundException is the expected outcome when a version was deleted (test suite removed). Catching + warn-logging per iteration is noisy and wasteful; in a batch result, "missing" is just "not in the returned list."
- Mutable HashMap accumulator in a reactive lambda. Works because it's local, but it indicates we're imperatively bolting a loop into the reactive pipeline rather than expressing it as data flow.
- Map.copyOf on a local map — already flagged separately, but it's of a piece with the broader pattern: ceremony around a structure that shouldn't exist in the first place.
All four collapse if we replace the loop with versionService.findByIds(versionIds, workspaceId) (or, even better, a Flux<DatasetVersion> reactive variant if you want to stay end-to-end reactive). The whole block becomes:
```java
return Mono.fromCallable(() -> versionService.findByIds(versionIds, workspaceId))
        .subscribeOn(Schedulers.boundedElastic())
        .map(versions -> versions.stream()
                .filter(v -> CollectionUtils.isNotEmpty(v.evaluators()))
                .collect(Collectors.toMap(DatasetVersion::id, DatasetVersion::evaluators)));
```

— one blocking unit, no try/catch, no mutable accumulator, no Map.copyOf. Worth doing as one cleanup rather than four separate fixes.
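The stream collection in that suggested replacement can be exercised in isolation with stand-in types (hypothetical minimal shapes; the real DatasetVersion and EvaluatorItem carry more fields):

```java
import java.util.*;
import java.util.stream.*;

public class EvaluatorsByVersionSketch {
    record EvaluatorItem(String name) {}
    record DatasetVersion(UUID id, List<EvaluatorItem> evaluators) {}

    // Same shape as the suggested .map(...) step: one batch result in, filtered
    // and collected into a map, with no try/catch and no mutable accumulator.
    static Map<UUID, List<EvaluatorItem>> evaluatorsByVersion(List<DatasetVersion> versions) {
        return versions.stream()
                .filter(v -> v.evaluators() != null && !v.evaluators().isEmpty())
                .collect(Collectors.toMap(DatasetVersion::id, DatasetVersion::evaluators));
    }

    public static void main(String[] args) {
        UUID withEvals = new UUID(0, 1);
        UUID withoutEvals = new UUID(0, 2);
        var versions = List.of(
                new DatasetVersion(withEvals, List.of(new EvaluatorItem("contains"))),
                new DatasetVersion(withoutEvals, List.of()));
        var map = evaluatorsByVersion(versions);
        System.out.println(map.containsKey(withEvals));    // true
        System.out.println(map.containsKey(withoutEvals)); // false: empty list filtered out
        // A deleted version simply never appears in `versions`, so no catch is needed.
    }
}
```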
Details
The compare endpoint (GET /api/v1/datasets/{id}/items/experiments/items) only returned item-level evaluators from the ClickHouse dataset_item_versions table, never version-level evaluators from MySQL. This caused dataset items with no item-level assertions to show "Skipped - No assertions defined" in the UI, even when the dataset version had version-level assertions that applied to all items.

The fix builds a per-version evaluator map by looking up each experiment's dataset_version_id, fetching the corresponding version evaluators from MySQL, and merging them into each item — version-level first, then item-level — matching TestSuiteAssertionSampler behavior. Items from different dataset versions each get their own version's evaluators.

- Adds an ExperimentDAO dependency to DatasetItemServiceImpl to resolve the experiment → dataset version mapping
- Flows dataset_version_id through the ClickHouse query, result mapper, and DatasetItem DTO (as @JsonIgnore) so each item knows which version it belongs to
- Renames versionId → legacyFallbackVersionId in the DAO interface to clarify its purpose (only used for pre-versioning experiments)
- Wraps the blocking MySQL lookup in Mono.fromCallable + subscribeOn(Schedulers.boundedElastic())

Change checklist
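The merge ordering described in the fix (version-level first, then item-level) can be sketched with plain lists; this is a hypothetical helper for illustration, not the project's actual merge code:

```java
import java.util.*;

public class EvaluatorMergeSketch {
    // Version-level evaluators come first, then item-level ones, mirroring the
    // ordering the PR attributes to TestSuiteAssertionSampler.
    static List<String> mergeEvaluators(List<String> versionLevel, List<String> itemLevel) {
        var merged = new ArrayList<String>(versionLevel);
        merged.addAll(itemLevel);
        return List.copyOf(merged);
    }

    public static void main(String[] args) {
        var merged = mergeEvaluators(List.of("v-contains", "v-regex"), List.of("i-equals"));
        System.out.println(merged); // [v-contains, v-regex, i-equals]

        // Items with no item-level assertions still get the version-level ones,
        // fixing the "Skipped" display described in the PR.
        System.out.println(mergeEvaluators(List.of("v-contains"), List.of())); // [v-contains]
    }
}
```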
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
- mvn test -Dtest="DatasetItemServiceApplyVersionEvaluatorsTest" — unit tests covering both applyVersionEvaluators (parameterized empty/null/empty-list variants, merge ordering, multiple version evaluators, field preservation) and applyVersionEvaluatorsToPage (per-version routing, unknown version skipping, null version skipping, empty map passthrough)
- mvn test -Dtest="DatasetItemServiceFilterDataTest" — 8 existing tests pass with the updated constructor
- mvn spotless:check — formatting clean
- scripts/dev-runner.sh --restart:
No documentation changes needed — this is a backend bug fix with no API contract changes.