[OPIK-4518] [BE] Restore stable dataset item IDs in versioned API#5370
Conversation
Backend Tests Results: 439 files, 439 suites, 1h 6m 52s ⏱️ (commit 16c7883).
Force-pushed 387959d to 7dceef3.
Force-pushed 7dceef3 to 3611291.
Force-pushed 3611291 to d168de1.
Force-pushed d168de1 to 16c7883.
Force-pushed 16c7883 to 211832c.
Backend Tests results for commit b1c2367:

- Unit Tests: 1 433 tests, 1 431 ✅, 54s ⏱️
- Integration Group 5: 110 tests, 110 ✅, 2m 19s ⏱️
- Integration Group 11: 159 tests, 157 ✅, 2m 42s ⏱️
- Integration Group 7: 244 tests, 244 ✅, 1m 59s ⏱️
- Integration Group 16: 18 files, 18 suites, 2m 13s ⏱️
- Integration Group 15: 31 files, 31 suites, 3m 2s ⏱️
- Integration Group 8: 288 tests, 288 ✅, 4m 16s ⏱️
- Integration Group 12: 186 tests, 185 ✅, 3m 18s ⏱️
- Integration Group 6: 1 129 tests, 1 129 ✅, 6m 41s ⏱️
- Integration Group 10: 22 files, 22 suites, 6m 50s ⏱️
- Integration Group 14: 185 tests, 185 ✅, 1m 41s ⏱️
The dual-join approach handles both old (physical row ID) and new (stable dataset_item_id) experiment items at query time, making the manual migration unnecessary. Co-Authored-By: Claude Opus 4.6 <[email protected]>
All queries joining experiment_items.dataset_item_id with dataset_item_versions now match on both physical row id AND stable dataset_item_id, so old experiment items (storing row IDs) and new ones (storing stable IDs) both resolve correctly without requiring a data migration. Co-Authored-By: Claude Opus 4.6 <[email protected]>
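As a hedged illustration of the dual-match resolution these commits describe (names and data are hypothetical, not the PR's actual code), the idea can be modeled in plain Python: a version row matches an experiment item's stored reference if the reference equals either the physical row id or the stable dataset_item_id, so both old and new experiment items resolve to the same stable id.

```python
# Hypothetical in-memory model of the dual-join condition: version rows carry
# both a physical row id ("id") and a stable "dataset_item_id"; experiment
# items may have stored either one.

def resolve_stable_id(stored_ref, version_rows):
    """Return the stable dataset_item_id for an experiment item reference.

    Matches on the physical row id (old items) or the stable id (new items),
    mirroring the dual-join described in the commits.
    """
    for row in version_rows:
        if stored_ref in (row["id"], row["dataset_item_id"]):
            return row["dataset_item_id"]
    return None

rows = [
    {"id": "row-v1", "dataset_item_id": "stable-1"},
    {"id": "row-v2", "dataset_item_id": "stable-1"},  # same item, new version
]

# An old experiment item stored a physical row id; a new one stores the
# stable id. Both resolve to the same stable identifier.
assert resolve_stable_id("row-v1", rows) == "stable-1"
assert resolve_stable_id("stable-1", rows) == "stable-1"
```

Because both match arms land on the same stable id, no data migration of existing experiment items is needed, which is exactly the claim made above.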
Avoid collision with main's 000065-000069 added since last rebase. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Main now has 000070-000072, so bump to next available. Co-Authored-By: Claude Opus 4.6 <[email protected]>
ClickHouse FixedString(36) columns return null bytes (not SQL NULL) on LEFT JOIN miss, so COALESCE never falls through. Use if(div.id != '') to check whether the join actually matched. Co-Authored-By: Claude Opus 4.6 <[email protected]>
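The FixedString pitfall can be sketched outside ClickHouse (a simplified model of the behavior, not the actual driver or server code): a LEFT JOIN miss on a FixedString(36) column surfaces as 36 null bytes rather than SQL NULL, so a NULL-based fallback never triggers, while an emptiness check does.

```python
FIXED_STRING_LEN = 36

# A LEFT JOIN miss on FixedString(36) yields null-byte padding, not SQL NULL.
JOIN_MISS = "\x00" * FIXED_STRING_LEN

def coalesce(value, fallback):
    # SQL COALESCE falls through only on NULL; 36 null bytes are "not NULL",
    # so the fallback never fires.
    return value if value is not None else fallback

def if_nonempty(value, fallback):
    # Models if(div.id != '', div.id, fallback): treat an all-null-byte
    # FixedString as empty, so the fallback actually fires on a join miss.
    return value if value.rstrip("\x00") != "" else fallback

assert coalesce(JOIN_MISS, "row-id") == JOIN_MISS    # wrong: padding leaks out
assert if_nonempty(JOIN_MISS, "row-id") == "row-id"  # fallback fires correctly
```

The model only captures the semantics the commit relies on: the join-miss sentinel is a non-NULL value, so the branch condition must compare against emptiness rather than NULL.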
Defer the dataset_item_id projection to a follow-up. The skip indexes (000073) provide sufficient coverage for now. The projection can be added later if large-version performance requires it. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Consolidate OR-expanded filter conditions that referenced the same CTE twice into single arrayJoin([col1, col2]) calls. Each CTE is now evaluated exactly once per usage site. Co-Authored-By: Claude Opus 4.6 <[email protected]>
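A small Python sketch (hypothetical names, not the PR's SQL) of the equivalence behind that consolidation: an OR over two columns against the same key set can be rewritten as one membership condition over the flattened candidates, which is what `arrayJoin([col1, col2])` expresses in ClickHouse with a single CTE reference.

```python
def matches_or(row, keys):
    # OR-expanded form: two separate membership tests against the same set,
    # which in SQL meant referencing the CTE twice.
    return row["id"] in keys or row["dataset_item_id"] in keys

def matches_flattened(row, keys):
    # arrayJoin([id, dataset_item_id]) analogue: flatten both columns into
    # candidate keys, then apply a single membership condition.
    return not keys.isdisjoint({row["id"], row["dataset_item_id"]})

keys = {"stable-1"}
sample_rows = [
    {"id": "row-v1", "dataset_item_id": "stable-1"},  # matches via stable id
    {"id": "stable-1", "dataset_item_id": "row-x"},   # matches via row id
    {"id": "row-v9", "dataset_item_id": "stable-9"},  # no match
]
for row in sample_rows:
    assert matches_or(row, keys) == matches_flattened(row, keys)
```

The two predicates agree on every row; the rewrite only changes how many times the key set (the CTE, in the real queries) is referenced.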
Use idx_{table}_{column} pattern: idx_dataset_item_versions_dataset_item_id_bf
and idx_dataset_item_versions_dataset_item_id_minmax.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Revert to main's simpler query that resolves trace_ids directly from experiment_items without joining dataset_item_versions. The dual-join treatment is unnecessary here — output column discovery only needs trace_ids, not dataset item mapping. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add LIMIT 1 BY dataset_item_id to dataset_items_agg_resolved, dataset_items_aggr_resolved, and ExperimentAggregatesDAO's dataset_item_versions_resolved CTEs. Without this, the OR-condition joins could match one experiment item to multiple dataset item rows from different versions, inflating groupArray results.
- Remove orphaned Javadoc from deleted validateMappingsBelongToSameDataset.
- Rename deletedRowIds to deletedIds (now holds stable IDs, not row IDs).
Co-Authored-By: Claude Opus 4.6 <[email protected]>
The dataset_item_filters in item_agg selected only physical id, missing new experiment items that store stable dataset_item_id. Use arrayJoin([id, dataset_item_id]) and add two-level dedup. Co-Authored-By: Claude Opus 4.6 <[email protected]>
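A hedged Python model of the two-level dedup mentioned here and in the LIMIT 1 BY commit above (an illustration of the semantics, not the actual SQL): the inner pass keeps one row per physical row id, loosely mimicking ReplacingMergeTree plus `LIMIT 1 BY id`, and the outer pass keeps one row per stable dataset_item_id across versions.

```python
def dedup_last_by(rows, key):
    """Keep one row per key value; the last occurrence wins, loosely
    mimicking a replacement keeping the newest row for a given key."""
    kept = {}
    for row in rows:
        kept[row[key]] = row  # later rows overwrite earlier ones
    return list(kept.values())

rows = [
    {"id": "row-v1", "dataset_item_id": "stable-1", "data": "old"},
    {"id": "row-v1", "dataset_item_id": "stable-1", "data": "new"},  # replacement
    {"id": "row-v2", "dataset_item_id": "stable-1", "data": "new"},  # later version
]

# Inner pass: one row per physical row id (LIMIT 1 BY id analogue).
by_row = dedup_last_by(rows, "id")
# Outer pass: one row per stable id across versions
# (LIMIT 1 BY dataset_item_id analogue).
by_item = dedup_last_by(by_row, "dataset_item_id")

assert len(by_item) == 1
assert by_item[0]["dataset_item_id"] == "stable-1"
```

Without the outer pass, the same logical item would survive once per version and inflate any downstream aggregation, which is the groupArray problem the commit describes.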
Main added 000073_add_minmax_index_trace_threads_last_updated_at. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…S row_id All aggregated CTEs (dataset_items_agg_resolved, dataset_items_aggr_resolved, dataset_item_versions_resolved) now use the same convention as dataset_items_resolved: dataset_item_id AS id (stable), id AS row_id (physical). Updated all join conditions, arrayJoin filters, and GROUP BY references. Co-Authored-By: Claude Opus 4.6 <[email protected]>
… update Verify that delete and batch update requests with item IDs spanning multiple datasets (without explicit datasetId) return 400. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Force-pushed 97393f0 to 47b068f.
thiagohora left a comment:
Two things worth addressing before merging.
🔄 Test environment deployment process has started. Phase 1: Deploying base version.
✅ Test environment is now available! To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml). The deployment has completed successfully and the version has been verified.
🌙 Nightly cleanup: The test environment for this PR (
Details
Makes the `id` field in the DatasetItem API return the stable `dataset_item_id` instead of the version-specific physical ClickHouse row `id`. Previously, `id` changed every time an item was included in a new dataset version, which was confusing: users expect `id` to be a stable identifier. Now `id` remains the same across versions, and the deprecated `dataset_item_id` field always equals `id`.

Key changes:

- `dataset_item_id AS id`, so the stable ID is returned as the API's `id` field
- `arrayJoin([id, row_id])` pattern avoids duplicate CTE evaluation in ClickHouse
- Two-level dedup: `LIMIT 1 BY id` for ReplacingMergeTree, outer `LIMIT 1 BY dataset_item_id` for cross-version
- Removes the `mapRowIdsToDatasetItemIds` indirection layer, simplifying the service layer significantly (-269 lines net)
- `if(div.id != '', ...)` instead of `COALESCE`, due to ClickHouse FixedString NULL behavior on LEFT JOINs
- Skip indexes on `dataset_item_id` for queries that lack the full primary key prefix
- `resolveDatasetIdsFromItemIds` with explicit multi-dataset validation

Change checklist
Issues
AI-WATERMARK: yes
Testing
- `mvn test -Dtest="DatasetVersionResourceTest"`: all tests pass, including:
  - the new stable-ID test (`putItems__whenSameItemInMultipleVersions__thenStableIdsAcrossVersions`)
  - the multi-dataset validation tests (`deleteItems__whenItemIdsSpanMultipleDatasets__thenReturn400`, `batchUpdate__whenItemIdsSpanMultipleDatasets__thenReturn400`)
  - the `datasetItemId == id` invariant assertion
- `mvn test -Dtest="TracesResourceTest$ExperimentItemReferenceTest"`: all 5 tests pass (experiment reference resolution via TraceDAO)
- `mvn test -Dtest="ExperimentsResourceTest$InsertExperimentItems"`: workspace validation test passes
- Full `DatasetsResourceTest` suite passes (experiment comparison view, stats, output columns, sorting, filtering)

Documentation
API schema descriptions updated in-code for `DatasetItem.id`, `DatasetItem.datasetItemId` (deprecated), `DatasetItemEdit.id`, and the `DatasetsResource` operation description. No external documentation changes needed.