[fix](variant) fix variant stats merge #61090
Conversation
introduced by apache#60847
run buildall
Pull request overview
Fixes incorrect/duplicated merging of VARIANT column statistics (sparse/doc-value maps) during segment read/init, which was introduced by #60847’s NestedGroup/variant meta changes.
Changes:
- Update `VariantStatistics` to support additive merging of sparse and doc-value stats from per-bucket column metas.
- Adjust `VariantColumnReader::init()` to source sparse/doc stats from their respective column metas (and only read scalar flags from root stats).
- Add a new regression suite/output for typed-paths-to-sparse with `default_variant_sparse_hash_shard_count = 1` and compaction verification.
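The additive merge in the first bullet can be pictured with a minimal standalone sketch. This is not the actual Doris class: the struct shape, member names, and the `merge_from` helper are assumptions standing in for `VariantStatistics` in `variant_statistics.h`. The key point is accumulating per-bucket counts instead of overwriting the whole map from a PB message.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Simplified, hypothetical stand-in for segment_v2 VariantStatistics.
struct VariantStatistics {
    std::map<std::string, int64_t> sparse_column_non_null_size;
    std::map<std::string, int64_t> doc_value_non_null_size;

    // Additive merge: accumulate counts from another (per-bucket) stats
    // object; an overwrite here would drop counts from earlier buckets.
    void merge_from(const VariantStatistics& other) {
        for (const auto& [path, count] : other.sparse_column_non_null_size) {
            sparse_column_non_null_size[path] += count;
        }
        for (const auto& [path, count] : other.doc_value_non_null_size) {
            doc_value_non_null_size[path] += count;
        }
    }
};
```

With merging, two buckets that both saw path `v.a` contribute the sum of their counts, which is what makes the multi-shard (`default_variant_sparse_hash_shard_count > 1`) and single-shard cases agree.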
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/variant_p0/predefine/predefined_typed_to_sparse_1shard.groovy | New regression suite covering typed-to-sparse behavior with 1 shard and post-compaction checks. |
| regression-test/data/variant_p0/predefine/predefined_typed_to_sparse_1shard.out | Golden output for the new regression suite queries (pre/post compaction). |
| be/src/vec/common/variant_util.cpp | Skips an output-schema extracted-column invariant check under BE_TEST. |
| be/src/olap/rowset/segment_v2/variant/variant_statistics.h | Replaces full overwrite-from-PB with additive merge helpers for sparse/doc-value maps. |
| be/src/olap/rowset/segment_v2/variant/variant_column_reader.cpp | Changes stats initialization to merge from sparse/doc metas and avoid root-map duplication. |
```diff
@@ -958,6 +959,7 @@ Status VariantCompactionUtil::check_path_stats(const std::vector<RowsetSharedPtr
                 column->name());
         }
     }
+#endif
```
Wrapping the output-rowset extracted-column invariant check in #ifndef BE_TEST makes unit-test builds skip a correctness validation that still runs in production builds. This can mask real regressions (e.g. new extracted columns like NestedGroup-related columns) and reduce the value of the tests. Prefer keeping the check enabled and extending the allow-list/conditions explicitly for any newly legitimate extracted columns, instead of disabling it under BE_TEST.
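The allow-list alternative suggested here could look roughly like the following. This is only a sketch: the helper name and prefix strings are illustrative, not actual Doris identifiers; the real check would enumerate the legitimate extracted-column families (e.g. NestedGroup-related columns) explicitly and stay compiled in under `BE_TEST`.

```cpp
#include <array>
#include <string_view>

// Hypothetical allow-list: instead of compiling the invariant out under
// BE_TEST, permit known-legitimate extracted column-name families and
// still fail on anything unexpected.
inline bool is_allowed_extracted_column(std::string_view name) {
    // Illustrative prefixes only; each newly legitimate family would be
    // added here deliberately, keeping the check meaningful in tests.
    constexpr std::array<std::string_view, 2> allowed_prefixes = {
        "__DORIS_VARIANT_SPARSE__", "NestedGroup."};
    for (std::string_view prefix : allowed_prefixes) {
        if (name.size() >= prefix.size() &&
            name.substr(0, prefix.size()) == prefix) {
            return true;
        }
    }
    return false;
}
```

The design trade-off: an explicit allow-list forces each new extracted-column kind through review, whereas an `#ifndef BE_TEST` guard silently exempts all of them in unit-test builds.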
```groovy
def load_json_data = {table_name, file_name ->
    // load the json data
    streamLoad {
        table "${table_name}"

        // set http request header params
        set 'read_json_by_line', 'true'
        set 'format', 'json'
        set 'max_filter_ratio', '0.1'
        set 'memtable_on_sink_node', 'true'
        file file_name // import json file
        time 10000 // limit inflight 10s

        // if a check callback is declared, the default check condition is ignored,
        // so you must check all conditions yourself
        check { result, exception, startTime, endTime ->
            if (exception != null) {
                throw exception
            }
            logger.info("Stream load ${file_name} result: ${result}".toString())
            def json = parseJson(result)
            assertEquals("success", json.Status.toLowerCase())
            // assertEquals(json.NumberTotalRows, json.NumberLoadedRows + json.NumberUnselectedRows)
            assertTrue(json.NumberLoadedRows > 0 && json.LoadBytes > 0)
        }
    }
}
```
load_json_data closure is defined but never used in this suite. Keeping unused streamLoad logic makes the test harder to read and maintain; please remove it or use it for data loading (as in predefined_typed_to_sparse.groovy).
Suggested change: delete the unused `load_json_data` closure.
```groovy
    }
}

def tableName = "test_predefine_typed_to_sparse"
```
This suite uses the same table name (test_predefine_typed_to_sparse) as predefined_typed_to_sparse.groovy. If the regression runner executes suites concurrently (or reuses the same DB between suites), the DROP/CREATE/INSERT here can race and cause flaky failures. Please use a table name unique to this suite (e.g. include _1shard).
```diff
-def tableName = "test_predefine_typed_to_sparse"
+def tableName = "test_predefine_typed_to_sparse_1shard"
```
TPC-H: Total hot run time: 27809 ms
TPC-DS: Total hot run time: 153493 ms
BE UT Coverage Report: Increment line coverage (increment coverage report)
BE Regression && UT Coverage Report: Increment line coverage (increment coverage report)
PR approved by anyone and no changes requested.

PR approved by at least one committer and no changes requested.