Skip to content

[fix](variant) fix variant stats merge#61090

Merged
eldenmoon merged 1 commit intoapache:masterfrom
eldenmoon:var-stats
Mar 6, 2026
Merged

[fix](variant) fix variant stats merge#61090
eldenmoon merged 1 commit intoapache:masterfrom
eldenmoon:var-stats

Conversation

@eldenmoon
Copy link
Member

introduced by #60847

Copilot AI review requested due to automatic review settings March 5, 2026 15:53
@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon
Copy link
Member Author

run buildall

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes incorrect/duplicated merging of VARIANT column statistics (sparse/doc-value maps) during segment read/init, which was introduced by #60847’s NestedGroup/variant meta changes.

Changes:

  • Update VariantStatistics to support additive merging of sparse and doc-value stats from per-(bucket) column metas.
  • Adjust VariantColumnReader::init() to source sparse/doc stats from their respective column metas (and only read scalar flags from root stats).
  • Add a new regression suite/output for typed-paths-to-sparse with default_variant_sparse_hash_shard_count = 1 and compaction verification.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
regression-test/suites/variant_p0/predefine/predefined_typed_to_sparse_1shard.groovy New regression suite covering typed-to-sparse behavior with 1 shard and post-compaction checks.
regression-test/data/variant_p0/predefine/predefined_typed_to_sparse_1shard.out Golden output for the new regression suite queries (pre/post compaction).
be/src/vec/common/variant_util.cpp Skips an output-schema extracted-column invariant check under BE_TEST.
be/src/olap/rowset/segment_v2/variant/variant_statistics.h Replaces full overwrite-from-PB with additive merge helpers for sparse/doc-value maps.
be/src/olap/rowset/segment_v2/variant/variant_column_reader.cpp Changes stats initialization to merge from sparse/doc metas and avoid root-map duplication.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 948 to +962
@@ -958,6 +959,7 @@ Status VariantCompactionUtil::check_path_stats(const std::vector<RowsetSharedPtr
column->name());
}
}
#endif
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapping the output-rowset extracted-column invariant check in #ifndef BE_TEST makes unit-test builds skip a correctness validation that still runs in production builds. This can mask real regressions (e.g. new extracted columns like NestedGroup-related columns) and reduce the value of the tests. Prefer keeping the check enabled and extending the allow-list/conditions explicitly for any newly legitimate extracted columns, instead of disabling it under BE_TEST.

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +50
def load_json_data = {table_name, file_name ->
// load the json data
streamLoad {
table "${table_name}"

// set http request header params
set 'read_json_by_line', 'true'
set 'format', 'json'
set 'max_filter_ratio', '0.1'
set 'memtable_on_sink_node', 'true'
file file_name // import json file
time 10000 // limit inflight 10s

// if declared a check callback, the default check condition will ignore.
// So you must check all condition

check { result, exception, startTime, endTime ->
if (exception != null) {
throw exception
}
logger.info("Stream load ${file_name} result: ${result}".toString())
def json = parseJson(result)
assertEquals("success", json.Status.toLowerCase())
// assertEquals(json.NumberTotalRows, json.NumberLoadedRows + json.NumberUnselectedRows)
assertTrue(json.NumberLoadedRows > 0 && json.LoadBytes > 0)
}
}
}

Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_json_data closure is defined but never used in this suite. Keeping unused streamLoad logic makes the test harder to read and maintain; please remove it or use it for data loading (as in predefined_typed_to_sparse.groovy).

Suggested change
def load_json_data = {table_name, file_name ->
// load the json data
streamLoad {
table "${table_name}"
// set http request header params
set 'read_json_by_line', 'true'
set 'format', 'json'
set 'max_filter_ratio', '0.1'
set 'memtable_on_sink_node', 'true'
file file_name // import json file
time 10000 // limit inflight 10s
// if declared a check callback, the default check condition will ignore.
// So you must check all condition
check { result, exception, startTime, endTime ->
if (exception != null) {
throw exception
}
logger.info("Stream load ${file_name} result: ${result}".toString())
def json = parseJson(result)
assertEquals("success", json.Status.toLowerCase())
// assertEquals(json.NumberTotalRows, json.NumberLoadedRows + json.NumberUnselectedRows)
assertTrue(json.NumberLoadedRows > 0 && json.LoadBytes > 0)
}
}
}

Copilot uses AI. Check for mistakes.
}
}

def tableName = "test_predefine_typed_to_sparse"
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suite uses the same table name (test_predefine_typed_to_sparse) as predefined_typed_to_sparse.groovy. If the regression runner executes suites concurrently (or reuses the same DB between suites), the DROP/CREATE/INSERT here can race and cause flaky failures. Please use a table name unique to this suite (e.g. include _1shard).

Suggested change
def tableName = "test_predefine_typed_to_sparse"
def tableName = "test_predefine_typed_to_sparse_1shard"

Copilot uses AI. Check for mistakes.
@doris-robot
Copy link

TPC-H: Total hot run time: 27809 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d6a73048fff58639017f72bcea369c4cd9fe3dde, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17694	4496	4308	4308
q2	q3	10657	811	527	527
q4	4678	374	253	253
q5	7572	1196	1013	1013
q6	173	176	146	146
q7	769	841	664	664
q8	9304	1483	1338	1338
q9	4918	4797	4757	4757
q10	6251	1918	1647	1647
q11	467	264	242	242
q12	704	573	476	476
q13	18051	2942	2200	2200
q14	232	225	217	217
q15	911	798	802	798
q16	736	720	672	672
q17	718	848	428	428
q18	6098	5406	5325	5325
q19	1186	1020	617	617
q20	502	516	387	387
q21	4561	2227	1524	1524
q22	379	332	270	270
Total cold run time: 96561 ms
Total hot run time: 27809 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4718	4536	4495	4495
q2	q3	3896	4411	3855	3855
q4	884	1177	766	766
q5	4104	4373	4414	4373
q6	176	176	139	139
q7	1792	1626	1502	1502
q8	2686	2916	2601	2601
q9	7371	7285	7603	7285
q10	3806	4031	3590	3590
q11	499	426	489	426
q12	571	613	457	457
q13	2694	3227	2340	2340
q14	288	300	276	276
q15	881	826	821	821
q16	740	783	727	727
q17	1193	1400	1317	1317
q18	7215	6843	6579	6579
q19	869	868	960	868
q20	2045	2177	2132	2132
q21	3994	3506	3398	3398
q22	471	448	379	379
Total cold run time: 50893 ms
Total hot run time: 48326 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 153493 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d6a73048fff58639017f72bcea369c4cd9fe3dde, data reload: false

query5	4344	626	509	509
query6	328	227	208	208
query7	4210	481	269	269
query8	368	243	247	243
query9	8719	2743	2742	2742
query10	503	376	332	332
query11	7250	5787	5651	5651
query12	192	135	131	131
query13	1267	444	358	358
query14	5828	3796	3637	3637
query14_1	2824	2808	2821	2808
query15	215	207	176	176
query16	973	478	449	449
query17	1105	720	621	621
query18	2440	451	376	376
query19	207	205	174	174
query20	134	129	126	126
query21	228	146	121	121
query22	4859	5049	4826	4826
query23	16592	16033	15842	15842
query23_1	16031	15831	15716	15716
query24	7847	1709	1237	1237
query24_1	1227	1237	1244	1237
query25	560	466	423	423
query26	1245	254	150	150
query27	2808	499	285	285
query28	4507	1892	1897	1892
query29	870	565	504	504
query30	307	247	208	208
query31	1346	1286	1222	1222
query32	89	72	70	70
query33	529	317	280	280
query34	937	924	557	557
query35	641	690	599	599
query36	1058	1113	1008	1008
query37	131	97	84	84
query38	2933	2923	2864	2864
query39	931	880	865	865
query39_1	833	827	835	827
query40	226	158	137	137
query41	62	60	59	59
query42	304	292	299	292
query43	236	243	219	219
query44	
query45	196	192	185	185
query46	875	984	618	618
query47	2106	2156	2046	2046
query48	315	329	234	234
query49	625	457	371	371
query50	684	273	222	222
query51	4169	4075	4082	4075
query52	288	296	278	278
query53	291	333	280	280
query54	314	268	263	263
query55	93	98	84	84
query56	316	326	309	309
query57	1369	1341	1278	1278
query58	286	278	306	278
query59	1370	1461	1309	1309
query60	346	339	306	306
query61	151	150	147	147
query62	625	591	546	546
query63	309	279	281	279
query64	5113	1278	1001	1001
query65	
query66	1465	467	350	350
query67	16443	16527	16423	16423
query68	
query69	383	305	294	294
query70	994	954	972	954
query71	341	308	307	307
query72	2817	2733	2596	2596
query73	542	550	316	316
query74	9994	9998	9719	9719
query75	2896	2787	2479	2479
query76	2285	1039	675	675
query77	378	387	318	318
query78	11199	11377	10705	10705
query79	1167	789	614	614
query80	850	650	566	566
query81	521	282	253	253
query82	1339	155	117	117
query83	352	273	252	252
query84	254	127	110	110
query85	1009	552	445	445
query86	392	320	294	294
query87	3205	3095	2968	2968
query88	3520	2661	2638	2638
query89	420	368	338	338
query90	1800	179	179	179
query91	171	166	138	138
query92	78	74	72	72
query93	916	824	509	509
query94	501	322	255	255
query95	583	390	328	328
query96	636	516	227	227
query97	2471	2524	2444	2444
query98	233	225	213	213
query99	996	1010	910	910
Total cold run time: 234102 ms
Total hot run time: 153493 ms

@hello-stephen
Copy link
Contributor

BE UT Coverage Report

Increment line coverage 100.00% (10/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.56% (19647/37382)
Line Coverage 36.18% (183407/506951)
Region Coverage 32.49% (142340/438084)
Branch Coverage 33.43% (61721/184640)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 83.33% (10/12) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 63.23% (23145/36604)
Line Coverage 46.51% (235046/505412)
Region Coverage 43.53% (192510/442232)
Branch Coverage 44.64% (82680/185216)

1 similar comment
@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 83.33% (10/12) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 63.23% (23145/36604)
Line Coverage 46.51% (235046/505412)
Region Coverage 43.53% (192510/442232)
Branch Coverage 44.64% (82680/185216)

Copy link
Contributor

@csun5285 csun5285 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

PR approved by anyone and no changes requested.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 6, 2026
@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 83.33% (10/12) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 63.28% (23162/36604)
Line Coverage 46.53% (235173/505412)
Region Coverage 43.58% (192728/442232)
Branch Coverage 44.69% (82765/185216)

@eldenmoon eldenmoon merged commit 1babe06 into apache:master Mar 6, 2026
35 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants