[Improvement](agg) opt for StringHashTable's for_each #60941
BiteTheDDDDt wants to merge 2 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall
Pull request overview
This PR introduces batch-oriented helpers to speed up StringHashMap operations by grouping rows by StringHashTable sub-map (m0/m1/m2/m3/m4/ms) and then processing each sub-map in a tight loop, avoiding per-row dispatch branching.
Changes:
- Expose StringHashTable sub-maps via `visit_submaps()` (and direct accessors) to enable sub-map batch processing.
- Add `StringKeySubTableGroups` plus new batch helpers (`lazy_emplace_batch`, `lazy_emplace_batch_void`, `find_batch`) to process StringHashMap rows grouped by sub-table.
- Switch several aggregation pipeline operators to use the new batch helpers for emplace/find.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| be/src/vec/common/hash_table/string_hash_table.h | Adds visit_submaps() and sub-map accessors to iterate over StringHashTable sub-tables. |
| be/src/vec/common/hash_table/hash_map_context.h | Adds string sub-table grouping and new batch emplace/find helpers; adjusts prefetch behavior. |
| be/src/pipeline/rec_cte_shared_state.h | Replaces per-row emplace loop with lazy_emplace_batch_void. |
| be/src/pipeline/exec/streaming_aggregation_operator.cpp | Replaces per-row emplace loops with lazy_emplace_batch (and limit path). |
| be/src/pipeline/exec/distinct_streaming_aggregation_operator.cpp | Replaces per-row emplace loop with lazy_emplace_batch_void. |
| be/src/pipeline/exec/aggregation_source_operator.cpp | Replaces per-row emplace loop with lazy_emplace_batch. |
| be/src/pipeline/exec/aggregation_sink_operator.cpp | Replaces per-row emplace loops with lazy_emplace_batch and per-row find loop with find_batch. |
update
BE Regression && UT Coverage Report
Increment line coverage / Increment coverage report
/review
Code Review Summary: StringHashTable Batch Operations Optimization
This PR optimizes StringHashTable operations in the aggregation path by pre-grouping row indices by sub-table (based on string key length), then processing each sub-table group in a batch with targeted prefetch. This avoids the per-row dispatch() branching and achieves better cache locality.
Critical Checkpoint Conclusions
1. Goal and correctness: The goal is to optimize StringHashTable's emplace/find hot path for aggregation. The get_group() classification logic correctly mirrors the existing dispatch() logic (including the critical trailing-zero-byte check). The nullable handling correctly replicates the original HashMethodSingleLowNullableColumn::lazy_emplace_key semantics (void vs non-void creator_for_null_key signatures). The convert_key_for_submap correctly uses the same to_string_key<T>() conversions. Appears correct.
2. Modification scope: The change is well-scoped to aggregation operators (sink, source, streaming, distinct streaming, RecCTE). Join operators use a completely different hash table mechanism and are correctly excluded.
3. Concurrency: No new concurrency concerns. The batch operations are called from the same single-threaded per-task execution context as the original per-row loops.
4. Lifecycle management: No special lifecycle concerns. StringKeySubTableGroups is a member of MethodStringNoCache and follows its lifetime.
5. Configuration items: None added.
6. Incompatible changes: None - this is a pure internal optimization with no format/protocol changes.
7. Parallel code paths: The non-StringHashMap fallback path in each batch helper correctly delegates to the original per-row loop with standard prefetch. Partition sort sink (partition_sort_sink_operator.cpp) also uses StringHashMap but iterates in reverse order with early exit, making it architecturally incompatible with this batch optimization - this is acceptable.
8. Test coverage: The PR description has no test checkboxes marked. No new tests are added. While existing aggregation regression tests should cover functional correctness, the PR would benefit from explicit testing with nullable string keys, empty strings, strings with trailing zero bytes, and mixed-length strings to validate the sub-table grouping. Needs attention.
9. Observability: No new observability needed - the existing _hash_table_emplace_timer and _hash_table_input_counter still cover the batch operations.
10. Other issues: See inline comments for specific code concerns.
Issues Found
- (Medium) `std::forward` used inside loops: while safe in current call patterns, this is a latent footgun.
- (Low) Dead code: the `get_submap_m0()` through `get_submap_ms()` accessors are never used.
- (Medium) No tests added or marked in the checklist despite this being a non-trivial optimization that changes the iteration order of hash table operations.
What problem does this PR solve?
This pull request refactors and optimizes the batch operations for hash table insertion and lookup in the aggregation and distinct aggregation pipelines. The main improvement is the introduction of more efficient batch-processing methods for hash tables, especially for string-keyed tables, which now benefit from better cache locality through sub-table grouping. Several aggregation and distinct aggregation operators are updated to use these new batch methods, resulting in cleaner and more efficient code.
Batch hash table operations and code modernization:
Replaced per-row hash table insertion and lookup loops in the aggregation and streaming aggregation operators (`aggregation_sink_operator.cpp`, `aggregation_source_operator.cpp`, `streaming_aggregation_operator.cpp`, `distinct_streaming_aggregation_operator.cpp`, and `rec_cte_shared_state.h`) with new batch methods: `vectorized::lazy_emplace_batch`, `vectorized::lazy_emplace_batch_void`, and `vectorized::find_batch`, improving performance and code clarity.
Updated the hash table method interface in `hash_map_context.h` by removing unnecessary `ALWAYS_INLINE` annotations and simplifying the `prefetch`, `find`, and `lazy_emplace` methods, making the code easier to maintain.
String-keyed hash table optimizations:
Introduced the `StringKeySubTableGroups` struct in `hash_map_context.h`, which groups row indices by string length for string-keyed hash tables. This enables batch processing by sub-table, improving cache locality and performance.
Integrated sub-table grouping into the string hash table initialization for aggregation (not join), so that batch operations can efficiently process each sub-table group.
Minor code cleanups:
These changes collectively improve the efficiency and maintainability of the aggregation pipeline, especially for workloads involving string keys.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)