[Chore](pick) pick #60141 #59410 #61287
Conversation
…efault value (apache#60141)

Increasing the batch_size results in an overall performance improvement of approximately 5% in Doris (some operators become slower, while others become faster). This pull request adjusts the default configuration values in the `SessionVariable` class to improve query parallelism and batch-processing efficiency by increasing the maximum number of parallel exchange instances and the default batch size.

Configuration defaults updated:
* Increased the default value of `exchangeInstanceParallel` from 100 to 256, allowing more parallel exchange instances during query execution.
* Increased the default value of `batchSize` from 4064 to 8160, enabling larger batch processing in queries.
<img width="298" height="2142" alt="图片" src="https://github.com/user-attachments/assets/3be70146-c4bd-4ff7-ac96-2645f89fed14" />

This pull request refactors and optimizes the handling of null maps and key packing in hash join and hash table code, focusing on improving SIMD (Single Instruction, Multiple Data) usage and simplifying null-bitmap logic. The changes replace older byte-searching utilities with new, more efficient SIMD-based functions, update how null bitmaps are packed and processed, and streamline column null-data replacement. Additionally, the logic for determining hash key types and handling fixed-key serialization is improved for better correctness and performance.

Key improvements and changes:
* Introduced new SIMD-based functions `contain_one` and `contain_zero` in `simd/bits.h`, replacing the older `contain_byte` and related logic for checking the presence of ones or zeros in null maps, resulting in more efficient null detection.
* Updated all usages of null-map checks throughout the codebase to use the new `contain_one` and `contain_zero` functions, simplifying and unifying the logic for detecting nulls in columns and filters. [[1]](diffhunk://#diff-0732e01c1a3f38997ada381c43aff98286e86ca7519db5469a6e4dcdec5bce44L195-L200) [[2]](diffhunk://#diff-3110bab7d558f46b88ae1958b09ac369a92cac4bff98b280b2cf83d2d7aecbf4L117-R118) [[3]](diffhunk://#diff-3110bab7d558f46b88ae1958b09ac369a92cac4bff98b280b2cf83d2d7aecbf4L369-R371) [[4]](diffhunk://#diff-8981dd2e1f08aaa46a97aeef27bd906c64d1bb08deedc0fe1d94c1c49dc064ceL100-R100) [[5]](diffhunk://#diff-9fd61a223bcb3b7a9cb93c2d26c9364d8cce2131673fe286f22a80b09c6fd2c6L283-R283) [[6]](diffhunk://#diff-9fd61a223bcb3b7a9cb93c2d26c9364d8cce2131673fe286f22a80b09c6fd2c6L601-R605)
* Refactored the logic for packing null maps into hash keys in `MethodKeysFixed`, introducing new templates and helper functions for interleaved null-map packing, and replacing the old bitmap-size calculation with a simplified approach. This improves both performance and maintainability. [[1]](diffhunk://#diff-b8623712a5a1728bb77cc67b6ee1bbf16ef2b842044f6f6bab64c3fc5c4575f3R478-R540) [[2]](diffhunk://#diff-b8623712a5a1728bb77cc67b6ee1bbf16ef2b842044f6f6bab64c3fc5c4575f3L500-R611)
* Updated the logic for initializing and inserting keys, ensuring correct handling of nulls and simplifying offset calculations for key data. [[1]](diffhunk://#diff-b8623712a5a1728bb77cc67b6ee1bbf16ef2b842044f6f6bab64c3fc5c4575f3R653) [[2]](diffhunk://#diff-b8623712a5a1728bb77cc67b6ee1bbf16ef2b842044f6f6bab64c3fc5c4575f3L619-R692) [[3]](diffhunk://#diff-b8623712a5a1728bb77cc67b6ee1bbf16ef2b842044f6f6bab64c3fc5c4575f3L645-R712)
* Simplified the `replace_column_null_data` methods for vector and decimal columns by removing unnecessary null-count checks and optimizing the replacement logic. [[1]](diffhunk://#diff-3fa47f544ff08bb2c8232af99312c0bbf2c58cac9da7a2b06473282b99ad5aa4L528-R530) [[2]](diffhunk://#diff-5fdf450def955da3201cc889aa870d94def054d1168f1ef3def32e8f009dc65aL526-L529)
* Improved the logic for determining the hash key type in `hash_key_type.h` to handle cases where the number of data types exceeds the bit size, defaulting to serialized keys as needed. [[1]](diffhunk://#diff-4f1fb8a89cd0e13a719c3427b1ae7581b42cb7325755a3ceac4c44bdc64bd144R83-R86) [[2]](diffhunk://#diff-4f1fb8a89cd0e13a719c3427b1ae7581b42cb7325755a3ceac4c44bdc64bd144L97-R101)
* Removed unused functions and updated includes to ensure all SIMD utilities are properly imported where needed. [[1]](diffhunk://#diff-b8623712a5a1728bb77cc67b6ee1bbf16ef2b842044f6f6bab64c3fc5c4575f3R20-R26) [[2]](diffhunk://#diff-b8623712a5a1728bb77cc67b6ee1bbf16ef2b842044f6f6bab64c3fc5c4575f3R36) [[3]](diffhunk://#diff-b8623712a5a1728bb77cc67b6ee1bbf16ef2b842044f6f6bab64c3fc5c4575f3L292-L295)

These changes collectively improve performance, maintainability, and correctness in hash join operations, especially in handling nullable columns and SIMD optimizations.
run buildall
Pull request overview
Cherry-picks changes from #60141 and #59410, primarily updating SIMD-based null-map utilities/usages and adjusting a couple of execution/session defaults.
Changes:
- Update multiple null-map checks to use the new `simd::contain_one`/`contain_zero` helpers.
- Refactor packing of fixed hash keys/null bitmaps to reduce work when columns have no nulls.
- Adjust FE session defaults (`exchangeInstanceParallel`, `batchSize`) and fix/align some BE tests.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Updates default session variables for exchange parallelism and batch size. |
| be/test/vec/columns/column_nullable_test.cpp | Adjusts test construction of null map column initialization. |
| be/src/vec/functions/function_jsonb.cpp | Switches “all null” detection to contain_zero helper. |
| be/src/vec/data_types/serde/data_type_map_serde.cpp | Updates null-key detection in map->Arrow serialization. |
| be/src/vec/core/column_with_type_and_name.cpp | Avoids null-data replacement when null map contains no nulls. |
| be/src/vec/common/hash_table/hash_map_context.h | Refactors fixed-key packing and bitmap logic; adds SIMD helpers/includes. |
| be/src/vec/common/hash_table/hash_key_type.h | Changes fixed hash key type selection rules with new bitmap sizing logic. |
| be/src/vec/columns/column_vector.cpp | Changes null-data replacement implementation for vector columns. |
| be/src/vec/columns/column_nullable.cpp | Replaces byte-scans with contain_one/contain_zero. |
| be/src/vec/columns/column_decimal.cpp | Changes null-data replacement implementation for decimal columns. |
| be/src/util/simd/bits.h | Removes contain_byte, adds contain_one/contain_zero SIMD helpers. |
| be/src/pipeline/exec/nested_loop_join_probe_operator.h | Updates filter scan to contain_one. |
| be/src/pipeline/exec/join/process_hash_table_probe_impl.h | Updates null detection and filter checks to new SIMD helpers; adds guards before rewriting nulls. |
| be/src/pipeline/exec/hashjoin_build_sink.h | Replaces loop-based null detection with SIMD helper; avoids rewriting when no nulls. |
```diff
@@ -1216,7 +1216,7 @@ public void checkQuerySlotCount(String slotCnt) {
     // 4096 minus 16 + 16 bytes padding that in padding pod array
```
The comment describing the batchSize default ("4096 minus 16 + 16...") no longer matches the new default value 8160. Update the comment to reflect the new sizing rationale/constraints, or adjust the value if the intent is still a ~4K payload.
Suggested change:
```diff
-// 4096 minus 16 + 16 bytes padding that in padding pod array
+// 8160 + 16 + 16 = 8192
```
```cpp
size_t bitmap_size = nullmap_columns.empty() ? 0 : 1;
if (bitmap_size) {
    // set size to 0 at first, then use resize to call default constructor on index included from [0, row_numbers) to reset all memory
    // only need to reset the memory used to bitmap
    result.clear();
}
result.resize(row_numbers);
```
When bitmap_size == 0, the code no longer clears/value-initializes result before writing key bytes. If sizeof(T) is larger than the total bytes written by pack_fixeds (common when key_byte_size < sizeof(T)), padding/unused bytes can retain stale data and corrupt hash/equality behavior. Fix by ensuring the entire result buffer is zero-initialized for all rows (e.g., unconditional clear()+resize(), or an explicit memset over row_numbers * sizeof(T) before packing).
```diff
 for (size_t j = 0; j < key_columns.size(); ++j) {
-    const char* data = key_columns[j]->get_raw_data().data;
+    const char* __restrict data = key_columns[j]->get_raw_data().data;
```
This introduces two correctness risks: (1) assume_mutable() can create a new column instance when the original is shared; the returned mutable is not stored back into key_columns, so the null replacement may have no lasting effect. (2) Even if it mutates in-place, you captured data before the mutation; if assume_mutable() performs a clone/reallocation, data can become stale. Safer options: avoid mutating key_columns here (restore the previous per-row conditional copy of zero/default for nulls), or store/propagate the mutable column back and re-fetch data after mutation.
```diff
-if (has_null_column.size() && has_null_column[j]) {
-    const auto* nullmap =
-            assert_cast<const ColumnUInt8&>(*nullmap_columns[j]).get_data().data();
-    for (size_t i = 0; i < row_numbers; ++i) {
-        // make sure null cell is filled by 0x0
-        memcpy_fixed<Fixed, true>(
-                (char*)(&result[i]) + offset,
-                nullmap[i] ? (char*)&zero : data + i * sizeof(Fixed));
-    }
-} else {
-    for (size_t i = 0; i < row_numbers; ++i) {
-        memcpy_fixed<Fixed, true>((char*)(&result[i]) + offset,
-                                  data + i * sizeof(Fixed));
-    }
-}
+// make sure null cell is filled by 0x0
+key_columns[j]->assume_mutable()->replace_column_null_data(nullmap);
+auto* __restrict current = result_data + offset;
+for (size_t i = 0; i < row_numbers; ++i) {
+    memcpy_fixed<Fixed, true>(current, data);
+    current += sizeof(T);
+    data += sizeof(Fixed);
+}
```
```cpp
inline HashKeyType get_hash_key_type_fixed(const std::vector<vectorized::DataTypePtr>& data_types) {
    if (data_types.size() >= vectorized::BITSIZE) {
```
The >= vectorized::BITSIZE condition forces exactly BITSIZE keys to use serialized, even though other parts of this PR (e.g., bitmap stored in 1 byte and checks like <= BITSIZE) still appear to support up to BITSIZE keys. If BITSIZE is 8 (as implied by the 1-byte bitmap), this likely regresses the fixed-key path for 8 keys. Consider changing the threshold to > vectorized::BITSIZE (or aligning the other checks/packing logic to match this new restriction).
Suggested change:
```diff
-    if (data_types.size() >= vectorized::BITSIZE) {
+    if (data_types.size() > vectorized::BITSIZE) {
```
```diff
-size_t bitmap_size = has_null ? vectorized::get_bitmap_size(data_types.size()) : 0;
-return get_hash_key_type_with_fixed(bitmap_size + key_byte_size);
+return get_hash_key_type_with_fixed(has_null + key_byte_size);
```
```diff
 void ColumnVector<T>::replace_column_null_data(const uint8_t* __restrict null_map) {
     auto s = size();
-    size_t null_count = s - simd::count_zero_num((const int8_t*)null_map, s);
-    if (0 == null_count) {
-        return;
-    }
+    auto value = default_value();
     for (size_t i = 0; i < s; ++i) {
-        data[i] = null_map[i] ? default_value() : data[i];
+        data[i] = null_map[i] ? value : data[i];
     }
```
The early-exit when the null map contains no nulls was removed, so this now always scans and potentially writes the full column. If this method is invoked on many columns/blocks where nulls are rare (and not all call sites gate it with simd::contain_one), this can be a measurable regression. Consider reintroducing a fast-path check (e.g., if (!simd::contain_one(null_map, s)) return;) before the loop.
```diff
@@ -486,10 +486,6 @@ void ColumnDecimal<T>::compare_internal(size_t rhs_row_id, const IColumn& rhs,
 template <PrimitiveType T>
 void ColumnDecimal<T>::replace_column_null_data(const uint8_t* __restrict null_map) {
     auto s = size();
```
Same as ColumnVector::replace_column_null_data: removing the no-nulls early return can introduce a full scan/write even when null_map contains no 1s. Consider adding a fast-path guard (e.g., if (!simd::contain_one(null_map, s)) return;) to avoid unnecessary work in hot paths.
Suggested change:
```diff
 auto s = size();
+if (!simd::contain_one(null_map, s)) {
+    return;
+}
```
```cpp
template <int N>
void pack_nullmaps_interleaved(const uint8_t* const* datas, const uint8_t* bit_offsets,
                               size_t row_numbers, size_t stride, uint8_t* __restrict out) {
```
The fixed-key nullmap packing logic was substantially changed (interleaved packing + skipping columns without nulls). Add focused unit tests to cover: (1) packing correctness for multiple nullable columns (including sparse nulls), (2) behavior when some nullable columns have no nulls (should not break bit positions), and (3) ensuring packed keys are deterministic for null rows (no dependence on original underlying value bytes).