[feature](scan) Add value predicate pushdown control for MOR tables#60513
[feature](scan) Add value predicate pushdown control for MOR tables#60513dataroaring wants to merge 5 commits intomasterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
fffa13a to
94bc596
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds a session variable to control value column predicate pushdown for MOR (Merge-On-Read) tables, allowing users to selectively enable this optimization per table or globally.
Changes:
- Added session variable
enable_mor_value_predicate_pushdown_tablesfor selective table-level control - Added
isMorTable()helper method to identify MOR tables (UNIQUE_KEYS without merge-on-write) - Modified predicate pushdown logic to support value predicates on MOR tables when enabled
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| gensrc/thrift/PlanNodes.thrift | Added optional boolean field enable_mor_value_predicate_pushdown to TOlapScanNode struct |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Added session variable declaration, getter, and helper method to check if MOR value predicate pushdown is enabled |
| fe/fe-core/src/main/java/org/apache/doris/catalog/OlapTable.java | Added isMorTable() helper method to identify MOR tables |
| fe/fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java | Set thrift flag in toThrift() based on table type and session variable |
| be/src/pipeline/exec/scan_operator.h | Added virtual method _should_push_down_mor_value_predicate() with default false implementation |
| be/src/pipeline/exec/scan_operator.cpp | Modified predicate pushdown condition to include MOR value predicate check |
| be/src/pipeline/exec/olap_scan_operator.h | Declared override for _should_push_down_mor_value_predicate() |
| be/src/pipeline/exec/olap_scan_operator.cpp | Implemented _should_push_down_mor_value_predicate() to read thrift flag |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fe/fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java
Outdated
Show resolved
Hide resolved
|
|
||
| // Set MOR value predicate pushdown flag based on session variable | ||
| if (olapTable.isMorTable() && ConnectContext.get() != null) { | ||
| String dbName = olapTable.getQualifiedDbName(); |
There was a problem hiding this comment.
Potential null pointer exception. The method getQualifiedDbName() can return null (as seen in Table.java line 367-369). This null value would then be passed to isMorValuePredicatePushdownEnabled() which concatenates it with the table name on line 4691, potentially resulting in "null.tableName". Consider adding a null check for dbName or using an alternative method like getDBName() which handles the null case.
| String dbName = olapTable.getQualifiedDbName(); | |
| String dbName = olapTable.getQualifiedDbName(); | |
| if (dbName == null) { | |
| dbName = olapTable.getDBName(); | |
| } |
| String trimmed = enableMorValuePredicatePushdownTables.trim(); | ||
| if ("*".equals(trimmed)) { | ||
| return true; | ||
| } |
There was a problem hiding this comment.
The method doesn't handle null dbName parameter. If dbName is null, line 4691 will create "null.tableName" which could lead to unexpected behavior. Consider adding a null check for dbName at the beginning of the method, or using Objects.requireNonNull() to fail fast, or handle the null case explicitly by using only tableName for comparison when dbName is null.
| } | |
| } | |
| // When dbName is null, only compare using tableName to avoid creating "null.tableName". | |
| if (dbName == null) { | |
| for (String table : trimmed.split(",")) { | |
| if (table.trim().equalsIgnoreCase(tableName)) { | |
| return true; | |
| } | |
| } | |
| return false; | |
| } |
| public boolean isMorValuePredicatePushdownEnabled(String dbName, String tableName) { | ||
| if (enableMorValuePredicatePushdownTables == null | ||
| || enableMorValuePredicatePushdownTables.isEmpty()) { | ||
| return false; | ||
| } | ||
| String trimmed = enableMorValuePredicatePushdownTables.trim(); | ||
| if ("*".equals(trimmed)) { | ||
| return true; | ||
| } | ||
| String fullName = dbName + "." + tableName; | ||
| for (String table : trimmed.split(",")) { | ||
| if (table.trim().equalsIgnoreCase(fullName) | ||
| || table.trim().equalsIgnoreCase(tableName)) { | ||
| return true; | ||
| } | ||
| } | ||
| return false; | ||
| } |
There was a problem hiding this comment.
The new session variable and its helper method isMorValuePredicatePushdownEnabled() lack test coverage. Consider adding unit tests in SessionVariablesTest to verify:
- Empty string handling (default behavior)
- Wildcard '*' behavior
- Single table name matching (with and without database prefix)
- Multiple table names (comma-separated)
- Case-insensitive matching
- Whitespace handling in table names
- Edge cases like null dbName parameter
| if (table.trim().equalsIgnoreCase(fullName) | ||
| || table.trim().equalsIgnoreCase(tableName)) { |
There was a problem hiding this comment.
The method doesn't handle edge cases well when splitting by comma. If the variable contains only commas or has consecutive commas (e.g., "db.table1,,db.table2"), the split operation will produce empty strings. Calling trim() on empty strings and comparing them could lead to false positives. Consider filtering out empty strings after trimming, or using a more robust parsing approach.
| if (table.trim().equalsIgnoreCase(fullName) | |
| || table.trim().equalsIgnoreCase(tableName)) { | |
| String normalized = table.trim(); | |
| if (normalized.isEmpty()) { | |
| continue; | |
| } | |
| if (normalized.equalsIgnoreCase(fullName) | |
| || normalized.equalsIgnoreCase(tableName)) { |
94bc596 to
8eadc41
Compare
|
run buildall |
8eadc41 to
4a5d300
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 31991 ms |
ClickBench: Total hot run time: 28.16 s |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
4a5d300 to
a06783e
Compare
a0385a9 to
7d4aa18
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 31169 ms |
ClickBench: Total hot run time: 28.05 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
…own for MOR tables Add a new session variable `enable_mor_value_predicate_pushdown_tables` to allow users to selectively enable value column predicate pushdown for MOR (Merge-On-Read) tables. This can improve query performance by utilizing inverted indexes on value columns for filtering. The session variable accepts: - Comma-separated table names: `db1.tbl1,db2.tbl2` - Wildcard for all MOR tables: `*` - Empty string to disable (default) Changes: - Add session variable in SessionVariable.java with helper method - Add isMorTable() helper in OlapTable.java - Add Thrift field enable_mor_value_predicate_pushdown in TOlapScanNode - Set flag in OlapScanNode.toThrift() based on session variable - Add virtual method _should_push_down_mor_value_predicate() in scan_operator - Implement override in olap_scan_operator to read the flag - Modify predicate pushdown condition in scan_operator.cpp
…bles Add session variable enable_mor_value_predicate_pushdown_tables to control value predicate pushdown for MOR (Merge-On-Read) tables. When enabled, value column predicates are pushed to the storage layer for filtering. Key changes: - Add enable_mor_value_predicate_pushdown flag to ReaderParams and RowsetReaderContext - Extend BetaRowsetReader::_should_push_down_value_predicates() to check the flag - Drop __DORIS_DELETE_SIGN__ predicate from storage-layer pushdown to prevent deleted rows from reappearing (delete sign must only be evaluated post-merge) - Add regression test covering dedup, delete sign, delete predicate, and update scenarios
…OR tables without merge Add a per-table session variable that allows reading MOR (Merge-On-Read) UNIQUE tables as DUP tables, skipping storage engine merge and delete sign filter while still honoring delete predicates. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…omponent parsing Address review feedback: replace naive string concatenation matching with component-based parsing (split by ".") consistent with TableNameInfo, fixing null dbName handling and empty entry edge cases. Co-Authored-By: Claude Opus 4.6 <[email protected]>
0bfacc2 to
1c6db77
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 29334 ms |
TPC-DS: Total hot run time: 184013 ms |
|
/review |
There was a problem hiding this comment.
Code Review Summary
This PR adds two session variables (enable_mor_value_predicate_pushdown_tables and read_mor_as_dup_tables) to control value column predicate pushdown and merge-skipping for MOR (Merge-On-Read) unique key tables. The implementation spans FE (session variable, plan generation, Nereids rules), Thrift protocol, and BE (scan operator, tablet reader, rowset reader), with regression tests and FE unit tests.
Critical Checkpoint Conclusions
1. Goal and correctness: The feature enables per-segment value predicate pushdown for MOR tables. However, enable_mor_value_predicate_pushdown can produce incorrect results on MOR tables with sequence columns (see inline comment on beta_rowset_reader.cpp). The existing safe path explicitly guards against this with sequence_id_idx == -1, but the new flag bypasses this check unconditionally.
2. Modification scope and focus: The changes are well-scoped to the feature. Two session variables for two distinct behaviors (value predicate pushdown vs full DUP-mode reading) is a clean design.
3. Concurrency: No concurrency concerns — the feature propagates immutable flags through the read path.
4. Lifecycle management: No lifecycle concerns.
5. Configuration items: Two session variables added with needForward = true. read_mor_as_dup_tables correctly has affectQueryResultInPlan = true. However, enable_mor_value_predicate_pushdown_tables does NOT have affectQueryResultInPlan despite the fact that it can affect query results (the test explicitly demonstrates stale data).
6. Incompatible changes: New optional Thrift fields (24, 25) on TOlapScanNode are backward-compatible.
7. Parallel code paths: All parallel paths (Nereids planner, legacy planner via shared OlapScanNode.toThrift(), BE storage layer) are covered.
8. Special conditional checks: The delete sign exclusion logic in tablet_reader.cpp is correct and well-commented.
9. Test coverage: Good breadth (dedup, delete markers, delete predicates, MOW/DUP unaffected, multiple tables, wildcard). However, no test for MOR tables with sequence columns — the most dangerous scenario. Regression tests have style violations (tables dropped at end, def tableName usage).
10. Observability: No new metrics or logging added. Given this is a user-controlled session variable, this is acceptable.
11. Transaction/persistence: Not applicable.
12. Data writes: Not applicable (read-only feature).
13. FE-BE variable passing: Thrift fields added and set in OlapScanNode.toThrift(), consumed correctly in BE.
14. Performance: The isTableInList() helper parses the comma-separated string on every call without caching. For queries scanning many partitions/tablets this could add up, though it's likely negligible in practice.
15. Other issues: NPE risk in LogicalOlapScan.java where ConnectContext.get() is called without null check (see inline comment).
Issue Summary
| # | Severity | Description |
|---|---|---|
| 1 | HIGH | enable_mor_value_predicate_pushdown bypasses sequence column safety check — can produce incorrect results on MOR tables with sequence columns |
| 2 | HIGH | NPE risk in LogicalOlapScan.java — ConnectContext.get() not null-checked |
| 3 | MEDIUM | Test explicitly asserts stale/incorrect data as expected behavior (select_update_enabled_old), proving the feature can return wrong results when keys are updated |
| 4 | MEDIUM | Redundant C-style casts of _local_state in olap_scanner.cpp |
| 5 | LOW | Regression test style: tables dropped at end of test, def tableName usage |
| _read_context->enable_unique_key_merge_on_write); | ||
| _read_context->enable_unique_key_merge_on_write || | ||
| _read_context->enable_mor_value_predicate_pushdown); | ||
| } |
There was a problem hiding this comment.
[HIGH] Sequence column safety bypass: The new enable_mor_value_predicate_pushdown flag is OR'd unconditionally, bypassing the existing safety check _read_context->sequence_id_idx == -1.
When a MOR table has a sequence column, the merge winner is determined by the sequence value (not version order). Pushing value predicates to individual segments can filter out the correct merge winner, causing a different (wrong) row to survive merge.
Example: key=1 in Segment A has (value=10, seq=100) and in Segment B has (value=20, seq=50). With predicate value > 15:
- Without pushdown: merge picks A (seq=100 wins), then predicate filters it → empty. Correct.
- With pushdown: A filtered at segment level, B passes. Merge sees only B → returns
(key=1, value=20). Wrong — A should have won.
Consider either:
- Adding a
sequence_id_idx == -1guard to this branch as well, or - Documenting this limitation clearly in the session variable description and adding a WARNING log when the flag is enabled on a table with a sequence column.
| if (getTable().getKeysType() == KeysType.UNIQUE_KEYS | ||
| && getTable().isMorTable() | ||
| && ConnectContext.get().getSessionVariable().isReadMorAsDupEnabled( | ||
| getTable().getQualifiedDbName(), getTable().getName())) { |
There was a problem hiding this comment.
[HIGH] NPE risk: ConnectContext.get() can return null (e.g., during background tasks, internal metadata operations, or unit tests without thread-local ConnectContext). This will cause a NullPointerException.
Other places in this PR correctly guard with ConnectContext.get() != null (e.g., SetPreAggStatus.java line 158, OlapScanNode.java line 1194). This location should be consistent:
if (ConnectContext.get() != null
&& getTable().getKeysType() == KeysType.UNIQUE_KEYS
&& getTable().isMorTable()
&& ConnectContext.get().getSessionVariable().isReadMorAsDupEnabled(
getTable().getQualifiedDbName(), getTable().getName())) {
return;
}Note: the pre-existing skipDeleteBitmap check at line 768 has the same NPE risk.
| if (olap_scan_node.__isset.enable_mor_value_predicate_pushdown && | ||
| olap_scan_node.enable_mor_value_predicate_pushdown) { | ||
| _tablet_reader_params.enable_mor_value_predicate_pushdown = true; | ||
| } |
There was a problem hiding this comment.
[MEDIUM] enable_mor_value_predicate_pushdown is only set when !skip_storage_engine_merge(): This means when read_mor_as_dup is active (which sets direct_mode = true and skips merge at line 340), this flag is NOT set. This is correct because in DUP mode there's no merge to interfere with. But the interaction between these two flags should be documented — currently a reader has to trace through both code paths to understand this.
Also note: at line 337, olap_local_state is obtained via proper static_cast, but at line 480 (just above this block), olap_scan_local_state is obtained via C-style cast (pipeline::OlapScanLocalState*). Please use consistent static_cast and consider reusing the variable from line 337 to avoid the redundant cast.
|
|
||
| // v1=100: rowset 2 (v1=500) filtered per-segment, old version survives. | ||
| // Returns stale data — this proves pushdown is filtering at storage layer. | ||
| qt_select_update_enabled_old """ |
There was a problem hiding this comment.
[MEDIUM] Test explicitly asserts stale data as correct behavior: This test proves that enable_mor_value_predicate_pushdown can return incorrect results (1, 100, old instead of empty) when keys have been updated across rowsets. The .out file records this stale result as the expected output.
While the test comment explains this is intentional to prove the pushdown is happening, having a test that asserts incorrect query results as "expected" is concerning. Users who enable this session variable may not realize it can return stale data for any key that has been updated.
Consider:
- Adding a prominent WARNING in the session variable description that this feature is only safe for insert-only/dedup-only workloads and will produce incorrect results if keys are ever updated.
- Potentially logging a warning when the session variable is set.
| """ | ||
|
|
||
| // Cleanup | ||
| sql "SET enable_mor_value_predicate_pushdown_tables = ''" |
There was a problem hiding this comment.
[LOW] Test style: Per coding standards, tables should be dropped before use (not after tests end), to preserve the environment for debugging. This cleanup block at the end should be removed — the DROP TABLE IF EXISTS before each CREATE TABLE already handles cleanup.
| assertTrue(table2DupResult.size() == 2, "Table2 in DUP mode should return 2 rows, got ${table2DupResult.size()}") | ||
|
|
||
| // === Cleanup === | ||
| sql "SET read_mor_as_dup_tables = '';" |
There was a problem hiding this comment.
[LOW] Test style: Same issue as the other test — tables should not be dropped at the end of tests. Remove this cleanup block to preserve tables for debugging.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 27781 ms |
TPC-DS: Total hot run time: 153312 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Summary
enable_mor_value_predicate_pushdown_tablesto control value column predicate pushdown for MOR (Merge-On-Read) tablesdb.table,table, or*), thrift flag on TOlapScanNode,isMorTable()helper on OlapTable_should_push_down_value_predicates()for all rowsets; skip__DORIS_DELETE_SIGN__to preserve delete correctness; keep VExpr in conjuncts as post-merge safety netMotivation
MOR tables with inverted indexes on value columns cannot utilize those indexes for filtering because value predicates are not pushed to the storage layer. This feature enables per-segment value predicate pushdown for dedup-only/insert-only MOR workloads where the same key always carries identical values across rowsets, allowing inverted indexes and zone maps to filter data early.
Design
Two-layer predicate approach:
_conjunctsas post-merge safety netDelete sign handling:
__DORIS_DELETE_SIGN__predicate is excluded from per-segment pushdown to prevent delete markers from being filtered before merge, which would cause deleted rows to reappear.Test plan
INSERT INTO (..., __DORIS_DELETE_SIGN__) VALUES (..., 1)DELETE FROM ... WHERE