
ENG-39590/velox-mor-column-projection-repro-ut #571

Draft

Davis-Zhang-Onehouse wants to merge 5 commits into apache:main from vinishjail97:ENG-39590/velox-mor-column-projection-repro-ut

Conversation

@Davis-Zhang-Onehouse

Description

How are the changes test-covered

  • N/A
  • Automated tests (unit and/or integration tests)
  • Manual tests
    • Details are described below

vinishjail97 and others added 4 commits March 25, 2026 23:45
… bug

Adds crates/core/tests/velox_mor_repro_tests.rs — a pure Rust integration
test that mirrors exactly how Velox/Gluten constructs hudi-rs components
when processing a Hudi MOR split via the C++ FFI bridge.

The test replicates the three-step call chain from HudiSplitReader::prepareSplit():
  1. FileGroupReader::new_with_options(base_uri, [])   (hudiOptions always empty)
  2. FileGroup::new_with_base_file_name + add_log_files_from_names  (per partition)
  3. reader.read_file_slice(&file_slice)               (MOR merge)

Uses the real test table created by dev/test-hudi-mor.sh in gluten-internal
(/home/ubuntu/ws1/test-data/hudi_mor_partitioned, 4 partitions, v9 MOR,
EVENT_TIME_ORDERING). eprintln! log lines mirror the [HudiMOR] Velox log
format so both outputs can be compared side-by-side to prove equivalence.

Reproduces the symptom from ENG-39590:
  "Parquet full read (non-streaming) for '...': reading all 10 cols: [...]"
appears for every partition, while
  "Parquet column projection for ..."
never fires — confirming that no column projection is passed into hudi-rs
from the Velox integration layer (ReadSchema is ignored).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When hudi-rs receives hoodie.read.output.columns (a comma-separated list
of column names from the caller's ReadSchema), it now projects the base
parquet file to only the columns required for a correct MOR merge:

  output_cols ∪ {_hoodie_commit_time, _hoodie_commit_seqno,
                 _hoodie_record_key, <ordering_field>}

Unnecessary columns (_hoodie_partition_path, _hoodie_file_name,
partition value columns, etc.) are pruned before parquet decoding.
Log file batches are also projected to the same schema before
concatenation with the base batch.

For the test table (10 cols) with ReadSchema={id,name,age}, this
reduces each parquet read from 10 → 7 columns.
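
The projection-set computation described above can be sketched as follows. This is an illustrative stand-alone sketch, not the actual hudi-rs code; the ordering field name "ts" is a hypothetical example (the commit only says EVENT_TIME_ORDERING is used):

```rust
use std::collections::BTreeSet;

/// Union the caller's output columns with the Hudi meta columns and the
/// ordering field required for a correct MOR merge (illustrative sketch).
fn mor_projection(output_cols: &[&str], ordering_field: &str) -> BTreeSet<String> {
    let mut cols: BTreeSet<String> = output_cols.iter().map(|c| c.to_string()).collect();
    // Meta columns always needed by the merge, per the commit message.
    for meta in ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key"] {
        cols.insert(meta.to_string());
    }
    // The event-time ordering field decides which record version wins.
    cols.insert(ordering_field.to_string());
    cols
}

fn main() {
    // ReadSchema = {id, name, age} from the repro test; "ts" is an assumed
    // ordering-field name for illustration.
    let cols = mor_projection(&["id", "name", "age"], "ts");
    println!("{} columns", cols.len()); // 7, down from the table's 10
}
```

Columns such as _hoodie_partition_path and _hoodie_file_name never enter the set, which is what lets the base parquet read drop from 10 to 7 columns.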

Changes:
- config/read.rs: add HudiReadConfig::OutputColumns
  key "hoodie.read.output.columns"
- storage/mod.rs: add get_parquet_file_data_with_options, delegating to
  get_parquet_file_stream (which already has projection and log support);
  fix get_parquet_file_stream to return the projected schema rather than
  the full file schema
- file_group/reader.rs: inject projection in
  read_file_slice_by_base_file_path when OutputColumns is set;
  project log batches to match base schema before concat
- tests/velox_mor_repro_tests.rs: pass hoodie.read.output.columns=id,name,age
  and assert 7-col projected batch (not 10)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xushiyan marked this pull request as draft April 2, 2026 17:59