Eng 39590/velox mor column projection repro ut #571
Draft
Davis-Zhang-Onehouse wants to merge 5 commits into apache:main from
Conversation
… bug

Adds crates/core/tests/velox_mor_repro_tests.rs, a pure Rust integration test that mirrors exactly how Velox/Gluten constructs hudi-rs components when processing a Hudi MOR split via the C++ FFI bridge. The test replicates the three-step call chain from HudiSplitReader::prepareSplit():

1. FileGroupReader::new_with_options(base_uri, []) (hudiOptions is always empty)
2. FileGroup::new_with_base_file_name + add_log_files_from_names (per partition)
3. reader.read_file_slice(&file_slice) (the MOR merge)

It uses the real test table created by dev/test-hudi-mor.sh in gluten-internal (/home/ubuntu/ws1/test-data/hudi_mor_partitioned: 4 partitions, v9 MOR, EVENT_TIME_ORDERING). The eprintln! log lines mirror the [HudiMOR] Velox log format so the two outputs can be compared side by side to prove equivalence.

Reproduces the symptom from ENG-39590: "Parquet full read (non-streaming) for '...': reading all 10 cols: [...]" appears for every partition, while "Parquet column projection for ..." never fires, confirming that no column projection is passed into hudi-rs from the Velox integration layer (the ReadSchema is ignored).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
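The symptom check described above (the full-read log line fires for every partition while the projection line never does) can be asserted mechanically over captured log output. The helper below is a hypothetical sketch of that assertion, not code from the PR; the substrings it matches are the log messages quoted in the commit message:

```rust
// Hypothetical sketch (not code from this PR): classify captured
// [HudiMOR]-style log lines to assert the ENG-39590 symptom, i.e. at
// least one "Parquet full read" line and zero "Parquet column
// projection" lines.
fn symptom_reproduced(log_lines: &[&str]) -> bool {
    let full_reads = log_lines
        .iter()
        .filter(|l| l.contains("Parquet full read"))
        .count();
    let projections = log_lines
        .iter()
        .filter(|l| l.contains("Parquet column projection"))
        .count();
    full_reads > 0 && projections == 0
}

fn main() {
    let captured = [
        "Parquet full read (non-streaming) for 'p1': reading all 10 cols",
        "Parquet full read (non-streaming) for 'p2': reading all 10 cols",
    ];
    // No projection line ever fired, so the bug is reproduced.
    assert!(symptom_reproduced(&captured));
}
```

Once the fix lands, the same check inverted (projection lines present, no full reads for projected scans) would serve as the regression assertion.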
When hudi-rs receives hoodie.read.output.columns (a comma-separated list
of column names from the caller's ReadSchema), it now projects the base
parquet file to only the columns required for a correct MOR merge:
output_cols ∪ {_hoodie_commit_time, _hoodie_commit_seqno,
_hoodie_record_key, <ordering_field>}
Unnecessary columns (_hoodie_partition_path, _hoodie_file_name,
partition value columns, etc.) are pruned before parquet decoding.
Log file batches are also projected to the same schema before
concatenation with the base batch.
For the test table (10 cols) with ReadSchema={id,name,age}, this
reduces each parquet read from 10 → 7 columns.
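The column-set computation above can be sketched in plain Rust. This is an illustrative, self-contained sketch, not the hudi-rs implementation: the three meta columns are the ones named in this PR, and `ordering_field` stands for the table's ordering/precombine field (assumed here to be `ts`):

```rust
/// Illustrative sketch: compute the columns to read from the base
/// parquet file for a correct MOR merge, i.e.
/// output_cols ∪ {_hoodie_commit_time, _hoodie_commit_seqno,
///                _hoodie_record_key, <ordering_field>}.
fn projected_columns(output_cols: &[&str], ordering_field: &str) -> Vec<String> {
    let mut cols: Vec<String> = output_cols.iter().map(|c| c.to_string()).collect();
    // Merge-required columns are appended only if not already requested.
    for required in [
        "_hoodie_commit_time",
        "_hoodie_commit_seqno",
        "_hoodie_record_key",
        ordering_field,
    ] {
        if !cols.iter().any(|c| c == required) {
            cols.push(required.to_string());
        }
    }
    cols
}

fn main() {
    // ReadSchema = {id, name, age}; ordering field assumed to be "ts".
    let cols = projected_columns(&["id", "name", "age"], "ts");
    // 3 requested + 4 merge-required = 7 columns, down from 10 in the file.
    assert_eq!(cols.len(), 7);
    println!("{} cols: {:?}", cols.len(), cols);
}
```

The deduplication step matters: if the caller's ReadSchema already includes, say, `_hoodie_record_key`, the projection must not request it twice.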
Changes:
- config/read.rs: add HudiReadConfig::OutputColumns with
  key "hoodie.read.output.columns"
- storage/mod.rs: add get_parquet_file_data_with_options, which delegates
  to get_parquet_file_stream (projection and logging already live there);
  fix get_parquet_file_stream to return the projected schema rather than
  the full file schema
- file_group/reader.rs: inject the projection in
  read_file_slice_by_base_file_path when OutputColumns is set;
  project log-file batches to match the base schema before concatenation
- tests/velox_mor_repro_tests.rs: pass hoodie.read.output.columns=id,name,age
  and assert a 7-column projected batch (not 10)
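Since the new option carries a comma-separated list, the caller-facing contract can be shown with a small parsing sketch. The key name "hoodie.read.output.columns" comes from this PR; the parser itself is hypothetical, illustrating a reasonable treatment of whitespace and empty values (an unset or empty value falls back to a full read):

```rust
// Hypothetical sketch (not the hudi-rs implementation): parse the value
// of "hoodie.read.output.columns" into a column list, returning None
// when no projection was requested.
fn parse_output_columns(value: &str) -> Option<Vec<String>> {
    let cols: Vec<String> = value
        .split(',')
        .map(|c| c.trim().to_string())
        .filter(|c| !c.is_empty())
        .collect();
    if cols.is_empty() { None } else { Some(cols) }
}

fn main() {
    // As passed by the repro test: hoodie.read.output.columns=id,name,age
    let cols = parse_output_columns("id,name,age");
    assert_eq!(cols, Some(vec!["id".into(), "name".into(), "age".into()]));
    // Unset / empty value: no projection, read all columns.
    assert_eq!(parse_output_columns(""), None);
}
```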
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>