
ENG-39590/velox-mor-column-projection-repro-ut #571

Draft

Davis-Zhang-Onehouse wants to merge 5 commits into apache:main from vinishjail97:ENG-39590/velox-mor-column-projection-repro-ut

Conversation

@Davis-Zhang-Onehouse

Description

How are the changes test-covered

  • N/A
  • Automated tests (unit and/or integration tests)
  • Manual tests
    • Details are described below

vinishjail97 and others added 4 commits March 25, 2026 23:45
… bug

Adds crates/core/tests/velox_mor_repro_tests.rs — a pure Rust integration
test that mirrors exactly how Velox/Gluten constructs hudi-rs components
when processing a Hudi MOR split via the C++ FFI bridge.

The test replicates the three-step call chain from HudiSplitReader::prepareSplit():
  1. FileGroupReader::new_with_options(base_uri, [])   (hudiOptions always empty)
  2. FileGroup::new_with_base_file_name + add_log_files_from_names  (per partition)
  3. reader.read_file_slice(&file_slice)               (MOR merge)

Uses the real test table created by dev/test-hudi-mor.sh in gluten-internal
(/home/ubuntu/ws1/test-data/hudi_mor_partitioned, 4 partitions, v9 MOR,
EVENT_TIME_ORDERING). eprintln! log lines mirror the [HudiMOR] Velox log
format so both outputs can be compared side-by-side to prove equivalence.

Reproduces the symptom from ENG-39590:
  "Parquet full read (non-streaming) for '...': reading all 10 cols: [...]"
appears for every partition, while
  "Parquet column projection for ..."
never fires — confirming that no column projection is passed into hudi-rs
from the Velox integration layer (ReadSchema is ignored).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When hudi-rs receives hoodie.read.output.columns (a comma-separated list
of column names from the caller's ReadSchema), it now projects the base
parquet file to only the columns required for a correct MOR merge:

  output_cols ∪ {_hoodie_commit_time, _hoodie_commit_seqno,
                 _hoodie_record_key, <ordering_field>}

Unnecessary columns (_hoodie_partition_path, _hoodie_file_name,
partition value columns, etc.) are pruned before parquet decoding.
Log file batches are also projected to the same schema before
concatenation with the base batch.

For the test table (10 cols) with ReadSchema={id,name,age}, this
reduces each parquet read from 10 → 7 columns.
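
The projection-set computation described above can be sketched as follows. This is an illustrative stand-alone sketch, not the actual hudi-rs code; the ordering field name "ts" is a hypothetical example (the commit only says EVENT_TIME_ORDERING is used):

```rust
use std::collections::BTreeSet;

/// Union the caller's output columns with the Hudi meta columns and the
/// ordering field required for a correct MOR merge (illustrative sketch).
fn mor_projection(output_cols: &[&str], ordering_field: &str) -> BTreeSet<String> {
    let mut cols: BTreeSet<String> = output_cols.iter().map(|c| c.to_string()).collect();
    // Meta columns always needed by the merge, per the commit message.
    for meta in ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key"] {
        cols.insert(meta.to_string());
    }
    // The event-time ordering field decides which record version wins.
    cols.insert(ordering_field.to_string());
    cols
}

fn main() {
    // ReadSchema = {id, name, age} from the repro test; "ts" is an assumed
    // ordering-field name for illustration.
    let cols = mor_projection(&["id", "name", "age"], "ts");
    println!("{} columns", cols.len()); // 7, down from the table's 10
}
```

Columns such as _hoodie_partition_path and _hoodie_file_name never enter the set, which is what lets the base parquet read drop from 10 to 7 columns.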

Changes:
- config/read.rs: add HudiReadConfig::OutputColumns
  key "hoodie.read.output.columns"
- storage/mod.rs: add get_parquet_file_data_with_options, delegating to
  get_parquet_file_stream (which already has projection and log support);
  fix get_parquet_file_stream to return the projected schema rather than
  the full file schema
- file_group/reader.rs: inject projection in
  read_file_slice_by_base_file_path when OutputColumns is set;
  project log batches to match base schema before concat
- tests/velox_mor_repro_tests.rs: pass hoodie.read.output.columns=id,name,age
  and assert 7-col projected batch (not 10)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xushiyan marked this pull request as draft April 2, 2026 17:59