
perf: improve datafusion integration #548

Draft
xushiyan wants to merge 15 commits into apache:main from xushiyan:df-optimize

xushiyan (Member) commented Mar 19, 2026

Description

Improve DataFusion integration performance by reducing I/O and enabling query optimization.

  • Estimate file stats (row count, byte size) from a single Parquet footer sample instead of reading every file
  • Use the metadata table for file listing and populate stats inline, removing per-file load_metadata_if_needed calls
  • Expose TableProvider::statistics() so DataFusion can make broadcast/hash join decisions
  • Cache metadata table instance via Arc<OnceCell<Table>> across clones
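The first bullet can be illustrated with a minimal sketch. It is not the code in this PR: `FooterSample`, `EstimatedStats`, and `estimate_stats` are hypothetical names, and the sketch assumes that per-file sizes are already known from the file listing and that rows-per-byte in the sampled file is representative of the others.

```rust
/// Estimated table-level statistics (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct EstimatedStats {
    num_rows: u64,
    total_byte_size: u64,
}

/// Row count and size of the single file whose Parquet footer was read
/// (hypothetical type; the real footer carries much more than this).
struct FooterSample {
    num_rows: u64,
    file_size_bytes: u64,
}

/// Extrapolate table stats from one sampled file.
/// Assumption: every file has roughly the sample's rows-per-byte density.
fn estimate_stats(sample: &FooterSample, file_sizes: &[u64]) -> Option<EstimatedStats> {
    if sample.file_size_bytes == 0 {
        return None; // cannot derive a density from an empty sample
    }
    let total_bytes: u64 = file_sizes.iter().sum();
    // rows ~= total_bytes * sample_rows / sample_bytes; u128 avoids overflow.
    let rows = (total_bytes as u128 * sample.num_rows as u128)
        / sample.file_size_bytes as u128;
    Some(EstimatedStats {
        num_rows: rows as u64,
        total_byte_size: total_bytes,
    })
}

fn main() {
    let sample = FooterSample { num_rows: 1_000_000, file_size_bytes: 50_000_000 };
    let sizes: [u64; 3] = [50_000_000, 48_000_000, 52_000_000];
    let stats = estimate_stats(&sample, &sizes).expect("non-empty sample");
    println!("{stats:?}");
}
```

The trade-off is accuracy for I/O: one footer read replaces one read per file, so the resulting numbers are estimates and would be reported to the planner as inexact rather than exact.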

How are the changes test-covered

  • Automated tests (unit and/or integration tests)
  • Manual tests
    • TPC-H SF10 and SF100 benchmarks (DataFusion vs Spark on GCS)

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 59.09091% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.43%. Comparing base (ae09718) to head (3c7b49a).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/core/src/table/fs_view.rs | 52.72% | 26 Missing ⚠️ |
| crates/core/src/table/mod.rs | 48.93% | 24 Missing ⚠️ |
| crates/datafusion/src/lib.rs | 85.71% | 2 Missing ⚠️ |
| crates/core/src/file_group/builder.rs | 90.00% | 1 Missing ⚠️ |
| crates/core/src/table/builder.rs | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #548      +/-   ##
==========================================
- Coverage   81.32%   80.43%   -0.89%     
==========================================
  Files          75       75              
  Lines        5022     5112      +90     
==========================================
+ Hits         4084     4112      +28     
- Misses        938     1000      +62     

xushiyan (Member, Author) commented Mar 22, 2026

SF100 on GCS

TPC-H Query Runtime Comparison
================================================================================

Q01  datafusion+hudi-rs |████                                     |    8088.7 ms
     spark+hudi         |████████████████████████████████████████ |   72330.8 ms

Q02  datafusion+hudi-rs |████                                     |    6342.2 ms
     spark+hudi         |███                                      |    6239.2 ms

Q03  datafusion+hudi-rs |█████                                    |    8171.8 ms
     spark+hudi         |█████████                                |   16335.0 ms

Q04  datafusion+hudi-rs |█████                                    |    8397.4 ms
     spark+hudi         |██████                                   |   11742.0 ms

Q05  datafusion+hudi-rs |████████                                 |   14317.5 ms
     spark+hudi         |█████████████████████████                |   45818.1 ms

Q06  datafusion+hudi-rs |█████                                    |    9707.9 ms
     spark+hudi         |████                                     |    7106.6 ms

Q07  datafusion+hudi-rs |████████████████████                     |   36458.1 ms
     spark+hudi         |████████                                 |   13741.2 ms

Q08  datafusion+hudi-rs |███████                                  |   13088.1 ms
     spark+hudi         |███████                                  |   13399.6 ms

Q09  datafusion+hudi-rs |█████████████                            |   23797.2 ms
     spark+hudi         |██████████████                           |   25558.0 ms

Q10  datafusion+hudi-rs |██████                                   |   10642.9 ms
     spark+hudi         |██████████                               |   17589.4 ms

Q11  datafusion+hudi-rs |███                                      |    5512.9 ms
     spark+hudi         |███                                      |    4696.0 ms

Q12  datafusion+hudi-rs |█████                                    |    9857.6 ms
     spark+hudi         |██████                                   |    9990.7 ms

Q13  datafusion+hudi-rs |██                                       |    3617.3 ms
     spark+hudi         |███████████                              |   19220.5 ms

Q14  datafusion+hudi-rs |███                                      |    4938.1 ms
     spark+hudi         |█████                                    |    8191.5 ms

Q15  datafusion+hudi-rs |█████                                    |    9376.0 ms
     spark+hudi         |██████████                               |   17637.7 ms

Q16  datafusion+hudi-rs |██                                       |    3270.9 ms
     spark+hudi         |███                                      |    6211.2 ms

Q17  datafusion+hudi-rs |██████                                   |   10998.5 ms
     spark+hudi         |█████████████████████                    |   38783.5 ms

Q18  datafusion+hudi-rs |████████████████████                     |   36529.1 ms
     spark+hudi         |██████████████████████████               |   47237.0 ms

Q19  datafusion+hudi-rs |████                                     |    7228.1 ms
     spark+hudi         |██████                                   |    9953.5 ms

Q20  datafusion+hudi-rs |████                                     |    7368.9 ms
     spark+hudi         |█████                                    |    9662.2 ms

Q21  datafusion+hudi-rs |████████████                             |   22132.1 ms
     spark+hudi         |████████████████████████                 |   44257.4 ms

Q22  datafusion+hudi-rs |█                                        |    2374.2 ms
     spark+hudi         |█████                                    |    8169.2 ms

Summary
--------------------------------------------------------------------------------

Tot  datafusion+hudi-rs |███████████████████████                  |  262215.4 ms
     spark+hudi         |████████████████████████████████████████ |  453870.4 ms

Geo  datafusion+hudi-rs |████████████████████████                 |    9290.2 ms
     spark+hudi         |████████████████████████████████████████ |   15327.1 ms
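
To read the summary rows in relative terms, the ratios can be computed directly from the reported numbers. The sketch below is only a convenience calculation, not part of the benchmark harness; `speedup` and `geometric_mean` are hypothetical helper names.

```rust
/// Speedup of a candidate over a baseline, as baseline time / candidate time.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

/// Geometric mean of strictly positive runtimes, computed in log space.
/// Returns None for empty input or non-positive values.
fn geometric_mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() || values.iter().any(|v| *v <= 0.0) {
        return None;
    }
    let ln_sum: f64 = values.iter().map(|v| v.ln()).sum();
    Some((ln_sum / values.len() as f64).exp())
}

fn main() {
    // Values copied from the "Tot" and "Geo" rows of the summary.
    let total = speedup(453870.4, 262215.4);
    let geo = speedup(15327.1, 9290.2);
    println!("total: {total:.2}x, geomean: {geo:.2}x");

    // geometric_mean would take the 22 per-query runtimes of one engine;
    // a two-element input is used here only to show the call.
    println!("{:?}", geometric_mean(&[1.0, 4.0]));
}
```

On these numbers, datafusion+hudi-rs comes out at roughly 1.73x faster on total runtime and 1.65x on the geometric mean, while Q02, Q06, Q07, and Q11 are individually slower than spark+hudi.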

@xushiyan xushiyan linked an issue Mar 23, 2026 that may be closed by this pull request
@xushiyan xushiyan added this to the release-0.5.0 milestone Mar 23, 2026

Development

Successfully merging this pull request may close these issues.

Supply table statistics for Hudi DataFusion integration
