@xushiyan xushiyan commented Jan 5, 2026

Summary

  • Add projection and row predicate support in streaming reads
  • Improve DataFusion filter pushdown with AND/BETWEEN/OR handling
  • Add V8 table tests and reorganize DataFusion E2E tests

Changes

Core Streaming (crates/core/)

  • Projection pushdown: Column names converted to indices in parquet reader
  • Row predicate support: Now applied in streaming reads (was documented as "not implemented")
  • RowPredicate: Changed from Box to Arc for async cloning
  • New error: InvalidColumn for missing column in projection
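The name-to-index conversion described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: `ReadError`, `projection_indices`, and the slice-of-names schema are hypothetical stand-ins, and only the `InvalidColumn` variant name comes from the description.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the PR only states that an
/// `InvalidColumn` error was added for a missing projected column.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    InvalidColumn(String),
}

impl fmt::Display for ReadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReadError::InvalidColumn(name) => {
                write!(f, "column '{name}' not found in file schema")
            }
        }
    }
}

/// Resolve projected column names to positions in the file schema.
/// The schema is modeled as a plain list of field names (an assumption);
/// the output preserves the order of the requested projection.
pub fn projection_indices(
    schema: &[&str],
    projection: &[&str],
) -> Result<Vec<usize>, ReadError> {
    projection
        .iter()
        .map(|name| {
            schema
                .iter()
                .position(|field| field == name)
                .ok_or_else(|| ReadError::InvalidColumn((*name).to_string()))
        })
        .collect()
}
```

Resolving names against the schema the reader actually opened, rather than against a schema held by the caller, is what keeps the indices consistent with the file being read.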

DataFusion (crates/datafusion/)

  • Filter pushdown improvements:
    • AND expressions: flattened recursively
    • BETWEEN: converted to >= low AND <= high
    • OR: partial extraction (safe for partition pruning)
  • Partition schema caching: Cached at construction for sync supports_filters_pushdown
  • Exact vs Inexact: Partition column filters return Exact, others return Inexact
  • Test reorganization: E2E tests moved to tests/read_tests.rs
  • V8 tests: 3 new tests for V8 table format
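As a rough illustration of the AND flattening and BETWEEN rewrite listed above, the sketch below uses a small stand-in expression type rather than DataFusion's own `Expr`. All names here (`Expr`, `Op`, `Filter`, `to_filters`) are hypothetical. OR is deliberately left unhandled, since the description does not spell out the partial-extraction rule.

```rust
/// Stand-in expression tree for this sketch (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Cmp { column: String, op: Op, value: i64 },
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    Between { column: String, low: i64, high: i64 },
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op {
    Eq,
    GtEq,
    LtEq,
}

/// One flat conjunct: (column, operator, literal).
pub type Filter = (String, Op, i64);

/// Convert an expression into a list of conjuncts that must all hold.
pub fn to_filters(expr: &Expr) -> Vec<Filter> {
    match expr {
        Expr::Cmp { column, op, value } => vec![(column.clone(), *op, *value)],
        // AND: flatten both sides recursively into one list.
        Expr::And(left, right) => {
            let mut out = to_filters(left);
            out.extend(to_filters(right));
            out
        }
        // BETWEEN low AND high becomes `>= low` and `<= high`.
        Expr::Between { column, low, high } => vec![
            (column.clone(), Op::GtEq, *low),
            (column.clone(), Op::LtEq, *high),
        ],
        // OR is not a conjunction. This sketch extracts nothing from it;
        // the PR's partial extraction for pruning is not modeled here.
        Expr::Or(_, _) => Vec::new(),
    }
}
```

Returning an empty list for an unhandled expression is safe only if the engine still evaluates the original filter after the scan, which is what an Inexact pushdown classification implies.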

Test plan

  • Core streaming tests: projection, row predicate, combined, invalid column
  • DataFusion unit tests: filter pushdown classification
  • DataFusion E2E tests: V6 + V8 tables with plan verification

Copilot AI review requested due to automatic review settings January 5, 2026 00:05
Copilot AI left a comment

Pull request overview

This PR enhances the DataFusion integration with several key improvements: schema caching at table construction to eliminate async workarounds, extended filter pushdown support for AND/BETWEEN expressions, intelligent classification of filters as Exact vs Inexact based on partition columns, and improved projection reliability by moving column name-to-index conversion inside the parquet reader.

Key Changes:

  • Schema and partition schema are now cached at HudiDataSource construction time, removing the need for thread spawning workarounds in the synchronous schema() method
  • Filter pushdown now supports AND compound expressions (flattening both sides) and BETWEEN expressions (converting to >= AND <=), with OR expressions explicitly unsupported
  • Partition column filters are marked as Exact (fully handled by pruning) while non-partition filters remain Inexact (require post-filtering)
  • Projection now uses column names converted to indices by the parquet reader itself, ensuring schema consistency and fixing potential mismatches
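The Exact versus Inexact split, together with caching the partition columns at construction so that classification can stay synchronous, might look roughly like the sketch below. `Source`, `Pushdown`, and `classify` are hypothetical names for illustration; the variant names mirror those of DataFusion's `TableProviderFilterPushDown`, but this is not the PR's implementation.

```rust
use std::collections::HashSet;

/// Stand-in for the pushdown classification (names mirror DataFusion's
/// `TableProviderFilterPushDown` variants; the type itself is hypothetical).
#[derive(Debug, PartialEq)]
pub enum Pushdown {
    Exact,
    Inexact,
}

/// Hypothetical data source that caches its partition columns up front,
/// so classification needs no async schema lookup later.
pub struct Source {
    partition_columns: HashSet<String>,
}

impl Source {
    pub fn new(partition_columns: &[&str]) -> Self {
        Self {
            partition_columns: partition_columns
                .iter()
                .map(|c| c.to_string())
                .collect(),
        }
    }

    /// A filter is Exact only if every column it references is a partition
    /// column, so pruning alone fully satisfies it. Anything else is Inexact
    /// and must be re-checked by the engine after the scan.
    pub fn classify(&self, filter_columns: &[&str]) -> Pushdown {
        let all_partition = !filter_columns.is_empty()
            && filter_columns
                .iter()
                .all(|c| self.partition_columns.contains(*c));
        if all_partition {
            Pushdown::Exact
        } else {
            Pushdown::Inexact
        }
    }
}
```

Treating a filter with no column references as Inexact is a conservative choice made for this sketch, not something the source specifies.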

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| crates/datafusion/src/util/expr.rs | Adds recursive filter conversion supporting AND/BETWEEN expressions, with comprehensive test coverage for new expression types |
| crates/datafusion/src/lib.rs | Implements schema caching, custom Debug impl, extended filter pushdown logic, and Exact vs Inexact classification for partition filters |
| crates/core/src/table/read_options.rs | Changes RowPredicate from Box to Arc for clonability in async contexts, updates documentation to reflect full streaming support |
| crates/core/src/table/mod.rs | Propagates row_predicate option to file slice reads (previously ignored) |
| crates/core/src/storage/mod.rs | Adds projection_columns field for name-based projection, implements name-to-index conversion using parquet reader schema |
| crates/core/src/file_group/reader.rs | Implements projection pushdown and row predicate filtering in streaming reads, adds apply_row_predicate helper function |
| crates/core/tests/table_read_tests.rs | Adds comprehensive tests for projection, row predicates, combined features, and invalid column error handling |
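To illustrate why moving `RowPredicate` from `Box` to `Arc` helps, the sketch below shares one predicate across several workers by cloning the `Arc`. It is an assumption-laden stand-in: rows are modeled as `Vec<i64>` rather than Arrow record batches, OS threads substitute for async tasks, `read_batches` is a hypothetical helper, and the signature of `apply_row_predicate` is guessed rather than taken from the PR.

```rust
use std::sync::Arc;
use std::thread;

/// Simplified row model for this sketch (the real reader works on batches).
pub type Row = Vec<i64>;

/// A shareable predicate. `Arc` makes cloning a cheap reference-count bump,
/// whereas a `Box<dyn Fn>` cannot be cloned at all.
pub type RowPredicate = Arc<dyn Fn(&Row) -> bool + Send + Sync>;

/// Keep only the rows that satisfy the predicate (assumed signature).
pub fn apply_row_predicate(batch: Vec<Row>, predicate: &RowPredicate) -> Vec<Row> {
    batch.into_iter().filter(|row| predicate(row)).collect()
}

/// Hypothetical helper: filter each batch on its own worker, handing every
/// worker its own clone of the same predicate.
pub fn read_batches(batches: Vec<Vec<Row>>, predicate: RowPredicate) -> Vec<Row> {
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let p = Arc::clone(&predicate);
            thread::spawn(move || apply_row_predicate(batch, &p))
        })
        .collect();

    // Join in submission order so the output order is deterministic.
    handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

The same cloning pattern applies when each stream or task must own its captured state, which is the usual reason a boxed closure becomes awkward in async code.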

@xushiyan xushiyan changed the title from "feat(datafusion): improve schema caching, filter pushdown, and projection reliability" to "feat: support column project and row filter pushdown" Jan 5, 2026
@codecov
codecov bot commented Jan 5, 2026

Codecov Report

❌ Patch coverage is 80.76923% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.08%. Comparing base (a20771d) to head (2aba9df).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/datafusion/src/lib.rs | 75.00% | 14 Missing ⚠️ |
| crates/datafusion/src/util/expr.rs | 64.00% | 9 Missing ⚠️ |
| crates/core/src/file_group/reader.rs | 65.00% | 7 Missing ⚠️ |
| crates/datafusion/tests/read_tests.rs | 95.16% | 3 Missing ⚠️ |
| crates/core/src/storage/mod.rs | 93.33% | 1 Missing ⚠️ |
| crates/core/src/table/mod.rs | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #510      +/-   ##
==========================================
- Coverage   85.13%   85.08%   -0.05%     
==========================================
  Files          66       67       +1     
  Lines        4057     4218     +161     
==========================================
+ Hits         3454     3589     +135     
- Misses        603      629      +26     

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

@zhangyue19921010 zhangyue19921010 left a comment

LGTM~

@xushiyan xushiyan merged commit 98a8bc0 into apache:main Jan 5, 2026
20 checks passed
@xushiyan xushiyan deleted the projection-pushdown branch January 5, 2026 15:46
2 participants