feat: support column project and row filter pushdown #510
Pull request overview
This PR enhances the DataFusion integration with several key improvements: schema caching at table construction to eliminate async workarounds, extended filter pushdown support for AND/BETWEEN expressions, intelligent classification of filters as Exact vs Inexact based on partition columns, and improved projection reliability by moving column name-to-index conversion inside the parquet reader.
Key Changes:
- Schema and partition schema are now cached at `HudiDataSource` construction time, removing the need for thread-spawning workarounds in the synchronous `schema()` method
- Filter pushdown now supports AND compound expressions (flattening both sides) and BETWEEN expressions (converting to >= AND <=), with OR expressions explicitly unsupported
- Partition column filters are marked as `Exact` (fully handled by pruning) while non-partition filters remain `Inexact` (require post-filtering)
- Projection now uses column names converted to indices by the parquet reader itself, ensuring schema consistency and fixing potential mismatches
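The filter handling described above can be sketched without pulling in DataFusion. This is a minimal illustration under stated assumptions, not the PR's code: `Expr`, `SimpleFilter`, `PushDown`, `to_filters`, and `classify` are hypothetical stand-ins for DataFusion's expression and pushdown types, literals are restricted to `i64`, and the choice to treat a mixed partition/non-partition conjunction as `Inexact` is an assumption.

```rust
// Hypothetical, simplified stand-ins for DataFusion's expression and
// filter-pushdown types; the real types have many more variants.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// `column <op> literal`, e.g. `price >= 10`
    Compare { column: String, op: &'static str, value: i64 },
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    Between { column: String, low: i64, high: i64 },
}

#[derive(Debug, Clone, PartialEq)]
struct SimpleFilter {
    column: String,
    op: &'static str,
    value: i64,
}

#[derive(Debug, PartialEq)]
enum PushDown {
    Exact,
    Inexact,
    Unsupported,
}

/// Recursively flattens an expression into simple filters.
/// Returns `None` if any part of the expression cannot be pushed down.
fn to_filters(expr: &Expr) -> Option<Vec<SimpleFilter>> {
    match expr {
        Expr::Compare { column, op, value } => Some(vec![SimpleFilter {
            column: column.clone(),
            op: *op,
            value: *value,
        }]),
        // AND: flatten both sides into a single list of filters.
        Expr::And(left, right) => {
            let mut filters = to_filters(left)?;
            filters.extend(to_filters(right)?);
            Some(filters)
        }
        // BETWEEN: rewrite as `column >= low AND column <= high`.
        Expr::Between { column, low, high } => Some(vec![
            SimpleFilter { column: column.clone(), op: ">=", value: *low },
            SimpleFilter { column: column.clone(), op: "<=", value: *high },
        ]),
        // OR is explicitly unsupported.
        Expr::Or(_, _) => None,
    }
}

/// Classifies a filter: `Exact` when every flattened filter targets a
/// partition column, `Inexact` otherwise (assumption for mixed conjunctions).
fn classify(expr: &Expr, partition_columns: &[&str]) -> PushDown {
    match to_filters(expr) {
        None => PushDown::Unsupported,
        Some(filters)
            if filters
                .iter()
                .all(|f| partition_columns.contains(&f.column.as_str())) =>
        {
            PushDown::Exact
        }
        Some(_) => PushDown::Inexact,
    }
}
```

In this sketch, a `Between` on a partition column classifies as `Exact`, the same expression on a data column classifies as `Inexact`, and anything containing an `Or` is reported as `Unsupported` so the engine keeps evaluating it itself.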
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `crates/datafusion/src/util/expr.rs` | Adds recursive filter conversion supporting AND/BETWEEN expressions, with comprehensive test coverage for new expression types |
| `crates/datafusion/src/lib.rs` | Implements schema caching, custom Debug impl, extended filter pushdown logic, and Exact vs Inexact classification for partition filters |
| `crates/core/src/table/read_options.rs` | Changes RowPredicate from Box to Arc for clonability in async contexts, updates documentation to reflect full streaming support |
| `crates/core/src/table/mod.rs` | Propagates row_predicate option to file slice reads (previously ignored) |
| `crates/core/src/storage/mod.rs` | Adds projection_columns field for name-based projection, implements name-to-index conversion using parquet reader schema |
| `crates/core/src/file_group/reader.rs` | Implements projection pushdown and row predicate filtering in streaming reads, adds apply_row_predicate helper function |
| `crates/core/tests/table_read_tests.rs` | Adds comprehensive tests for projection, row predicates, combined features, and invalid column error handling |
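Two of the core changes listed above, name-to-index projection and the `Arc`-based row predicate, can be illustrated with a small standard-library-only sketch. Everything here is a hypothetical stand-in rather than the crate's actual API: `ReadError`, `resolve_projection`, `Row`, and `filter_rows` are invented for illustration, the `RowPredicate` alias only assumes a closure shape, and rows are modeled as `Vec<i64>` instead of Arrow record batches.

```rust
use std::sync::Arc;

// Hypothetical error type; the real crate's error enum may differ.
#[derive(Debug, PartialEq)]
enum ReadError {
    InvalidColumn(String),
}

/// Resolves projected column names to indices against the file's own
/// schema, so the indices always match the file actually being read.
fn resolve_projection(
    file_columns: &[&str],
    projection: &[&str],
) -> Result<Vec<usize>, ReadError> {
    projection
        .iter()
        .map(|name| {
            file_columns
                .iter()
                .position(|column| column == name)
                .ok_or_else(|| ReadError::InvalidColumn((*name).to_string()))
        })
        .collect()
}

// Assumed shapes: a row is a vector of integers, and a predicate is a
// shareable closure. `Arc` (unlike `Box`) can be cloned cheaply, so the
// same predicate can be handed to several async read tasks.
type Row = Vec<i64>;
type RowPredicate = Arc<dyn Fn(&Row) -> bool + Send + Sync>;

/// Keeps only the rows accepted by the predicate.
fn filter_rows(rows: Vec<Row>, predicate: &RowPredicate) -> Vec<Row> {
    rows.into_iter().filter(|row| predicate(row)).collect()
}
```

Resolving names inside the reader means a caller never has to know the physical column order of a given file, and an unknown name surfaces as an explicit `InvalidColumn` error instead of a silently wrong index.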
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

| Coverage Diff | main | #510 | +/- |
|---|---|---|---|
| Coverage | 85.13% | 85.08% | -0.05% |
| Files | 66 | 67 | +1 |
| Lines | 4057 | 4218 | +161 |
| Hits | 3454 | 3589 | +135 |
| Misses | 603 | 629 | +26 |
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
zhangyue19921010
left a comment
LGTM~
Summary
Changes
Core Streaming (`crates/core/`)
- `RowPredicate`: Changed from `Box` to `Arc` for async cloning
- `InvalidColumn` error for missing column in projection

DataFusion (`crates/datafusion/`)
- BETWEEN expressions are converted to `>= low AND <= high`
- `supports_filters_pushdown`: partition column filters return `Exact`, others return `Inexact`
- `tests/read_tests.rs`

Test plan