-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: Prune complex/nested predicates via statistics propagation #19609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments. Need to read a couple more times to actually wrap my had around how it's working.
Is the plan to make multiple subsequent PRs to add more handling e.g. for Like expressions, UDFs, etc. and then eventually once we reach feature parity replace the current system?
| /// pruning effectiveness. | ||
| #[derive(Debug, Clone)] | ||
| pub struct PruningResults { | ||
| results: Option<BooleanArray>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment here explaining the meaning of each one of the 3 boolean states would be helpful, maybe linking to PruningOutcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put them all in the struct comment of PruningResults, will also add links in the fields and other related structs
| pub fn new(array: Option<BooleanArray>, num_containers: usize) -> Self { | ||
| debug_assert_eq!( | ||
| array.as_ref().map(|a| a.len()).unwrap_or(num_containers), | ||
| num_containers | ||
| ); | ||
| Self { | ||
| results: array, | ||
| num_containers, | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would this be called with (None, 123)? It seems like that usage is only ever internal from pub fn none(). I would make the public constructor new(array: BooleanArrray) and make none() initialize the struct itself.
| pub fn is_empty(&self) -> bool { | ||
| self.len() == 0 | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe an iter_results(&self) -> impl Iterator<Item = PruningOutcome>?
| impl From<BooleanArray> for PruningResults { | ||
| fn from(array: BooleanArray) -> Self { | ||
| let len = array.len(); | ||
| PruningResults::new(Some(array), len) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it might be a bit magic and an explicit constructor is better
| pub range_stats: Option<RangeStats>, | ||
| pub null_stats: Option<NullStats>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important to point out that if null stats or missing (NullPresence::UnknownOrMixed) we cannot make any inferences from the min/max values, they should be treated as missing as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual inference logic is more aggressive than the algorithm you have described, it's implemented in https://github.com/apache/datafusion/pull/19609/changes#diff-32f7f18dcd86a268e7e1e0134eae6ae002bd42e61180cfabd60944566b10f6d8R660
I'll add more comments here also.
| /// | ||
| /// # Errors | ||
| /// Returns Internal Error if unsupported operator is provided. | ||
| fn compare_ranges( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some unit tests for this method specifically that ensure 100% coverage would be great
|
Thank you for the review, those feedbacks make sense to me, I'll batch them later
Please let me know if anything is unclear. I’m trying to make both the implementation and the documentation clearer, but the logic and edge cases for this feature are admittedly quite tricky.
Yes — the initial milestone should be reaching coverage equivalent to the existing |
Which issue does this PR close?
The initial work of #19487
Rationale for this change
See the issue for the rationale, and design considerations.
For PR structure, start with datafusion/physical-expr-common/src/physical_expr/pruning.rs 's module-level comment, and follow along.
What changes are included in this PR?
The core change in this PR is around a small few hundreds LoC from estimation, the PR diff is mainly tests and docs.
And we now support pruning for expressions like
The issue also includes some thoughts on future implementation plans.
Are these changes tested?
UTs
Are there any user-facing changes?
No