Skip to content

Conversation

@2010YOUY01
Copy link
Contributor

@2010YOUY01 2010YOUY01 commented Jan 2, 2026

Which issue does this PR close?

The initial work of #19487

Rationale for this change

See the issue for the rationale, and design considerations.

For PR structure, start with datafusion/physical-expr-common/src/physical_expr/pruning.rs 's module-level comment, and follow along.

What changes are included in this PR?

The core change in this PR is around a small few hundreds LoC from estimation, the PR diff is mainly tests and docs.

  • Defined core APIs/data structures for stat-propagation-based predicate pruning
  • Implemented statistics pruning on:
    • Literals (like 3)
    • Column references (like c1)
    • Comparison operators >, <, =, >=, <=

And we now support pruning for expressions like

  • c1 > 1
  • c1 >= c2

The issue also includes some thoughts on future implementation plans.

Are these changes tested?

UTs

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Jan 2, 2026
@2010YOUY01 2010YOUY01 changed the title feat: Predicate pruning via statistics propagation feat: Prune complex/nested predicates via statistics propagation Jan 2, 2026
Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments. Need to read a couple more times to actually wrap my had around how it's working.

Is the plan to make multiple subsequent PRs to add more handling e.g. for Like expressions, UDFs, etc. and then eventually once we reach feature parity replace the current system?

/// pruning effectiveness.
#[derive(Debug, Clone)]
pub struct PruningResults {
results: Option<BooleanArray>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment here explaining the meaning of each one of the 3 boolean states would be helpful, maybe linking to PruningOutcome

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put them all in the struct comment of PruningResults, will also add links in the fields and other related structs

Comment on lines +117 to +126
pub fn new(array: Option<BooleanArray>, num_containers: usize) -> Self {
debug_assert_eq!(
array.as_ref().map(|a| a.len()).unwrap_or(num_containers),
num_containers
);
Self {
results: array,
num_containers,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this be called with (None, 123)? It seems like that usage is only ever internal from pub fn none(). I would make the public constructor new(array: BooleanArrray) and make none() initialize the struct itself.

pub fn is_empty(&self) -> bool {
self.len() == 0
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an iter_results(&self) -> impl Iterator<Item = PruningOutcome>?

Comment on lines +174 to +179
impl From<BooleanArray> for PruningResults {
fn from(array: BooleanArray) -> Self {
let len = array.len();
PruningResults::new(Some(array), len)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it might be a bit magic and an explicit constructor is better

Comment on lines +251 to +252
pub range_stats: Option<RangeStats>,
pub null_stats: Option<NullStats>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to point out that if null stats or missing (NullPresence::UnknownOrMixed) we cannot make any inferences from the min/max values, they should be treated as missing as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual inference logic is more aggressive than the algorithm you have described, it's implemented in https://github.com/apache/datafusion/pull/19609/changes#diff-32f7f18dcd86a268e7e1e0134eae6ae002bd42e61180cfabd60944566b10f6d8R660

I'll add more comments here also.

///
/// # Errors
/// Returns Internal Error if unsupported operator is provided.
fn compare_ranges(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some unit tests for this method specifically that ensure 100% coverage would be great

@2010YOUY01
Copy link
Contributor Author

2010YOUY01 commented Jan 3, 2026

Thank you for the review, those feedbacks make sense to me, I'll batch them later

Some initial comments. Need to read a couple more times to actually wrap my had around how it's working.

Please let me know if anything is unclear. I’m trying to make both the implementation and the documentation clearer, but the logic and edge cases for this feature are admittedly quite tricky.

Is the plan to make multiple subsequent PRs to add more handling e.g. for Like expressions, UDFs, etc. and then eventually once we reach feature parity replace the current system?

Yes — the initial milestone should be reaching coverage equivalent to the existing PruningPredicate implementation, so we can reuse the existing tests and gain more confidence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants