feat: Prune complex/nested predicates via statistics propagation #19609


Open

2010YOUY01 wants to merge 1 commit into apache:main from 2010YOUY01:stat-propagation-pruning

+2,413 −4

Contributor

2010YOUY01 commented Jan 2, 2026 •


Which issue does this PR close?

The initial work of #19487

Rationale for this change

See the issue for the rationale, and design considerations.

To follow the PR structure, start with the module-level comment in datafusion/physical-expr-common/src/physical_expr/pruning.rs and read on from there.

What changes are included in this PR?

The core change in this PR is small, roughly a few hundred LoC by estimate; most of the PR diff is tests and docs.

Defined core APIs/data structures for stat-propagation-based predicate pruning
Implemented statistics pruning on:
- Literals (like 3)
- Column references (like c1)
- Comparison operators >, <, =, >=, <=

And we now support pruning for expressions like

c1 > 1
c1 >= c2
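To make the idea concrete, here is a minimal sketch of how min/max ranges could be compared to decide `c1 > 1` or `c1 >= c2` for a single container. It is not the code from this PR: the `Outcome` and `Range` types and the `gt`/`gt_eq` helpers are hypothetical names chosen for illustration, values are restricted to `i64`, and nulls are assumed to be absent.

```rust
/// Hypothetical three-state result for one container (e.g. a file or row group).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    /// The predicate holds for every row in the container.
    AlwaysTrue,
    /// The predicate holds for no row, so the container can be skipped.
    AlwaysFalse,
    /// The statistics alone cannot decide.
    Unknown,
}

/// Hypothetical inclusive [min, max] bounds of an expression within one container.
/// Assumption: no nulls are present.
#[derive(Debug, Clone, Copy)]
struct Range {
    min: i64,
    max: i64,
}

impl Range {
    /// A literal propagates as a single-point range.
    fn literal(v: i64) -> Self {
        Range { min: v, max: v }
    }
}

/// Evaluate `lhs > rhs` using only the bounds of each side.
fn gt(lhs: Range, rhs: Range) -> Outcome {
    if lhs.min > rhs.max {
        Outcome::AlwaysTrue
    } else if lhs.max <= rhs.min {
        Outcome::AlwaysFalse
    } else {
        Outcome::Unknown
    }
}

/// Evaluate `lhs >= rhs` using only the bounds of each side.
fn gt_eq(lhs: Range, rhs: Range) -> Outcome {
    if lhs.min >= rhs.max {
        Outcome::AlwaysTrue
    } else if lhs.max < rhs.min {
        Outcome::AlwaysFalse
    } else {
        Outcome::Unknown
    }
}
```

For `c1 > 1`, the column's range is compared against `Range::literal(1)`; for `c1 >= c2`, both sides are column ranges. Because each sub-expression yields a range, the same comparison can in principle be applied to nested expressions once they also produce ranges.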

The issue also includes some thoughts on future implementation plans.

Are these changes tested?

Yes, with unit tests (UTs).

Are there any user-facing changes?

No


          Statistics propagation based predicate pruning

3cc4e8c

github-actions bot added the physical-expr label

2010YOUY01 mentioned this pull request

Proposal: Prune complex predicates by propagating column statistics #19487

Open

2010YOUY01 changed the title ~~feat: Predicate pruning via statistics propagation~~ feat: Prune complex/nested predicates via statistics propagation

adriangb reviewed


Contributor

adriangb left a comment

Some initial comments. I need to read it a couple more times to actually wrap my head around how it's working.

Is the plan to make multiple subsequent PRs to add more handling e.g. for Like expressions, UDFs, etc. and then eventually once we reach feature parity replace the current system?

datafusion/physical-expr-common/src/physical_expr/pruning.rs

    
    /// pruning effectiveness.
    #[derive(Debug, Clone)]
    pub struct PruningResults {
        results: Option<BooleanArray>,

Contributor

adriangb Jan 2, 2026

A comment here explaining the meaning of each one of the 3 boolean states would be helpful, maybe linking to PruningOutcome

Contributor Author

2010YOUY01 Jan 3, 2026

I put them all in the struct comment of PruningResults; I will also add links from the fields and other related structs.

datafusion/physical-expr-common/src/physical_expr/pruning.rs

Comment on lines +117 to +126

    
    pub fn new(array: Option<BooleanArray>, num_containers: usize) -> Self {
        debug_assert_eq!(
            array.as_ref().map(|a| a.len()).unwrap_or(num_containers),
            num_containers
        );
        Self {
            results: array,
            num_containers,
        }
    }

Contributor

adriangb Jan 2, 2026

When would this be called with (None, 123)? It seems like that usage is only ever internal, from pub fn none(). I would make the public constructor new(array: BooleanArray) and have none() initialize the struct itself.
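A minimal sketch of the constructor split suggested above, under stated assumptions: `BoolArray` is a dependency-free stand-in for arrow's `BooleanArray`, the field names are taken from the snippet under review, and the `len` accessor is assumed. This is an illustration of the suggestion, not the PR's final code.

```rust
/// Stand-in for arrow's `BooleanArray` (nullable booleans), used only to
/// keep this sketch free of external dependencies.
type BoolArray = Vec<Option<bool>>;

#[derive(Debug, Clone)]
pub struct PruningResults {
    results: Option<BoolArray>,
    num_containers: usize,
}

impl PruningResults {
    /// Public constructor: the container count is derived from the array,
    /// so the two can never disagree and no debug assertion is needed.
    pub fn new(array: BoolArray) -> Self {
        let num_containers = array.len();
        Self {
            results: Some(array),
            num_containers,
        }
    }

    /// Builds the "no per-container array" case directly, without going
    /// through `new`.
    pub fn none(num_containers: usize) -> Self {
        Self {
            results: None,
            num_containers,
        }
    }

    /// Number of containers these results describe.
    pub fn len(&self) -> usize {
        self.num_containers
    }

    /// True when there are no containers.
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }
}
```

With this shape, a call such as `new(None, 123)` cannot be written at all, which removes the mismatched-length case that the `debug_assert_eq!` guards against.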

datafusion/physical-expr-common/src/physical_expr/pruning.rs

    
        pub fn is_empty(&self) -> bool {
            self.len() == 0
        }
    }

Contributor

adriangb Jan 2, 2026

Maybe an iter_results(&self) -> impl Iterator<Item = PruningOutcome>?
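One way such an iterator might look is sketched below. Everything about the mapping is an assumption made for illustration: the `PruningOutcome` variant names are hypothetical, the reading of `true`/`false`/null as keep/skip/unknown is a guess rather than the semantics documented in the PR, and `BoolArray` is a stand-in for arrow's `BooleanArray`.

```rust
/// Hypothetical variants; the real `PruningOutcome` in the PR may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PruningOutcome {
    KeepAll,
    SkipAll,
    Unknown,
}

/// Stand-in for arrow's `BooleanArray` (nullable booleans).
type BoolArray = Vec<Option<bool>>;

pub struct PruningResults {
    results: Option<BoolArray>,
    num_containers: usize,
}

impl PruningResults {
    /// Yields one outcome per container.
    /// Assumed mapping: true => KeepAll, false => SkipAll, null => Unknown.
    /// When no array is stored, every container is reported as Unknown.
    pub fn iter_results(&self) -> impl Iterator<Item = PruningOutcome> + '_ {
        (0..self.num_containers).map(move |i| {
            match self.results.as_ref().map(|r| r[i]) {
                Some(Some(true)) => PruningOutcome::KeepAll,
                Some(Some(false)) => PruningOutcome::SkipAll,
                Some(None) | None => PruningOutcome::Unknown,
            }
        })
    }
}
```

An iterator of this kind would let callers work with named outcomes instead of interpreting nullable booleans at each call site.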

datafusion/physical-expr-common/src/physical_expr/pruning.rs

Comment on lines +174 to +179

    
    impl From<BooleanArray> for PruningResults {
        fn from(array: BooleanArray) -> Self {
            let len = array.len();
            PruningResults::new(Some(array), len)
        }
    }

Contributor

adriangb Jan 2, 2026

This seems like it might be a bit magic; an explicit constructor would be better.

datafusion/physical-expr-common/src/physical_expr/pruning.rs

Comment on lines +251 to +252

    
        pub range_stats: Option<RangeStats>,
        pub null_stats: Option<NullStats>,

Contributor

adriangb Jan 2, 2026

I think it's important to point out that if null stats are missing (NullPresence::UnknownOrMixed), we cannot make any inferences from the min/max values; they should be treated as missing as well.

Contributor Author

2010YOUY01 Jan 3, 2026

The actual inference logic is more aggressive than the algorithm you have described; it is implemented at https://github.com/apache/datafusion/pull/19609/changes#diff-32f7f18dcd86a268e7e1e0134eae6ae002bd42e61180cfabd60944566b10f6d8R660

I'll also add more comments here.

datafusion/physical-expr/src/expressions/binary.rs

    
    ///
    /// # Errors
    /// Returns Internal Error if unsupported operator is provided.
    fn compare_ranges(

Contributor

adriangb Jan 2, 2026

Some unit tests for this method specifically that ensure 100% coverage would be great

Contributor Author

2010YOUY01 commented Jan 3, 2026 •


Thank you for the review. The feedback makes sense to me; I'll address it in a batch later.

Some initial comments. I need to read it a couple more times to actually wrap my head around how it's working.

Please let me know if anything is unclear. I’m trying to make both the implementation and the documentation clearer, but the logic and edge cases for this feature are admittedly quite tricky.

Is the plan to make multiple subsequent PRs to add more handling e.g. for Like expressions, UDFs, etc. and then eventually once we reach feature parity replace the current system?

Yes — the initial milestone should be reaching coverage equivalent to the existing PruningPredicate implementation, so we can reuse the existing tests and gain more confidence.
