
perf: skip_stats backport and correctness/perf improvements#15

Open
radustoenescu wants to merge 7 commits into hstack:main from radustoenescu:skip-stats-rebased

Conversation

@radustoenescu

@radustoenescu radustoenescu commented Apr 9, 2026

Summary

  • perf: backport skip_stats from delta-kernel 0.20 — skip reading stats_parsed from checkpoint parquet files for a wide-table speedup (trade-off: limit-based file pruning is disabled, documented inline)
  • fix: two-phase file selection — an early break over shuffled indices left the without_stats fallback incomplete; files are now classified first and shuffled second
  • perf: LogDataHandler::get() uses binary search over precomputed cumulative row offsets instead of a linear scan
  • fix: read_adds_size no longer panics on arithmetic overflow; negative values are clamped before the usize cast
  • fix: replace unwrap_or(0) with expect() in the phase-2 limit loop to make the routing invariant explicit
  • fix: remove unused StructArray import

Test plan

  • cargo test -p deltalake-core --lib — 215 passed, 0 failed

aditanase and others added 7 commits April 9, 2026 15:34
Signed-off-by: Adrian Tanase <atanase@adobe.com>
…size

- Remove unused StructArray import
- Replace unwrap() with ok().flatten() on sum_array_checked to avoid
  panic on arithmetic overflow
- Clamp negative i64 to 0 before casting to usize
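The overflow- and sign-safe summation can be sketched as follows (a minimal sketch with illustrative names — the real code sums an Arrow column via sum_array_checked, which is elided here):

```rust
/// Hypothetical stand-in for read_adds_size: sum nullable i64 `size`
/// values without panicking. `sizes` models the column read from the log.
fn read_adds_size(sizes: &[Option<i64>]) -> usize {
    // checked_add returns None on arithmetic overflow instead of
    // panicking; try_fold short-circuits to None on the first overflow.
    let total: Option<i64> = sizes
        .iter()
        .try_fold(0i64, |acc, s| acc.checked_add(s.unwrap_or(0)));
    // Clamp negative totals to 0 before the usize cast: a bare
    // `as usize` on a negative i64 would wrap to a huge value.
    total.unwrap_or(0).max(0) as usize
}
```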
Precompute cumulative row offsets at construction time and use
binary search (partition_point) for O(log b) lookup instead of
O(b) linear scan per call.
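The lookup scheme can be sketched like this (struct and method names are illustrative, not the actual LogDataHandler API): precompute cumulative offsets once at construction, then map a global row index to a (batch, row-within-batch) pair with partition_point.

```rust
/// offsets[i] = total rows in batches 0..i, so offsets has len = batches + 1.
struct LogData {
    offsets: Vec<usize>,
}

impl LogData {
    fn new(batch_lens: &[usize]) -> Self {
        let mut offsets = Vec::with_capacity(batch_lens.len() + 1);
        offsets.push(0);
        for &len in batch_lens {
            offsets.push(offsets.last().unwrap() + len);
        }
        Self { offsets }
    }

    /// O(log b) lookup of the batch containing global row `idx`.
    fn get(&self, idx: usize) -> Option<(usize, usize)> {
        if idx >= *self.offsets.last()? {
            return None; // out of range
        }
        // partition_point finds the first offset > idx;
        // the containing batch is the one just before it.
        let batch = self.offsets.partition_point(|&o| o <= idx) - 1;
        Some((batch, idx - self.offsets[batch]))
    }
}
```

Empty batches are handled naturally: repeated equal offsets are skipped past by the partition_point predicate.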
Split the shuffle/limit/prune loop into two phases:
1. Classify kept files into with-stats and without-stats lists
2. Shuffle with-stats (when enabled), then apply limit cutoff

Fixes a correctness bug: with shuffled indices, the old break would
stop iteration early, missing without-stats files that hadn't been
visited yet. For example, with files [A(stats), B(no-stats), C(stats)]
shuffled to [C, A, B], if C alone satisfied the limit the break fired
before B was seen — so the without-stats fallback was empty even though
the limit might not actually be met (C's stats could overestimate).
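The two-phase shape can be sketched as follows (a simplified sketch with illustrative names and types; the shuffle itself is elided). Because classification runs to completion before any limit cutoff, the without_stats list is always complete:

```rust
/// Phase 1: classify every kept file. Phase 2: shuffle (elided) and
/// apply the limit cutoff to the with-stats list only.
fn select_files(
    files: Vec<(String, Option<i64>)>, // (path, num_records from stats)
    limit: Option<i64>,
    shuffle: bool,
) -> (Vec<String>, Vec<String>) {
    // Phase 1: every file is visited exactly once, so files without
    // stats can no longer be missed by an early break.
    let (mut with_stats, mut without_stats) = (Vec::new(), Vec::new());
    for (path, num_records) in files {
        match num_records {
            Some(n) => with_stats.push((path, n)),
            None => without_stats.push(path),
        }
    }
    // Phase 2: optionally shuffle, then stop once the limit is covered.
    if shuffle { /* e.g. with_stats.shuffle(&mut rng) */ }
    let mut selected = Vec::new();
    if let Some(limit) = limit {
        let mut seen = 0i64;
        for (path, n) in with_stats {
            if seen >= limit {
                break; // safe now: only with-stats files remain
            }
            seen += n;
            selected.push(path);
        }
    } else {
        selected.extend(with_stats.into_iter().map(|(p, _)| p));
    }
    (selected, without_stats)
}
```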
- files_from() missing skip_stats for consistency with files()
- read_adds_size() relies on flattened scan-output schema
- DELTA_RS_SHUFFLE_FILES env var is too permissive, no seed control
Replace unwrap_or(0) with expect() to surface the invariant that
files in with_stats always have num_records when a limit is set.
Phase 1 routes all None-stats files to without_stats, so this is
unreachable — but unwrap_or(0) would silently miscount if the
routing logic is ever changed.
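The unwrap_or(0) → expect() change amounts to this pattern (function name and message are illustrative, not the actual delta-rs code):

```rust
/// Surfacing the invariant with expect() instead of unwrap_or(0):
/// if routing ever sends a None-stats file into with_stats, this
/// panics loudly instead of silently counting the file as 0 rows.
fn num_records_for_limit(num_records: Option<i64>) -> i64 {
    num_records.expect("with_stats files always carry num_records when a limit is set")
}
```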
skip_stats removes stats_parsed (including numRecords) from checkpoint
reads, which disables limit-based file pruning — all files go through
the without_stats fallback. This is intentional: for wide tables the
checkpoint read speedup outweighs the cost of scanning extra files.
Limit is still enforced at scan time, so correctness is unaffected.
@github-actions

github-actions bot commented Apr 9, 2026

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@radustoenescu radustoenescu changed the title [HSTACK] skip_stats backport + correctness/perf fixes perf: skip_stats backport and correctness/perf improvements Apr 9, 2026