perf: skip_stats backport and correctness/perf improvements#15
Open
radustoenescu wants to merge 7 commits intohstack:mainfrom
Open
perf: skip_stats backport and correctness/perf improvements#15radustoenescu wants to merge 7 commits intohstack:mainfrom
radustoenescu wants to merge 7 commits intohstack:mainfrom
Conversation
Signed-off-by: Adrian Tanase <atanase@adobe.com>
…size - Remove unused StructArray import - Replace unwrap() with ok().flatten() on sum_array_checked to avoid panic on arithmetic overflow - Clamp negative i64 to 0 before casting to usize
Precompute cumulative row offsets at construction time and use binary search (partition_point) for O(log b) lookup instead of O(b) linear scan per call.
Split the shuffle/limit/prune loop into two phases: 1. Classify kept files into with-stats and without-stats lists 2. Shuffle with-stats (when enabled), then apply limit cutoff Fixes a correctness bug: with shuffled indices, the old break would stop iteration early, missing without-stats files that hadn't been visited yet. For example, with files [A(stats), B(no-stats), C(stats)] shuffled to [C, A, B], if C alone satisfied the limit the break fired before B was seen — so the without-stats fallback was empty even though the limit might not actually be met (C's stats could overestimate).
- files_from() missing skip_stats for consistency with files() - read_adds_size() relies on flattened scan-output schema - DELTA_RS_SHUFFLE_FILES env var is too permissive, no seed control
Replace unwrap_or(0) with expect() to surface the invariant that files in with_stats always have num_records when a limit is set. Phase 1 routes all None-stats files to without_stats, so this is unreachable — but unwrap_or(0) would silently miscount if the routing logic is ever changed.
skip_stats removes stats_parsed (including numRecords) from checkpoint reads, which disables limit-based file pruning — all files go through the without_stats fallback. This is intentional: for wide tables the checkpoint read speedup outweighs the cost of scanning extra files. Limit is still enforced at scan time, so correctness is unaffected.
|
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
skip_statsfrom delta-kernel 0.20 — skip readingstats_parsedfrom checkpoint parquet files for wide table speedup (trade-off: limit-based file pruning disabled, documented inline)breakon shuffled indices caused incompletewithout_statsfallback; now classify first, shuffle secondLogDataHandler::get()uses binary search via precomputed cumulative row offsets instead of linear scanread_adds_sizeno longer panics on arithmetic overflow; negative values clamped beforeusizecastunwrap_or(0)withexpect()in phase-2 limit loop to make routing invariant explicitStructArrayimportTest plan
cargo test -p deltalake-core --lib— 215 passed, 0 failed