perf: faster metric aggregations and phrase-query intersection#2975
Draft
guilload wants to merge 4 commits into
The `unstable` bench module used `rand::distributions::Alphanumeric`, which moved to `rand::distr` in rand 0.9, breaking compilation of the nightly benches. Required to run the benchmarks in the following commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The phrase scorer intersects sorted position lists with a linear two-pointer merge (O(n+m)). When the two lists differ greatly in size (e.g. a rare term and a frequent one in a phrase), this walks the long list element by element to find a handful of matches.

Add a galloping (exponential + binary search) variant, shared via a `gallop_find` helper, and route `intersection_count` / `intersection` to it behind a size guard (`GALLOP_RATIO = 64`): gallop only when one list is >=64x the other, otherwise keep the cache-friendly two-pointer. This avoids the known galloping regression on balanced/dense inputs.

Equivalence with the linear reference is checked by a proptest over a wide range of sizes (both small/large branches). Slop variants are left on the two-pointer (range-match + best-match bookkeeping make galloping correctness-risky, and slop>0 phrases are less common).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
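The guarded dispatch described in this commit can be sketched as follows. This is a minimal illustration, not the code in the PR: the names `gallop_find`, `intersection_count` and `GALLOP_RATIO` come from the commit message, but their signatures, the `_linear` / `_gallop` helper names, and the assumption that both inputs are strictly increasing `u32` slices are guesses made for the example.

```rust
use std::cmp::Ordering;

/// Size ratio above which galloping is used (value quoted in the commit message).
const GALLOP_RATIO: usize = 64;

/// Returns the first index `i >= start` with `sorted[i] >= target`, or
/// `sorted.len()` if there is none. Hypothetical signature.
fn gallop_find(sorted: &[u32], start: usize, target: u32) -> usize {
    let len = sorted.len();
    if start >= len || sorted[start] >= target {
        return start.min(len);
    }
    // Invariant: sorted[lo] < target. Probe at exponentially growing steps.
    let mut lo = start;
    let mut step = 1usize;
    while lo + step < len && sorted[lo + step] < target {
        lo += step;
        step *= 2;
    }
    // The answer lies in (lo, hi]; finish with a binary search.
    let hi = (lo + step).min(len);
    lo + 1 + sorted[lo + 1..hi].partition_point(|&v| v < target)
}

/// Reference two-pointer merge, O(n + m).
fn intersection_count_linear(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                count += 1;
                i += 1;
                j += 1;
            }
        }
    }
    count
}

/// Galloping variant: for each element of the short list, gallop in the long one.
fn intersection_count_gallop(small: &[u32], large: &[u32]) -> usize {
    let (mut pos, mut count) = (0, 0);
    for &v in small {
        pos = gallop_find(large, pos, v);
        if pos == large.len() {
            break;
        }
        if large[pos] == v {
            count += 1;
            pos += 1;
        }
    }
    count
}

/// Gallop only when one list is at least `GALLOP_RATIO` times the other.
fn intersection_count(a: &[u32], b: &[u32]) -> usize {
    let (small, large) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    if large.len() >= small.len().saturating_mul(GALLOP_RATIO) {
        intersection_count_gallop(small, large)
    } else {
        intersection_count_linear(a, b)
    }
}
```

With a 4-element list against a 4096-element list the guard selects the galloping path; with two lists of similar length it falls back to the two-pointer merge.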
…tor lanes

The metric reducer `collect_stats` (shared by sum/avg/min/max/count/stats) folded values one at a time through Kahan compensated summation. The Kahan recurrence carries `sum`/`delta` across every iteration: a strict serial dependency chain that blocks both CPU pipelining and auto-vectorization of the co-located min/max, and it ran over an iterator rather than a slice.

Add `ColumnBlockAccessor::vals()` to expose the fetched block as a slice, and reduce it with 4 independent (sum, delta) Kahan lanes + 4 min/max lanes, merged back with the same compensated combination as `merge_fruits`. The four chains run in parallel and min/max vectorize.

Accuracy is preserved: it is still Kahan-compensated; only the summation order changes, exactly as it already does when merging across segments. All aggregation tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
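A minimal sketch of the lane idea follows, assuming the block is available as an `&[f64]` slice. The types `KahanLane` and `BlockStats`, the function `collect_block_stats`, and the exact form of the lane merge are hypothetical and written for illustration; they are not taken from tantivy's `collect_stats` or `merge_fruits`.

```rust
/// One compensated-summation lane: running sum plus its error term.
#[derive(Clone, Copy)]
struct KahanLane {
    sum: f64,
    delta: f64,
}

impl KahanLane {
    /// Standard Kahan step.
    fn add(&mut self, value: f64) {
        let y = value - self.delta;
        let t = self.sum + y;
        self.delta = (t - self.sum) - y;
        self.sum = t;
    }

    /// Combine two lanes while keeping the compensation terms.
    /// Assumption: one plausible compensated merge, chosen for the sketch.
    fn merge(&mut self, other: KahanLane) {
        let y = other.sum - (self.delta + other.delta);
        let t = self.sum + y;
        self.delta = (t - self.sum) - y;
        self.sum = t;
    }
}

/// Hypothetical result type for the example.
struct BlockStats {
    count: u64,
    sum: f64,
    min: f64,
    max: f64,
}

/// Reduce a block with 4 independent Kahan lanes and 4 min/max lanes.
fn collect_block_stats(vals: &[f64]) -> BlockStats {
    const LANES: usize = 4;
    let mut sums = [KahanLane { sum: 0.0, delta: 0.0 }; LANES];
    let mut mins = [f64::INFINITY; LANES];
    let mut maxs = [f64::NEG_INFINITY; LANES];

    let mut chunks = vals.chunks_exact(LANES);
    for chunk in &mut chunks {
        // Each lane only depends on its own previous state, so the four
        // dependency chains can overlap in the CPU pipeline.
        for lane in 0..LANES {
            let v = chunk[lane];
            sums[lane].add(v);
            mins[lane] = mins[lane].min(v);
            maxs[lane] = maxs[lane].max(v);
        }
    }
    // Leftover values (fewer than LANES) go into lane 0.
    for &v in chunks.remainder() {
        sums[0].add(v);
        mins[0] = mins[0].min(v);
        maxs[0] = maxs[0].max(v);
    }

    // Fold the lanes back into a single result.
    let mut total = sums[0];
    for lane in &sums[1..] {
        total.merge(*lane);
    }
    BlockStats {
        count: vals.len() as u64,
        sum: total.sum,
        min: mins.iter().copied().fold(f64::INFINITY, f64::min),
        max: maxs.iter().copied().fold(f64::NEG_INFINITY, f64::max),
    }
}
```

The result differs from a single-chain Kahan sum only in summation order, which is the same kind of reordering the commit message describes for merging across segments.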
`extended_stats` collects with Welford's online variance, which recomputes the running mean from the running sum each step: a strictly serial recurrence that cannot be vectorized in place.

Add `collect_block`, which accumulates the block into 4 independent `IntermediateExtendedStats` lanes and combines them with the existing `merge_fruits` (Chan parallel-variance combination). This is the exact operation already used to merge across segments, so results match multi-segment aggregation.

Stays within the tight `EPSILON_FOR_TEST = 2e-12` tolerance; even the rounding-sensitive `test_aggregation_level1` passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
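The lane-and-merge scheme can be illustrated with the textbook Welford update and Chan's pairwise combination. This sketch is an assumption-laden stand-in: the `Moments` type and its fields are hypothetical and track the mean directly, whereas the commit describes `IntermediateExtendedStats` as deriving the mean from a running sum; only the name `collect_block` and the lane count of 4 come from the commit message.

```rust
/// Running count, mean, and sum of squared deviations (M2).
/// Hypothetical stand-in for the real intermediate stats type.
#[derive(Clone, Copy, Default)]
struct Moments {
    count: u64,
    mean: f64,
    m2: f64,
}

impl Moments {
    /// Welford's online update: strictly serial within one lane.
    fn push(&mut self, v: f64) {
        self.count += 1;
        let d = v - self.mean;
        self.mean += d / self.count as f64;
        self.m2 += d * (v - self.mean);
    }

    /// Chan et al. parallel combination of two partial results.
    fn merge(&mut self, other: &Moments) {
        if other.count == 0 {
            return;
        }
        if self.count == 0 {
            *self = *other;
            return;
        }
        let n_a = self.count as f64;
        let n_b = other.count as f64;
        let n = n_a + n_b;
        let d = other.mean - self.mean;
        self.m2 += other.m2 + d * d * n_a * n_b / n;
        self.mean += d * n_b / n;
        self.count += other.count;
    }

    /// Population variance; NaN for an empty input.
    fn variance_population(&self) -> f64 {
        if self.count == 0 {
            f64::NAN
        } else {
            self.m2 / self.count as f64
        }
    }
}

/// Accumulate a block into 4 independent lanes, then merge them.
fn collect_block(vals: &[f64]) -> Moments {
    let mut lanes = [Moments::default(); 4];
    for (i, &v) in vals.iter().enumerate() {
        lanes[i % 4].push(v);
    }
    let mut total = Moments::default();
    for lane in &lanes {
        total.merge(lane);
    }
    total
}
```

Because the lane merge is the same operation as a cross-segment merge, a block reduced this way should agree with a serial Welford pass up to floating-point rounding.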
Summary
Three independent performance optimizations found while sweeping tantivy's hot paths, each benchmarked before/after. Two are in metric aggregations (the biggest wins) and one is in phrase-query position intersection.
The common thread for the aggregation wins: the reducers were bottlenecked on a serial floating-point dependency chain (Kahan / Welford), which blocks both CPU pipelining and auto-vectorization. Breaking the chain into independent accumulator lanes — merged back with the same compensated/parallel combination already used across segment merges — roughly halves the work with no accuracy regression.
Commits
- `chore(bench)`: fix a stale `rand::distributions` import so the nightly benches compile.
- `perf(query)`: gallop sorted-array intersections in the phrase scorer.
- `perf(aggregation)`: break the Kahan-sum dependency chain with accumulator lanes (sum/avg/min/max/count/stats).
- `perf(aggregation)`: parallelize `extended_stats` variance (Welford) via lane merge.

Benchmarks
Metric aggregations: `benches/agg_bench.rs`, 1M docs, full cardinality. Benchmark cases:
- `average_u64`
- `average_f64`
- `average_f64_u64`
- `stats_f64`
- `extendedstats_f64`

Dense-cardinality variants improve ~−34% (`average`/`stats`) and ~−22% (`extended_stats`). `extended_stats` gains less than `stats` because Welford has inherent serial work (a per-value division) plus lane-merge overhead.

Phrase position intersection: `intersection_count` / `intersection`, nightly `--features unstable`. Benchmark cases:
- `intersection_count`, asymmetric (4096 × 4)
- `intersection_count`, balanced (2048 × 2048)
- `intersection` (writes output), asymmetric (4096 × 4)
- `intersection`, balanced (2048 × 2048)

The `GALLOP_RATIO = 64` guard ensures balanced/dense inputs never regress: galloping is only used when one list is ≥64× the other (the asymmetric regime, e.g. a rare + a frequent phrase term). Galloping on balanced inputs would be ~8× slower due to binary-search cache misses, which the guard avoids.

Correctness
- `test_galloping_matches_scalar` checks both the counting and output-writing variants against the linear two-pointer reference over a wide range of (a)symmetric sizes.
- The accumulator lanes are combined with the existing `merge_fruits` math, so results match multi-segment aggregation. All `aggregation::` tests pass within the existing `2e-12` tolerance.

Notes / scope
- The slop variants (`intersection_count_with_slop`, `_with_carrying_slop`) are deliberately left on the two-pointer: range-match + best-match/slop-budget bookkeeping make galloping correctness-risky, and slop>0 phrases are less common.

🤖 Generated with Claude Code