perf: faster metric aggregations and phrase-query intersection #2975

Draft
guilload wants to merge 4 commits into main from guilload/claude-perf-low-hanging-fruits

@guilload (Member)

Summary

Three independent performance optimizations found while sweeping tantivy's hot paths, each benchmarked before/after. Two are in metric aggregations (the biggest wins) and one is in phrase-query position intersection.

The common thread for the aggregation wins: the reducers were bottlenecked on a serial floating-point dependency chain (Kahan / Welford), which blocks both CPU pipelining and auto-vectorization. Breaking the chain into independent accumulator lanes, merged back with the same compensated/parallel combination already used across segment merges, cuts aggregation time by roughly a quarter to a half in the benchmarks below, with no accuracy regression.
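As a rough illustration of the lane idea, here is a minimal sketch of a 4-lane Kahan sum over a slice. The names `KahanLane`, `LANES`, and `kahan_sum_lanes` are hypothetical and are not taken from tantivy; the merge step follows the compensated form the PR attributes to `merge_fruits`, written here from the textbook recurrence rather than copied from the code base.

```rust
/// Number of independent accumulators (assumed; the PR uses 4 lanes).
const LANES: usize = 4;

/// One Kahan accumulator: the running sum plus the rounding error it still owes.
#[derive(Clone, Copy, Default)]
struct KahanLane {
    sum: f64,
    delta: f64,
}

impl KahanLane {
    /// Classic Kahan step. Each call depends on the previous `sum` and `delta`,
    /// which is the serial chain the lanes are meant to break up.
    #[inline]
    fn add(&mut self, value: f64) {
        let y = value - self.delta;
        let t = self.sum + y;
        self.delta = (t - self.sum) - y;
        self.sum = t;
    }

    /// Fold another lane into this one while keeping both compensation terms.
    fn merge(&mut self, other: KahanLane) {
        let y = other.sum - (self.delta + other.delta);
        let t = self.sum + y;
        self.delta = (t - self.sum) - y;
        self.sum = t;
    }
}

/// Sum a slice with `LANES` independent Kahan chains, then combine them.
fn kahan_sum_lanes(values: &[f64]) -> f64 {
    let mut lanes = [KahanLane::default(); LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in &mut chunks {
        // Element k of every chunk goes to lane k, so the chains never touch.
        for (lane, &v) in lanes.iter_mut().zip(chunk) {
            lane.add(v);
        }
    }
    // Tail shorter than LANES: spread it over the first lanes.
    for (lane, &v) in lanes.iter_mut().zip(chunks.remainder()) {
        lane.add(v);
    }
    let mut total = KahanLane::default();
    for lane in lanes {
        total.merge(lane);
    }
    total.sum
}
```

Only the summation order differs from a single-lane Kahan sum, so the result can differ in the last bits but stays compensated.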

Draft. Benchmarks were run on Apple Silicon (aarch64). All existing test suites pass (240 tests under query::, 237 under aggregation::); aggregation accuracy is verified by the existing tolerance tests, including the rounding-sensitive test_aggregation_level1.

Commits

  1. chore(bench) — fix a stale rand::distributions import so the nightly benches compile.
  2. perf(query) — gallop sorted-array intersections in the phrase scorer.
  3. perf(aggregation) — break the Kahan-sum dependency chain with accumulator lanes (sum/avg/min/max/count/stats).
  4. perf(aggregation) — parallelize extended_stats variance (Welford) via lane merge.

Benchmarks

Metric aggregations — benches/agg_bench.rs, 1M docs, full cardinality

| Benchmark | Before | After | Δ |
|---|---|---|---|
| average_u64 | 3.90 ms | 1.98 ms | −49% |
| average_f64 | 4.07 ms | 2.13 ms | −48% |
| average_f64_u64 | 7.70 ms | 3.78 ms | −51% |
| stats_f64 | 6.30 ms | 2.36 ms | −41% |
| extendedstats_f64 | ~6.5 ms | ~3.3 ms | ~−27% |

Dense-cardinality variants improve by ~34% (average/stats) and ~22% (extended_stats). extended_stats gains less than stats because Welford has inherent serial work (a per-value division) plus lane-merge overhead.
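To make the Welford lane merge concrete, the sketch below accumulates a slice into 4 independent lanes and combines them with Chan's parallel-variance formula. `Moments` and `variance_lanes` are hypothetical names, and this is textbook Welford rather than tantivy's `IntermediateExtendedStats`, whose fields and update rule may differ.

```rust
/// Running moments of one lane (hypothetical stand-in for an intermediate stats struct).
#[derive(Clone, Copy, Default)]
struct Moments {
    count: u64,
    mean: f64,
    /// Sum of squared deviations from the current mean.
    m2: f64,
}

impl Moments {
    /// Welford update: note the per-value division, which stays serial within a lane.
    fn push(&mut self, x: f64) {
        self.count += 1;
        let d = x - self.mean;
        self.mean += d / self.count as f64;
        self.m2 += d * (x - self.mean);
    }

    /// Chan et al. combination of two partial results.
    fn merge(&mut self, other: Moments) {
        if other.count == 0 {
            return;
        }
        if self.count == 0 {
            *self = other;
            return;
        }
        let n = (self.count + other.count) as f64;
        let d = other.mean - self.mean;
        self.m2 += other.m2 + d * d * (self.count as f64) * (other.count as f64) / n;
        self.mean += d * other.count as f64 / n;
        self.count += other.count;
    }

    /// Population variance, or `None` when no value was seen.
    fn population_variance(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.m2 / self.count as f64)
        }
    }
}

/// Distribute values round-robin over 4 lanes, then merge the lanes.
fn variance_lanes(values: &[f64]) -> Option<f64> {
    let mut lanes = [Moments::default(); 4];
    for (i, &v) in values.iter().enumerate() {
        lanes[i % 4].push(v);
    }
    let mut total = Moments::default();
    for lane in lanes {
        total.merge(lane);
    }
    total.population_variance()
}
```

The merge costs a few extra multiplications and a division per lane, which is the lane-merge overhead mentioned above.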

Phrase position intersection — intersection_count / intersection, nightly --features unstable

| Case | Before | After | Δ |
|---|---|---|---|
| intersection_count, asymmetric (4096 × 4) | 2159 ns | 66 ns | ~33× |
| intersection_count, balanced (2048 × 2048) | 1464 ns | 1456 ns | no change (guard keeps two-pointer) |
| intersection (writes output), asymmetric (4096 × 4) | ~2400 ns | 294 ns | ~8× (residual is the buffer memcpy) |
| intersection, balanced (2048 × 2048) | 2364 ns | unchanged | no change |

The GALLOP_RATIO = 64 guard ensures balanced/dense inputs never regress: galloping is only used when one list is ≥64× the other (the asymmetric regime, e.g. a rare + a frequent phrase term). Galloping on balanced inputs would be ~8× slower due to binary-search cache misses, which the guard avoids.
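The guard and the galloping search can be sketched roughly as follows. `GALLOP_RATIO`, `gallop_find`, and `intersection_count` are names from this PR, but the bodies below are an assumption written for illustration (including the `u32` position type and the strictly increasing inputs), not the code in the diff.

```rust
use std::cmp::Ordering;

const GALLOP_RATIO: usize = 64;

/// Index of the first element >= `target` in `sorted[start..]`, or `sorted.len()`.
/// Exponential probing brackets the answer, then a binary search pins it down.
fn gallop_find(sorted: &[u32], start: usize, target: u32) -> usize {
    let mut lo = start;
    let mut step = 1usize;
    while lo < sorted.len() && sorted[lo] < target {
        let probe = lo + step;
        if probe >= sorted.len() || sorted[probe] >= target {
            // The answer lies in (lo, min(probe, len)].
            let hi = probe.min(sorted.len());
            return lo + 1 + sorted[lo + 1..hi].partition_point(|&v| v < target);
        }
        lo = probe;
        step *= 2;
    }
    lo
}

/// Reference two-pointer merge, O(n + m).
fn intersection_count_linear(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                count += 1;
                i += 1;
                j += 1;
            }
        }
    }
    count
}

/// Galloping variant: for each element of the short list, jump ahead in the long one.
fn intersection_count_gallop(small: &[u32], large: &[u32]) -> usize {
    let (mut pos, mut count) = (0, 0);
    for &target in small {
        pos = gallop_find(large, pos, target);
        if pos == large.len() {
            break;
        }
        if large[pos] == target {
            count += 1;
            pos += 1;
        }
    }
    count
}

/// Size guard: gallop only when one list is at least GALLOP_RATIO times the other.
fn intersection_count(a: &[u32], b: &[u32]) -> usize {
    let (small, large) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    if large.len() >= GALLOP_RATIO.saturating_mul(small.len()) {
        intersection_count_gallop(small, large)
    } else {
        intersection_count_linear(a, b)
    }
}
```

With a 4096-element list against a 5-element one the guard selects the galloping path; two lists of similar length stay on the two-pointer merge.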

Correctness

  • Phrase galloping: proptest test_galloping_matches_scalar checks both the counting and output-writing variants against the linear two-pointer reference over a wide range of (a)symmetric sizes.
  • Aggregation lanes: the lane→bucket combination reuses the existing merge_fruits math, so results match multi-segment aggregation. All aggregation:: tests pass within the existing 2e-12 tolerance.

Notes / scope

  • Slop phrase variants (intersection_count_with_slop, _with_carrying_slop) are deliberately left on the two-pointer: range-match + best-match/slop-budget bookkeeping make galloping correctness-risky, and slop>0 phrases are less common.
  • The bench numbers are on aarch64; the aggregation wins (breaking a serial fp chain) are platform-independent in nature.

🤖 Generated with Claude Code

guilload and others added 4 commits June 24, 2026 13:24
The `unstable` bench module used `rand::distributions::Alphanumeric`,
which moved to `rand::distr` in rand 0.9, breaking compilation of the
nightly benches. Required to run the benchmarks in the following commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The phrase scorer intersects sorted position lists with a linear
two-pointer merge (O(n+m)). When the two lists differ greatly in size
(e.g. a rare term and a frequent one in a phrase), this walks the long
list element by element to find a handful of matches.

Add a galloping (exponential + binary search) variant, shared via a
`gallop_find` helper, and route `intersection_count` / `intersection`
to it behind a size guard (`GALLOP_RATIO = 64`): gallop only when one
list is >=64x the other, otherwise keep the cache-friendly two-pointer.
This avoids the known galloping regression on balanced/dense inputs.

Equivalence with the linear reference is checked by a proptest over a
wide range of sizes (both small/large branches). Slop variants are left
on the two-pointer (range-match + best-match bookkeeping make galloping
correctness-risky, and slop>0 phrases are less common).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tor lanes

The metric reducer `collect_stats` (shared by sum/avg/min/max/count/stats)
folded values one at a time through Kahan compensated summation. The Kahan
recurrence carries `sum`/`delta` across every iteration — a strict serial
dependency chain that blocks both CPU pipelining and auto-vectorization of
the co-located min/max, and it ran over an iterator rather than a slice.

Add `ColumnBlockAccessor::vals()` to expose the fetched block as a slice,
and reduce it with 4 independent (sum, delta) Kahan lanes + 4 min/max lanes,
merged back with the same compensated combination as `merge_fruits`. The
four chains run in parallel and min/max vectorize.

Accuracy is preserved: it is still Kahan-compensated; only the summation
order changes, exactly as it already does when merging across segments.
All aggregation tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`extended_stats` collects with Welford's online variance, which recomputes
the running mean from the running sum each step — a strictly serial
recurrence that cannot be vectorized in place.

Add `collect_block`, which accumulates the block into 4 independent
`IntermediateExtendedStats` lanes and combines them with the existing
`merge_fruits` (Chan parallel-variance combination) — the exact operation
already used to merge across segments, so results match multi-segment
aggregation.

Stays within the tight `EPSILON_FOR_TEST = 2e-12` tolerance; even the
rounding-sensitive `test_aggregation_level1` passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>