perf: faster metric aggregations and phrase-query intersection by guilload · Pull Request #2975 · quickwit-oss/tantivy

guilload · 2026-06-24T17:25:37Z

Summary

Three independent performance optimizations found while sweeping tantivy's hot paths, each benchmarked before/after. Two are in metric aggregations (the biggest wins) and one is in phrase-query position intersection.

The common thread for the aggregation wins: the reducers were bottlenecked on a serial floating-point dependency chain (Kahan / Welford), which blocks both CPU pipelining and auto-vectorization. Breaking the chain into independent accumulator lanes — merged back with the same compensated/parallel combination already used across segment merges — roughly halves the work with no accuracy regression.

Draft. Benchmarks run on Apple Silicon (aarch64). All existing test suites pass (query:: 240, aggregation:: 237); aggregation accuracy is verified by the existing tolerance tests, including the rounding-sensitive test_aggregation_level1.

Commits

chore(bench) — fix a stale rand::distributions import so the nightly benches compile.
perf(query) — gallop sorted-array intersections in the phrase scorer.
perf(aggregation) — break the Kahan-sum dependency chain with accumulator lanes (sum/avg/min/max/count/stats).
perf(aggregation) — parallelize extended_stats variance (Welford) via lane merge.

Benchmarks

Metric aggregations — `benches/agg_bench.rs`, 1M docs, full cardinality

| Benchmark | Before | After | Δ |
|---|---|---|---|
| `average_u64` | 3.90 ms | 1.98 ms | −49% |
| `average_f64` | 4.07 ms | 2.13 ms | −48% |
| `average_f64_u64` | 7.70 ms | 3.78 ms | −51% |
| `stats_f64` | 6.30 ms | 2.36 ms | −41% |
| `extendedstats_f64` | ~6.5 ms | ~3.3 ms | ~−27% |

Dense-cardinality variants improve by ~34% (average/stats) and ~22% (extended_stats). extended_stats gains less than stats because Welford has inherently serial work (a per-value division) and pays lane-merge overhead on top.

Phrase position intersection — `intersection_count` / `intersection`, nightly `--features unstable`

| Case | Before | After | Δ |
|---|---|---|---|
| `intersection_count` asymmetric (4096 × 4) | 2159 ns | 66 ns | ~33× |
| `intersection_count` balanced (2048 × 2048) | 1464 ns | 1456 ns | no change (guard keeps two-pointer) |
| `intersection` (writes output) asymmetric (4096 × 4) | ~2400 ns | 294 ns | ~8× (residual is the buffer memcpy) |
| `intersection` balanced (2048 × 2048) | 2364 ns | unchanged | no change |

The GALLOP_RATIO = 64 guard ensures balanced/dense inputs never regress: galloping is only used when one list is ≥64× the other (the asymmetric regime, e.g. a rare + a frequent phrase term). Galloping on balanced inputs would be ~8× slower due to binary-search cache misses, which the guard avoids.
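The guarded dispatch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `gallop_find` and `GALLOP_RATIO` are named in the PR, but the signatures, the `u32` element type, and the helpers `gallop_count` and `two_pointer_count` are hypothetical, and both inputs are assumed sorted and strictly increasing.

```rust
use std::cmp::Ordering;

/// Size ratio above which galloping is used (value taken from the PR text).
const GALLOP_RATIO: usize = 64;

/// Returns the first index `i >= start` with `haystack[i] >= target`,
/// or `haystack.len()` if there is none (hypothetical signature).
fn gallop_find(haystack: &[u32], start: usize, target: u32) -> usize {
    if start >= haystack.len() || haystack[start] >= target {
        return start.min(haystack.len());
    }
    // Invariant: haystack[lo] < target.
    let mut lo = start;
    let mut step = 1;
    // Exponential phase: double the step until we overshoot.
    while lo + step < haystack.len() && haystack[lo + step] < target {
        lo += step;
        step *= 2;
    }
    let hi = (lo + step).min(haystack.len());
    // Binary-search phase inside (lo, hi).
    lo + 1 + haystack[lo + 1..hi].partition_point(|&v| v < target)
}

/// Linear two-pointer reference: O(n + m).
fn two_pointer_count(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                count += 1;
                i += 1;
                j += 1;
            }
        }
    }
    count
}

/// Galloping variant: for each element of the short list, skip ahead in the long one.
fn gallop_count(small: &[u32], large: &[u32]) -> usize {
    let (mut pos, mut count) = (0, 0);
    for &target in small {
        pos = gallop_find(large, pos, target);
        if pos == large.len() {
            break;
        }
        if large[pos] == target {
            count += 1;
            pos += 1;
        }
    }
    count
}

/// Gallop only when one list is at least GALLOP_RATIO times longer than the other.
pub fn intersection_count(a: &[u32], b: &[u32]) -> usize {
    let (small, large) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    if small.len().saturating_mul(GALLOP_RATIO) <= large.len() {
        gallop_count(small, large)
    } else {
        two_pointer_count(small, large)
    }
}
```

In this sketch a 4096 × 4 input takes the galloping branch, while a 2048 × 2048 input stays on the two-pointer merge, matching the two regimes in the table.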

Correctness

Phrase galloping: proptest test_galloping_matches_scalar checks both the counting and output-writing variants against the linear two-pointer reference over a wide range of (a)symmetric sizes.
Aggregation lanes: the lane→bucket combination reuses the existing merge_fruits math, so results match multi-segment aggregation. All aggregation:: tests pass within the existing 2e-12 tolerance.

Notes / scope

Slop phrase variants (intersection_count_with_slop, _with_carrying_slop) are deliberately left on the two-pointer: range-match + best-match/slop-budget bookkeeping make galloping correctness-risky, and slop>0 phrases are less common.
The bench numbers are on aarch64; the aggregation wins (breaking a serial fp chain) are platform-independent in nature.

🤖 Generated with Claude Code

The `unstable` bench module used `rand::distributions::Alphanumeric`, which moved to `rand::distr` in rand 0.9, breaking compilation of the nightly benches. This fix is required to run the benchmarks in the following commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The phrase scorer intersects sorted position lists with a linear two-pointer merge (O(n+m)). When the two lists differ greatly in size (e.g. a rare term and a frequent one in a phrase), this walks the long list element by element to find a handful of matches.

Add a galloping (exponential + binary search) variant, shared via a `gallop_find` helper, and route `intersection_count` / `intersection` to it behind a size guard (`GALLOP_RATIO = 64`): gallop only when one list is >=64x the other, otherwise keep the cache-friendly two-pointer. This avoids the known galloping regression on balanced/dense inputs.

Equivalence with the linear reference is checked by a proptest over a wide range of sizes (both small/large branches). Slop variants are left on the two-pointer (range-match + best-match bookkeeping make galloping correctness-risky, and slop>0 phrases are less common).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tor lanes

The metric reducer `collect_stats` (shared by sum/avg/min/max/count/stats) folded values one at a time through Kahan compensated summation. The Kahan recurrence carries `sum`/`delta` across every iteration, forming a strict serial dependency chain that blocks both CPU pipelining and auto-vectorization of the co-located min/max. It also ran over an iterator rather than a slice.

Add `ColumnBlockAccessor::vals()` to expose the fetched block as a slice, and reduce it with 4 independent (sum, delta) Kahan lanes + 4 min/max lanes, merged back with the same compensated combination as `merge_fruits`. The four chains run in parallel and min/max vectorize.

Accuracy is preserved: it is still Kahan-compensated; only the summation order changes, exactly as it already does when merging across segments. All aggregation tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
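As a rough sketch of the lane idea in this commit message, the snippet below reduces a slice with four independent Kahan accumulators and folds them back with one more compensated pass. It is an illustration only: `Lane`, `reduce_block`, and the merge step are assumptions made for this example and are not tantivy's `collect_stats` or `merge_fruits`.

```rust
const LANES: usize = 4;

/// One independent accumulator: Kahan (sum, delta) plus min/max.
#[derive(Clone, Copy, Debug)]
struct Lane {
    sum: f64,
    delta: f64,
    min: f64,
    max: f64,
}

impl Lane {
    const EMPTY: Lane = Lane {
        sum: 0.0,
        delta: 0.0,
        min: f64::INFINITY,
        max: f64::NEG_INFINITY,
    };

    /// Classic Kahan step: `delta` carries the rounding error of the last add.
    fn kahan_add(&mut self, value: f64) {
        let y = value - self.delta;
        let t = self.sum + y;
        self.delta = (t - self.sum) - y;
        self.sum = t;
    }

    fn push(&mut self, value: f64) {
        self.kahan_add(value);
        self.min = self.min.min(value);
        self.max = self.max.max(value);
    }
}

/// Returns (sum, min, max) of `vals` (hypothetical helper).
pub fn reduce_block(vals: &[f64]) -> (f64, f64, f64) {
    let mut lanes = [Lane::EMPTY; LANES];
    let mut chunks = vals.chunks_exact(LANES);
    // Element k of each chunk goes to lane k, so the four recurrences
    // have no data dependency on each other.
    for chunk in &mut chunks {
        for (lane, &v) in lanes.iter_mut().zip(chunk) {
            lane.push(v);
        }
    }
    // Leftover values (fewer than LANES) go to the first lanes.
    for (lane, &v) in lanes.iter_mut().zip(chunks.remainder()) {
        lane.push(v);
    }
    // Assumed merge: add each lane's sum and its negated compensation
    // through one more Kahan accumulator.
    let mut total = Lane::EMPTY;
    for lane in &lanes {
        total.kahan_add(lane.sum);
        total.kahan_add(-lane.delta);
        total.min = total.min.min(lane.min);
        total.max = total.max.max(lane.max);
    }
    (total.sum, total.min, total.max)
}
```

Only the order of additions differs from a single-lane Kahan sum, which is why a tolerance-based comparison is the natural way to check such a change.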

`extended_stats` collects with Welford's online variance, which recomputes the running mean from the running sum each step. This is a strictly serial recurrence that cannot be vectorized in place.

Add `collect_block`, which accumulates the block into 4 independent `IntermediateExtendedStats` lanes and combines them with the existing `merge_fruits` (Chan parallel-variance combination). This is the exact operation already used to merge across segments, so results match multi-segment aggregation. Stays within the tight `EPSILON_FOR_TEST = 2e-12` tolerance; even the rounding-sensitive `test_aggregation_level1` passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
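The lane-plus-merge approach for variance can be illustrated with the textbook forms of Welford's update and Chan's pairwise combination. This is a sketch under assumptions: the `Moments` struct, its fields, and `collect_block_lanes` are hypothetical stand-ins and do not reflect the actual layout of `IntermediateExtendedStats` or the body of `collect_block`.

```rust
/// Running count, mean, and sum of squared deviations (M2).
#[derive(Clone, Copy, Debug, Default)]
struct Moments {
    count: u64,
    mean: f64,
    m2: f64,
}

impl Moments {
    /// Welford's online update: each step depends on the previous mean.
    fn push(&mut self, x: f64) {
        self.count += 1;
        let d = x - self.mean;
        self.mean += d / self.count as f64;
        self.m2 += d * (x - self.mean);
    }

    /// Chan et al. parallel combination of two partial results.
    fn merge(&mut self, other: &Moments) {
        if other.count == 0 {
            return;
        }
        if self.count == 0 {
            *self = *other;
            return;
        }
        let n = (self.count + other.count) as f64;
        let d = other.mean - self.mean;
        self.m2 += other.m2 + d * d * (self.count as f64) * (other.count as f64) / n;
        self.mean += d * other.count as f64 / n;
        self.count += other.count;
    }

    /// Population variance, or None for an empty input.
    fn variance_population(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.m2 / self.count as f64)
        }
    }
}

/// Accumulates a block into 4 independent lanes, then merges them.
fn collect_block_lanes(vals: &[f64]) -> Moments {
    let mut lanes = [Moments::default(); 4];
    let mut chunks = vals.chunks_exact(4);
    for chunk in &mut chunks {
        for (lane, &v) in lanes.iter_mut().zip(chunk) {
            lane.push(v);
        }
    }
    for (lane, &v) in lanes.iter_mut().zip(chunks.remainder()) {
        lane.push(v);
    }
    let mut total = Moments::default();
    for lane in &lanes {
        total.merge(lane);
    }
    total
}
```

The per-value division in `push` remains inside every lane, which is consistent with the PR's observation that extended_stats gains less than the Kahan-based reducers.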

guilload and others added 4 commits June 24, 2026 13:24

guilload wants to merge 4 commits into main from guilload/claude-perf-low-hanging-fruits
