Speed up range-query intersections via seek_danger on RangeDocSet (up to ~50x faster)#2963
PSeitz-dd wants to merge 1 commit into
882ea34 to
7577a0b
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 882ea344ed
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    // `target` is not in the docset. The next match is strictly greater than `target`, so
    // `target + 1` is a valid lower bound. We may leave the docset in an invalid state.
    SeekDangerResult::SeekLowerBound(target + 1)
Avoid returning only target+1 on sparse misses
When a sparse fast-field range is the non-leading side of an intersection whose lead scorer is moderately dense, Intersection::advance uses this returned lower bound as the next candidate, so returning only target + 1 forces a point lookup for nearly every lead posting until the next range hit. Before this override, the default seek_danger called seek, which let RangeDocSet use its batched range scan to jump to the next matching range doc; this change can therefore turn sparse-range intersections such as a ~50%-matching term AND a handful-of-docs range into millions of per-doc lookups.
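The call-count asymmetry described in this comment can be sketched with a toy model. Everything here is hypothetical and simplified: docsets are plain sorted slices, `intersect_with_lower_bound` and `intersect_with_seek` are illustrative names, and this is not the crate's `Intersection` code. The sketch only counts how often the non-leading side is consulted; it says nothing about the cost of each call, which is the point the PR description makes in favor of point lookups.

```rust
type DocId = u32;

/// Non-leading side answers a miss with `candidate + 1` (assumed model of the
/// lower-bound behavior). Returns the matches and the number of point lookups.
fn intersect_with_lower_bound(lead: &[DocId], range_docs: &[DocId]) -> (Vec<DocId>, usize) {
    let mut hits = Vec::new();
    let mut lookups = 0;
    let mut i = 0;
    while i < lead.len() {
        let candidate = lead[i];
        lookups += 1; // one point lookup per candidate
        if range_docs.binary_search(&candidate).is_ok() {
            hits.push(candidate);
            i += 1;
        } else {
            // The lead can only skip to the first doc >= candidate + 1,
            // which is simply its next posting.
            let lower_bound = candidate + 1;
            i += lead[i..].partition_point(|&d| d < lower_bound);
        }
    }
    (hits, lookups)
}

/// Non-leading side answers a miss with its next matching doc (assumed model
/// of a full seek). Returns the matches and the number of seeks.
fn intersect_with_seek(lead: &[DocId], range_docs: &[DocId]) -> (Vec<DocId>, usize) {
    let mut hits = Vec::new();
    let mut seeks = 0;
    let mut i = 0;
    while i < lead.len() {
        let candidate = lead[i];
        seeks += 1;
        let j = range_docs.partition_point(|&d| d < candidate);
        if j == range_docs.len() {
            break; // range side is exhausted
        }
        let next = range_docs[j];
        if next == candidate {
            hits.push(candidate);
            i += 1;
        } else {
            // The lead jumps straight to the next range doc.
            i += lead[i..].partition_point(|&d| d < next);
        }
    }
    (hits, seeks)
}
```

In this model the lower-bound variant performs one lookup per lead posting, while the seek variant performs a number of seeks proportional to the number of range docs. Which one wins in practice depends on how expensive each seek is relative to a point lookup.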
7577a0b to
a398adc
… to ~50x faster)

A regular `seek` on `RangeDocSet` is costly: on a miss it fetches blocks and scans the column forward to materialize the next matching doc. As a non-leading docset in an intersection that work is wasted: the driver only asks "does this candidate match?". `seek_danger` answers that with a cheap point lookup via `Column::values_for_doc`, returning a lower bound on a miss and leaving forward progress to the caller.

Forward `seek_danger` through `ConstScorer`.

Benchmarks (bool_queries_with_range, _all_results / DocSetCollector):

```
dense and 0.1% a
a_AND_num_rand:[0_TO_9]_all_results                             Avg: 0.0827ms (-4.60%)    Median: 0.0825ms (-4.82%)    [0.0809ms .. 0.0891ms]    Output: 43
a_AND_num_asc:[0_TO_9]_all_results                              Avg: 0.1937ms (-3.70%)    Median: 0.1930ms (-3.59%)    [0.1806ms .. 0.2044ms]    Output: 100
a_AND_num_rand_fast:[0_TO_9]_all_results                        Avg: 0.0367ms (-92.67%)   Median: 0.0365ms (-92.65%)   [0.0340ms .. 0.0398ms]    Output: 43
a_AND_num_asc_fast:[0_TO_9]_all_results                         Avg: 0.1052ms (-98.05%)   Median: 0.1050ms (-97.98%)   [0.1009ms .. 0.1117ms]    Output: 100
num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results    Avg: 2.7147ms (-51.42%)   Median: 2.7075ms (-49.58%)   [2.6806ms .. 2.7799ms]    Output: 968

dense and 1% a
a_AND_num_rand:[0_TO_9]_all_results                             Avg: 0.4373ms (-9.71%)    Median: 0.4357ms (-10.12%)   [0.4117ms .. 0.4711ms]    Output: 463
a_AND_num_asc:[0_TO_9]_all_results                              Avg: 0.2342ms (-2.50%)    Median: 0.2338ms (-2.56%)    [0.2247ms .. 0.2452ms]    Output: 1_054
a_AND_num_rand_fast:[0_TO_9]_all_results                        Avg: 0.3956ms (-82.86%)   Median: 0.3943ms (-82.90%)   [0.3815ms .. 0.4119ms]    Output: 463
a_AND_num_asc_fast:[0_TO_9]_all_results                         Avg: 0.4896ms (-91.16%)   Median: 0.4862ms (-90.81%)   [0.4797ms .. 0.5084ms]    Output: 1_054
num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results    Avg: 2.7108ms (-50.81%)   Median: 2.6925ms (-49.51%)   [2.6688ms .. 2.7868ms]    Output: 968

dense and 10% a
a_AND_num_rand:[0_TO_9]_all_results                             Avg: 0.9869ms (-3.71%)    Median: 0.9833ms (-3.83%)    [0.9518ms .. 1.1218ms]    Output: 4_914
a_AND_num_asc:[0_TO_9]_all_results                              Avg: 0.6352ms (-3.74%)    Median: 0.6363ms (-3.32%)    [0.6158ms .. 0.6488ms]    Output: 10_152
a_AND_num_rand_fast:[0_TO_9]_all_results                        Avg: 3.1264ms (+0.39%)    Median: 3.1466ms (+1.34%)    [3.0261ms .. 3.2051ms]    Output: 4_914
a_AND_num_asc_fast:[0_TO_9]_all_results                         Avg: 4.1547ms (-31.12%)   Median: 4.0933ms (-28.55%)   [3.7648ms .. 4.7600ms]    Output: 10_152
num_rand_fast:[0_TO_9]_AND_num_asc_fast:[0_TO_9]_all_results    Avg: 2.6973ms (-52.30%)   Median: 2.6901ms (-49.86%)   [2.6689ms .. 2.7677ms]    Output: 968
```

Gains are largest when the range query is the non-leading docset of a low-cardinality intersection.
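To make the contrast between a forward-scanning seek and a point lookup concrete, here is a minimal sketch on a toy column. It is an illustration under stated assumptions, not the PR's implementation: `ToyRangeDocSet` is a hypothetical single-valued, in-memory column, the `SeekDangerResult` enum is a simplified stand-in for the type named above, and the real code reads values through `Column::values_for_doc` rather than indexing a `Vec`.

```rust
use std::ops::RangeInclusive;

type DocId = u32;

/// Simplified stand-in for the result type named in the PR.
#[derive(Debug, PartialEq)]
enum SeekDangerResult {
    /// The target doc matches.
    Found,
    /// The target doc does not match; no match exists below this doc id.
    SeekLowerBound(DocId),
}

/// Hypothetical toy column: `values[doc]` is the single value stored for `doc`.
struct ToyRangeDocSet {
    values: Vec<u64>,
    range: RangeInclusive<u64>,
}

impl ToyRangeDocSet {
    /// Regular seek: scan forward from `target` until a doc falls in the range.
    /// Returns the matching doc (if any) and the number of column reads spent.
    fn seek(&self, target: DocId) -> (Option<DocId>, usize) {
        let mut reads = 0;
        for doc in (target as usize)..self.values.len() {
            reads += 1;
            if self.range.contains(&self.values[doc]) {
                return (Some(doc as DocId), reads);
            }
        }
        (None, reads)
    }

    /// Point lookup: read only the value of `target`. On a miss, report
    /// `target + 1` as a lower bound and leave forward progress to the caller.
    fn seek_danger(&self, target: DocId) -> SeekDangerResult {
        match self.values.get(target as usize) {
            Some(value) if self.range.contains(value) => SeekDangerResult::Found,
            // Also covers targets past the end of this toy column.
            _ => SeekDangerResult::SeekLowerBound(target.saturating_add(1)),
        }
    }
}
```

With a column whose only in-range doc lies far ahead of the target, `seek` reads every value in between, whereas `seek_danger` reads exactly one value and hands the decision about where to go next back to the intersection driver.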
a398adc to
ca0ed87
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca0ed87d79