Fix MoreLikeThis panic on indexes with deleted documents #2964
Open
stumpylog wants to merge 1 commit into
create_score_term used the alive-only document count (num_docs) as the idf denominator, while the term doc_freq it reads from the term dictionary still counts soft-deleted documents until a merge expunges them. After deletions this could make doc_freq > num_docs, tripping the `doc_count >= doc_freq` assertion in idf() and panicking the search thread.

Use max_doc (which includes deleted documents) instead, matching how the standard BM25 scorer computes total_num_docs. Add a regression test that deletes documents and runs a MoreLikeThis query.

Discovered downstream in paperless-ngx: paperless-ngx/paperless-ngx#13024

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d0dc492 to
729d274
Compare
stumpylog
commented
Jun 17, 2026
================================

## Bugfixes
- Fix `MoreLikeThis` panic on indexes with deleted documents [#2964](https://github.com/quickwit-oss/tantivy/pull/2964)(@stumpylog)
Author
I was a little confused about the released vs not state. There's a 0.26.1 tag (https://github.com/quickwit-oss/tantivy/releases/tag/0.26.1) but the header here says 0.26.0 is (Unreleased). So just let me know if this needs moving/updating.
What
`MoreLikeThisQuery` panics with `doc_count >= doc_freq` (in `idf()`, `src/query/bm25.rs`) when the searched index contains soft-deleted documents.

Why
In `MoreLikeThis::create_score_term` the idf denominator is computed from `SegmentReader::num_docs()`, which counts alive documents only.

`searcher.doc_freq(term)` reads `doc_freq` straight from the term dictionary, and that value keeps counting soft-deleted documents until a merge expunges them. So after documents are deleted (e.g. a bulk import followed by deletions, with no merge in between), a common term's `doc_freq` can exceed the alive-only `num_docs`, and `idf()` trips its `assert!(doc_count >= doc_freq)`.

Ordinary search does not hit this because the standard BM25 statistics provider uses `Searcher::total_num_docs()`, which sums `SegmentReader::max_doc()` (deleted documents included), consistent with how `doc_freq` is counted.

Fix
Use `max_doc()` instead of `num_docs()` for the MLT idf denominator, so the document count and `doc_freq` are counted consistently (both include deleted documents). This mirrors `total_num_docs()` used by the regular BM25 scorer.

Test
Adds `test_more_like_this_query_with_deleted_documents`: indexes six documents sharing body terms (with `NoMergePolicy` so postings survive), deletes two, and runs a MoreLikeThis query against a survivor. It panics on `main` and passes with this change.

Context
Discovered downstream in paperless-ngx, where "More Like This" returns HTTP 500 after bulk import/delete cycles: paperless-ngx/paperless-ngx#13024
Developed with assistance from Claude Code.
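
For readers who want to see the count mismatch in isolation, here is a minimal standard-library sketch using the numbers from the regression test (six documents, two deleted, one term shared by all). `SegmentStats` is a hypothetical stand-in, not a tantivy type, and the idf formula is the common BM25 form, assumed here rather than copied from `src/query/bm25.rs`. The sketch returns `None` where the real `idf()` would hit its assertion.

```rust
/// Hypothetical per-segment statistics for a single term (not tantivy's API).
struct SegmentStats {
    /// Every document ever added to the segment, soft-deleted ones included.
    max_doc: u64,
    /// Documents marked deleted but not yet expunged by a merge.
    num_deleted: u64,
    /// Postings count for the term; still includes soft-deleted documents.
    doc_freq: u64,
}

impl SegmentStats {
    /// Alive-only document count, analogous to `num_docs()`.
    fn num_docs(&self) -> u64 {
        self.max_doc - self.num_deleted
    }
}

/// BM25-style idf (assumed form). Returns `None` when the precondition
/// `doc_count >= doc_freq` is violated, where an assertion would panic.
fn idf(doc_freq: u64, doc_count: u64) -> Option<f32> {
    if doc_count < doc_freq {
        return None;
    }
    let x = ((doc_count - doc_freq) as f32 + 0.5) / (doc_freq as f32 + 0.5);
    Some((1.0 + x).ln())
}

fn main() {
    // Six documents share the term; two were deleted and no merge has run.
    let seg = SegmentStats { max_doc: 6, num_deleted: 2, doc_freq: 6 };

    // Alive-only count: doc_freq (6) exceeds num_docs (4), so the precondition fails.
    println!("with num_docs: {:?}", idf(seg.doc_freq, seg.num_docs()));

    // Count including deleted documents: both sides are counted the same way.
    println!("with max_doc:  {:?}", idf(seg.doc_freq, seg.max_doc));
}
```

With `max_doc` as the document count, the precondition holds by construction, since a term cannot appear in more documents than the segment has ever contained.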