Fix MoreLikeThis panic on indexes with deleted documents #2964

Open
stumpylog wants to merge 1 commit into quickwit-oss:main from stumpylog:fix/more-like-this-deleted-docs-panic

@stumpylog

What

MoreLikeThisQuery panics on the assert!(doc_count >= doc_freq) check in idf() (src/query/bm25.rs) when the searched index contains soft-deleted documents.

thread '<unnamed>' panicked at src/query/bm25.rs:53:5:
95 >= 100

Why

In MoreLikeThis::create_score_term, the document count passed to idf() is computed from SegmentReader::num_docs(), which counts alive documents only:

let num_docs = searcher
    .segment_readers()
    .iter()
    .map(|segment_reader| segment_reader.num_docs() as u64)
    .sum::<u64>();
...
let doc_freq = searcher.doc_freq(term)?;
let idf = idf(doc_freq, num_docs);

searcher.doc_freq(term) reads doc_freq straight from the term dictionary, and that value keeps counting soft-deleted documents until a merge expunges them. So after documents are deleted (e.g. a bulk import followed by deletions, with no merge in between), a common term's doc_freq can exceed the alive-only num_docs, and idf() trips its assert!(doc_count >= doc_freq).

Ordinary search does not hit this because the standard BM25 statistics provider uses Searcher::total_num_docs(), which sums SegmentReader::max_doc() (deleted documents included) — consistent with how doc_freq is counted.

Fix

Use max_doc() instead of num_docs() for the document count passed to idf() in MLT, so that the document count and doc_freq are counted consistently (both include deleted documents). This mirrors total_num_docs() used by the regular BM25 scorer.
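
To make the mismatch concrete, here is a minimal, dependency-free sketch of the two counting strategies. It is not tantivy code: `ToySegment`, `alive_doc_count`, `total_doc_count`, and `checked_idf` are hypothetical names introduced for illustration, and the idf body assumes the usual BM25 form ln(1 + (N - n + 0.5) / (n + 0.5)). Only the `doc_count >= doc_freq` precondition is taken from the description above.

```rust
/// Toy stand-in for a segment (hypothetical, not tantivy's `SegmentReader`).
/// `max_doc` counts every document written to the segment;
/// `num_deleted` counts the soft-deleted ones among them.
struct ToySegment {
    max_doc: u32,
    num_deleted: u32,
}

impl ToySegment {
    /// Alive documents only, analogous to `num_docs()` in the description.
    fn num_docs(&self) -> u32 {
        self.max_doc - self.num_deleted
    }
}

/// Alive-only count: the document count MoreLikeThis used before the fix.
fn alive_doc_count(segments: &[ToySegment]) -> u64 {
    segments.iter().map(|s| u64::from(s.num_docs())).sum()
}

/// Count that includes deleted documents: what the fix switches to.
fn total_doc_count(segments: &[ToySegment]) -> u64 {
    segments.iter().map(|s| u64::from(s.max_doc)).sum()
}

/// Non-panicking idf: returns `None` where an asserting helper would panic.
/// Assumption: standard BM25 idf, ln(1 + (N - n + 0.5) / (n + 0.5)).
fn checked_idf(doc_freq: u64, doc_count: u64) -> Option<f32> {
    if doc_count < doc_freq {
        return None;
    }
    let x = ((doc_count - doc_freq) as f32 + 0.5) / (doc_freq as f32 + 0.5);
    Some((1.0 + x).ln())
}
```

With one segment of 100 documents of which 5 are deleted, and a term whose doc_freq is still 100 because no merge has run, the alive-only count (95) violates the precondition, while the max_doc-based count (100) satisfies it. These numbers are chosen to mirror the `95 >= 100` panic message above.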

Test

Adds test_more_like_this_query_with_deleted_documents: indexes six documents sharing body terms (with NoMergePolicy so postings survive), deletes two, and runs a MoreLikeThis query against a survivor. It panics on main and passes with this change.

Context

Discovered downstream in paperless-ngx, where "More Like This" returns HTTP 500 after bulk import/delete cycles: paperless-ngx/paperless-ngx#13024

Developed with assistance from Claude Code.

create_score_term used the alive-only document count (num_docs) as the
idf denominator, while the term doc_freq it reads from the term
dictionary still counts soft-deleted documents until a merge expunges
them. After deletions this could make doc_freq > num_docs, tripping the
`doc_count >= doc_freq` assertion in idf() and panicking the search
thread.

Use max_doc (which includes deleted documents) instead, matching how the
standard BM25 scorer computes total_num_docs. Add a regression test that
deletes documents and runs a MoreLikeThis query.

Discovered downstream in paperless-ngx:
paperless-ngx/paperless-ngx#13024

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@stumpylog force-pushed the fix/more-like-this-deleted-docs-panic branch from d0dc492 to 729d274 on June 17, 2026 14:53
Comment thread CHANGELOG.md
================================

## Bugfixes
- Fix `MoreLikeThis` panic on indexes with deleted documents [#2964](https://github.com/quickwit-oss/tantivy/pull/2964)(@stumpylog)

Author

I was a little confused about the released vs. unreleased state. There's a 0.26.1 tag (https://github.com/quickwit-oss/tantivy/releases/tag/0.26.1), but the header here says 0.26.0 is (Unreleased). Just let me know if this entry needs moving or updating.
