feat: add custom doc id mapping finalization #2978
Draft
Mallets wants to merge 6 commits into
Co-authored-by: Cursor <cursoragent@cursor.com>
…gs initializer
Struct literal was missing the new field introduced in the manual doc id mapping refactor, causing a compile error under --all-features.
Co-authored-by: Cursor <cursoragent@cursor.com>

The regression test calls finalize_with_doc_id_mapping, so the index must opt into the temporary docstore path before constructing the segment writer.
Co-authored-by: Cursor <cursoragent@cursor.com>
What
Adds manual doc id mapping finalization for segment writers. Callers can provide a `DocIdMapping` at finalize time, and Tantivy will serialize postings, fast fields, fieldnorms, docstore, and opstamps in the mapped doc id order.

Why
Index sorting already remaps doc ids internally from `sort_by_field`. This PR exposes the same finalization machinery for callers that compute their own doc id order.

Changes
- Add `SegmentWriter::finalize_with_doc_id_mapping(&DocIdMapping)`.
- Add `SingleSegmentIndexWriter::finalize_with_doc_id_mapping(&DocIdMapping)`.
- Add `IndexSettings::manual_doc_id_mapping` (`#[serde(skip)]`) to opt into the temporary docstore path needed by manual mapping.
- Update `SegmentSerializer::for_segment(...)` so it writes docs to `TempStore` when either `sort_by_field` is configured or `manual_doc_id_mapping` is enabled.
- Keep `SegmentWriter::for_segment(...)` as the single constructor path; manual mapping is controlled by settings rather than extra constructors.
- Validate manual mappings: the mapping length must equal `max_doc`, old doc ids must be in range, and old doc ids must not repeat.
- Make `common::BitSet::insert()` return whether the set changed, and use that for duplicate detection in mapping validation.
- Export `DocIdMapping` from `indexer`, restrict internal accessors to `pub(crate)`, and document the permutation requirement for `DocIdMapping::from_new_id_to_old_id()`.

Notes
- `manual_doc_id_mapping` is a runtime construction flag only; it is not written to `meta.json`.
- `sort_by_field` behavior is preserved.

Test plan
- `cargo fmt`
- `cargo test --lib core::tests::test_single_segment_index_writer`
- `cargo test --lib indexer::segment_writer::tests::test_finalize_with_doc_id_mapping`
- `cargo test --lib index::index_meta::tests::test_index_settings_default`
- `cargo test --lib indexer`
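
For reviewers, the mapping validation rules listed under Changes (length equals `max_doc`, old doc ids in range, no repeated old doc ids) can be sketched in isolation. This is a standalone illustration, not the code in this PR: the local `BitSet`, `MappingError`, `validate_new_to_old`, and `reorder` are hypothetical names and do not mirror Tantivy's actual types or signatures. The only idea borrowed from the description is that `insert` reports whether the set changed, which makes duplicate detection a single check per doc id.

```rust
/// Hypothetical fixed-size bitset, standing in for `common::BitSet`.
/// `insert` returns true only if the bit was not already set.
struct BitSet {
    words: Vec<u64>,
}

impl BitSet {
    fn with_max_value(max_value: u32) -> Self {
        let num_words = (max_value as usize + 63) / 64;
        BitSet { words: vec![0u64; num_words] }
    }

    fn insert(&mut self, el: u32) -> bool {
        let word = &mut self.words[(el / 64) as usize];
        let mask = 1u64 << (el % 64);
        let changed = (*word & mask) == 0;
        *word |= mask;
        changed
    }
}

/// Hypothetical error type for the three rules named in the description.
#[derive(Debug, PartialEq)]
enum MappingError {
    LengthMismatch { expected: u32, actual: usize },
    OutOfRange { old_doc_id: u32 },
    Duplicate { old_doc_id: u32 },
}

/// Checks that `new_to_old` is a permutation of `0..max_doc`.
/// Index = new doc id, value = old doc id.
fn validate_new_to_old(new_to_old: &[u32], max_doc: u32) -> Result<(), MappingError> {
    if new_to_old.len() != max_doc as usize {
        return Err(MappingError::LengthMismatch {
            expected: max_doc,
            actual: new_to_old.len(),
        });
    }
    let mut seen = BitSet::with_max_value(max_doc);
    for &old_doc_id in new_to_old {
        if old_doc_id >= max_doc {
            return Err(MappingError::OutOfRange { old_doc_id });
        }
        // `insert` returning false means this old doc id was already mapped.
        if !seen.insert(old_doc_id) {
            return Err(MappingError::Duplicate { old_doc_id });
        }
    }
    Ok(())
}

/// Reorders per-document values into the mapped (new) doc id order.
fn reorder<T: Clone>(values: &[T], new_to_old: &[u32]) -> Vec<T> {
    new_to_old
        .iter()
        .map(|&old_doc_id| values[old_doc_id as usize].clone())
        .collect()
}
```

Because the length check runs first, a mapping that passes the range and duplicate checks is necessarily a full permutation, which is why no separate "every old doc id is present" pass is needed in this sketch.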