feat: add custom doc id mapping finalization#2978

Draft
Mallets wants to merge 6 commits into main from mallets/finalize-docidmapping

@Mallets Mallets commented Jun 26, 2026

What

Adds manual doc id mapping finalization for segment writers. Callers can provide a DocIdMapping at finalize time, and Tantivy will serialize postings, fast fields, fieldnorms, docstore, and opstamps in the mapped doc id order.

Why

Index sorting already remaps doc ids internally when sort_by_field is configured. This PR exposes the same finalization machinery to callers that compute their own doc id order.
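
As an illustration of what "computing their own doc id order" could look like on the caller side, here is a minimal sketch that derives a new-to-old permutation from an arbitrary per-document sort key. The helper name `new_to_old_by_key` is hypothetical and is not part of this PR or of Tantivy; the resulting vector is the kind of input that DocIdMapping::from_new_id_to_old_id() is documented to expect (a permutation of the old doc ids).

```rust
/// Hypothetical helper (not Tantivy API): builds a new-to-old doc id
/// permutation by sorting the old doc ids on a caller-supplied key.
/// `keys[old_doc_id]` is the sort key of that document.
fn new_to_old_by_key<K: Ord>(keys: &[K]) -> Vec<u32> {
    // Start from the identity order: new id i maps to old id i.
    let mut new_to_old: Vec<u32> = (0..keys.len() as u32).collect();
    // `sort_by` is stable, so documents with equal keys keep their
    // original relative order.
    new_to_old.sort_by(|&a, &b| keys[a as usize].cmp(&keys[b as usize]));
    new_to_old
}
```

With keys `[30, 10, 20]`, the document that was old id 1 becomes new id 0, old id 2 becomes new id 1, and old id 0 becomes new id 2.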

Changes

  • Add SegmentWriter::finalize_with_doc_id_mapping(&DocIdMapping).
  • Add SingleSegmentIndexWriter::finalize_with_doc_id_mapping(&DocIdMapping).
  • Add hidden, non-persisted IndexSettings::manual_doc_id_mapping (#[serde(skip)]) to opt into the temporary docstore path needed by manual mapping.
  • Extend SegmentSerializer::for_segment(...) so it writes docs to TempStore when either sort_by_field is configured or manual_doc_id_mapping is enabled.
  • Keep SegmentWriter::for_segment(...) as the single constructor path; manual mapping is controlled by settings rather than extra constructors.
  • Validate caller-provided mappings before serialization: length must match max_doc, old doc ids must be in range, and old doc ids must not repeat.
  • Make common::BitSet::insert() return whether the set changed, and use that for duplicate detection in mapping validation.
  • Export DocIdMapping from indexer, restrict internal accessors to pub(crate), and document the permutation requirement for DocIdMapping::from_new_id_to_old_id().
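
The three validation rules above (length equals max_doc, old doc ids in range, no repeats) can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the code in this PR: `TinyBitSet`, `MappingError`, and `validate_new_to_old` are hypothetical names, and the bitset is a minimal stand-in whose `insert` returns whether the set changed, mirroring the behavior described for common::BitSet::insert().

```rust
/// Minimal stand-in bitset (hypothetical, not common::BitSet).
struct TinyBitSet {
    words: Vec<u64>,
}

impl TinyBitSet {
    /// Creates a set able to hold values in `0..max_value`.
    fn with_max_value(max_value: u32) -> Self {
        let num_words = (max_value as usize + 63) / 64;
        TinyBitSet { words: vec![0u64; num_words] }
    }

    /// Inserts `el` and returns true if it was not already present.
    fn insert(&mut self, el: u32) -> bool {
        let word = &mut self.words[(el / 64) as usize];
        let mask = 1u64 << (el % 64);
        let changed = *word & mask == 0;
        *word |= mask;
        changed
    }
}

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum MappingError {
    LengthMismatch { expected: u32, actual: usize },
    OutOfRange { old_doc_id: u32, max_doc: u32 },
    Duplicate { old_doc_id: u32 },
}

/// Checks that `new_to_old` is a permutation of `0..max_doc`.
fn validate_new_to_old(new_to_old: &[u32], max_doc: u32) -> Result<(), MappingError> {
    if new_to_old.len() != max_doc as usize {
        return Err(MappingError::LengthMismatch {
            expected: max_doc,
            actual: new_to_old.len(),
        });
    }
    let mut seen = TinyBitSet::with_max_value(max_doc);
    for &old_doc_id in new_to_old {
        if old_doc_id >= max_doc {
            return Err(MappingError::OutOfRange { old_doc_id, max_doc });
        }
        // `insert` returning false means this old doc id was already seen.
        if !seen.insert(old_doc_id) {
            return Err(MappingError::Duplicate { old_doc_id });
        }
    }
    Ok(())
}
```

Because the length check comes first, the range and duplicate checks together guarantee that every old doc id appears exactly once, which is the permutation requirement.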

Notes

  • manual_doc_id_mapping is a runtime construction flag only; it is not written to meta.json.
  • The existing sort_by_field behavior is preserved.
  • Fieldnorm buffers are padded before remapping because remapping indexes them by old doc id.
  • Single-segment writer support includes a regression test for remapped stored documents and fieldnorms.
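
To make the fieldnorm note concrete, here is a rough sketch of why padding has to happen before remapping. It assumes a per-field buffer with one byte per document that may be shorter than max_doc when trailing documents never had the field, and it assumes the padding value is 0; `remap_fieldnorms` is a hypothetical helper, not the function used in this PR.

```rust
/// Hypothetical helper: reorders a fieldnorm buffer into new doc id order.
/// `new_to_old[new_doc_id]` is the old doc id; its length plays the role
/// of max_doc.
fn remap_fieldnorms(fieldnorms: &[u8], new_to_old: &[u32]) -> Vec<u8> {
    let max_doc = new_to_old.len();
    let mut padded = fieldnorms.to_vec();
    // Pad first (assumed fill value: 0). Without this, looking up an old
    // doc id past the end of a short buffer would panic.
    if padded.len() < max_doc {
        padded.resize(max_doc, 0u8);
    }
    // Remapping indexes the buffer by OLD doc id.
    new_to_old.iter().map(|&old| padded[old as usize]).collect()
}
```

For a buffer `[3, 7]` covering only the first two of three documents and a mapping `[2, 0, 1]`, the buffer is padded to `[3, 7, 0]` and then reordered to `[0, 3, 7]`.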

Test plan

  • cargo fmt
  • cargo test --lib core::tests::test_single_segment_index_writer
  • cargo test --lib indexer::segment_writer::tests::test_finalize_with_doc_id_mapping
  • cargo test --lib index::index_meta::tests::test_index_settings_default
  • cargo test --lib indexer

Co-authored-by: Cursor <cursoragent@cursor.com>
@Mallets Mallets self-assigned this Jun 26, 2026
Mallets and others added 5 commits June 26, 2026 16:24
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…gs initializer

Struct literal was missing the new field introduced in the manual doc id
mapping refactor, causing a compile error under --all-features.

Co-authored-by: Cursor <cursoragent@cursor.com>
The regression test calls finalize_with_doc_id_mapping, so the index must opt into the temporary docstore path before constructing the segment writer.

Co-authored-by: Cursor <cursoragent@cursor.com>