feat: support ONNX models that require token_type_ids by j-sperling · Pull Request #607 · zilliztech/memsearch

j-sperling · 2026-07-05T06:26:12Z

Summary

BERT-family ONNX exports declare a token_type_ids input, and session.run() requires every declared input in the feed, so models like Xenova/all-MiniLM-L6-v2 and Xenova/bge-small-en-v1.5 currently fail with "Required inputs (['token_type_ids']) are missing". This PR detects the declared inputs once at init and feeds all-zero segment ids when required, which is correct for single-sequence embedding. XLM-R-family models (the bge-m3 default) are unaffected.
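For reviewers, the detect-once-then-feed-zeros approach can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names declared_input_names and build_feed are hypothetical, and the int64 dtype is an assumption based on typical BERT-family exports. The only ONNX Runtime call relied on is InferenceSession.get_inputs(), whose entries expose a .name attribute.

```python
import numpy as np


def declared_input_names(session) -> frozenset:
    """Collect the input names a session declares (hypothetical helper).

    Works with any object whose get_inputs() returns items with a .name
    attribute, as onnxruntime.InferenceSession does.
    """
    return frozenset(node.name for node in session.get_inputs())


def build_feed(input_ids, attention_mask, needs_token_type_ids: bool) -> dict:
    """Build the feed dict for session.run() (hypothetical helper).

    int64 is assumed here; check the model's declared input types.
    """
    feed = {
        "input_ids": np.asarray(input_ids, dtype=np.int64),
        "attention_mask": np.asarray(attention_mask, dtype=np.int64),
    }
    if needs_token_type_ids:
        # Single-sequence embedding: every token belongs to segment 0,
        # so the segment ids are zeros with the same shape as input_ids.
        feed["token_type_ids"] = np.zeros_like(feed["input_ids"])
    return feed
```

Computing needs_token_type_ids once at init, as "token_type_ids" in declared_input_names(session), keeps the per-batch path to a single boolean check.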
Why it matters: it unlocks a small-model tier for bulk ingest without touching the default. Verified locally: Xenova/bge-small-en-v1.5 (33M, CLS-pooling-native — matching this provider's CLS pooling) loads, passes a semantic sanity check, and embeds a 64-text batch in ~0.02s vs ~6.6s for the 568M int8 bge-m3 default on the same CPU. Corpus-scale context: indexing 186K chunks with the default took >4.5h on an M-series CPU; a small-tier model brings that to minutes via embedding.model config, no code change.
Deliberately not changing DEFAULT_MODELS["onnx"]: a default swap changes embedding dimension (1024 → 384) and would break existing Milvus collections on upgrade. Users opt in per install via memsearch config set embedding.model.
Caveat worth documenting: Xenova/multilingual-e5-small now loads but is not recommended — e5 models expect query:/passage: prefixes and mean pooling, and scored poorly on the sanity check under CLS pooling.
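To make the pooling caveat concrete, here is a small sketch of the two strategies over a model's last hidden state. The function names are illustrative and not part of this provider; the shapes assume a (batch, seq_len, hidden) output and a (batch, seq_len) attention mask.

```python
import numpy as np


def cls_pool(last_hidden_state):
    """CLS pooling: take the vector at the first token position."""
    return last_hidden_state[:, 0, :]


def mean_pool(last_hidden_state, attention_mask):
    """Mean pooling: average token vectors, ignoring padded positions."""
    mask = np.asarray(attention_mask)[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts
```

A model trained for mean pooling places no particular meaning on the first-token vector, which is one plausible reason an e5 model scores poorly when only that vector is used.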

Test plan

tests/test_embeddings_onnx_inputs.py: stub-session tests verifying that token_type_ids is fed as zeros (shape-matched to input_ids) when the model declares it, and omitted when it does not; no model download needed
Full suite: 238 passed, 7 skipped
ruff check / ruff format --check clean
Live load + embed + semantic sanity for bge-small-en-v1.5, all-MiniLM-L6-v2, multilingual-e5-small; default bge-m3 path unchanged
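For readers unfamiliar with the stub-session approach, the sketch below shows the general shape of such a test. It is not the contents of tests/test_embeddings_onnx_inputs.py: StubSession, StubNodeArg, and TinyProvider are hypothetical stand-ins, and the provider here takes pre-tokenized arrays to keep the example free of tokenizer and model downloads.

```python
import numpy as np


class StubNodeArg:
    """Mimics the objects returned by a session's get_inputs()."""

    def __init__(self, name):
        self.name = name


class StubSession:
    """Stands in for an ONNX Runtime session and records the feed it gets."""

    def __init__(self, input_names, hidden_size=4):
        self._inputs = [StubNodeArg(n) for n in input_names]
        self._hidden_size = hidden_size
        self.last_feed = None

    def get_inputs(self):
        return self._inputs

    def run(self, output_names, feed):
        self.last_feed = feed
        batch, seq_len = feed["input_ids"].shape
        return [np.zeros((batch, seq_len, self._hidden_size), dtype=np.float32)]


class TinyProvider:
    """Hypothetical stand-in for the provider under test."""

    def __init__(self, session):
        self._session = session
        # Input names stay local to __init__; only the flag is stored.
        names = {node.name for node in session.get_inputs()}
        self._needs_token_type_ids = "token_type_ids" in names

    def embed_ids(self, input_ids, attention_mask):
        feed = {"input_ids": input_ids, "attention_mask": attention_mask}
        if self._needs_token_type_ids:
            feed["token_type_ids"] = np.zeros_like(input_ids)
        hidden = self._session.run(None, feed)[0]
        return hidden[:, 0, :]  # CLS pooling


IDS = np.array([[101, 7, 102], [101, 8, 102]], dtype=np.int64)
MASK = np.ones_like(IDS)


def test_token_type_ids_fed_when_declared():
    session = StubSession(["input_ids", "attention_mask", "token_type_ids"])
    TinyProvider(session).embed_ids(IDS, MASK)
    segment_ids = session.last_feed["token_type_ids"]
    assert segment_ids.shape == IDS.shape
    assert not segment_ids.any()


def test_token_type_ids_omitted_when_not_declared():
    session = StubSession(["input_ids", "attention_mask"])
    TinyProvider(session).embed_ids(IDS, MASK)
    assert "token_type_ids" not in session.last_feed
```

Because the stub records the feed rather than running a model, both branches are exercised in milliseconds and without network access.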

BERT-family ONNX exports (Xenova/all-MiniLM-L6-v2, Xenova/bge-small-en-v1.5) declare a token_type_ids input, and session.run() requires every declared input to be fed, so such models fail to embed with "Required inputs (['token_type_ids']) are missing". Detect the declared inputs once at init and feed all-zero segment ids when required (single-sequence embedding). This unlocks a small-model tier for bulk ingest: bge-small-en-v1.5 (33M, CLS-pooling-native, matching this provider's pooling) embeds a 64-text batch in ~0.02s vs ~6.6s for the 568M default on the same CPU.

j-sperling added 3 commits July 4, 2026 23:26

test: build stub encoding in __init__ to satisfy RUF012

a7c882b

Keep session input names local to __init__

d670b45

j-sperling wants to merge 3 commits into zilliztech:main from j-sperling:feat/onnx-token-type-ids
