feat: support ONNX models that require token_type_ids #607
Draft
j-sperling wants to merge 3 commits into
BERT-family ONNX exports (Xenova/all-MiniLM-L6-v2, Xenova/bge-small-en-v1.5) declare a token_type_ids input, and session.run() requires every declared input to be fed, so such models fail to embed with "Required inputs (['token_type_ids']) are missing". Detect the declared inputs once at init and feed all-zero segment ids when required (single-sequence embedding). This unlocks a small-model tier for bulk ingest: bge-small-en-v1.5 (33M, CLS-pooling-native, matching this provider's pooling) embeds a 64-text batch in ~0.02s vs ~6.6s for the 568M default on the same CPU.
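A minimal sketch of the detect-once-and-feed-zeros idea described above, not the provider's actual code. `StubSession`, `StubInput`, and `FeedBuilder` are hypothetical names introduced for illustration. The stub only imitates the `get_inputs()` method of `onnxruntime.InferenceSession`, whose returned entries expose a `.name` attribute, so the example runs without downloading a model.

```python
import numpy as np


class StubInput:
    """Hypothetical stand-in for an onnxruntime input descriptor; only .name is used."""

    def __init__(self, name):
        self.name = name


class StubSession:
    """Hypothetical stand-in for onnxruntime.InferenceSession."""

    def __init__(self, input_names):
        self._inputs = [StubInput(n) for n in input_names]

    def get_inputs(self):
        return self._inputs


class FeedBuilder:
    """Hypothetical helper; the real provider may be structured differently."""

    def __init__(self, session):
        # Inspect the declared inputs once, at init, rather than on every batch.
        declared = {i.name for i in session.get_inputs()}
        self.needs_token_type_ids = "token_type_ids" in declared

    def build(self, input_ids, attention_mask):
        feed = {"input_ids": input_ids, "attention_mask": attention_mask}
        if self.needs_token_type_ids:
            # Single-sequence embedding: every token belongs to segment 0.
            feed["token_type_ids"] = np.zeros_like(input_ids)
        return feed


ids = np.array([[101, 7592, 102, 0]], dtype=np.int64)
mask = np.array([[1, 1, 1, 0]], dtype=np.int64)

# BERT-style export: declares token_type_ids, so zeros are added to the feed.
bert_feed = FeedBuilder(
    StubSession(["input_ids", "attention_mask", "token_type_ids"])
).build(ids, mask)

# XLM-R-style export: does not declare token_type_ids, so the feed is unchanged.
xlmr_feed = FeedBuilder(StubSession(["input_ids", "attention_mask"])).build(ids, mask)
```

`np.zeros_like(input_ids)` keeps both the shape and the integer dtype of `input_ids`, so the extra input needs no separate shape bookkeeping.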
Summary
- BERT-family ONNX exports declare a `token_type_ids` input, and `session.run()` requires every declared input in the feed, so models like `Xenova/all-MiniLM-L6-v2` and `Xenova/bge-small-en-v1.5` currently fail with `Required inputs (['token_type_ids']) are missing`. This PR detects the declared inputs once at init and feeds all-zero segment ids when required (correct for single-sequence embedding). XLM-R-family models (the bge-m3 default) are unaffected.
- `Xenova/bge-small-en-v1.5` (33M, CLS-pooling-native, matching this provider's CLS pooling) loads, passes a semantic sanity check, and embeds a 64-text batch in ~0.02s vs ~6.6s for the 568M int8 bge-m3 default on the same CPU. Corpus-scale context: indexing 186K chunks with the default took >4.5h on an M-series CPU; a small-tier model brings that to minutes via the `embedding.model` config, with no code change.
- `DEFAULT_MODELS["onnx"]` is left unchanged: a default swap changes the embedding dimension (1024 → 384) and would break existing Milvus collections on upgrade. Users opt in per install via `memsearch config set embedding.model`.
- `Xenova/multilingual-e5-small` now loads but is not recommended: e5 models expect `query:`/`passage:` prefixes and mean pooling, and this model scored poorly on the sanity check under CLS pooling.
Test plan
- `tests/test_embeddings_onnx_inputs.py`: stub-session tests verifying that `token_type_ids` is fed as zeros (shape-matched to `input_ids`) when declared, and omitted when not; no model download needed
- `ruff check` / `ruff format --check` clean
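A stub-session test of the kind listed in the test plan might look roughly like the sketch below. It is not the contents of `tests/test_embeddings_onnx_inputs.py`: `RecordingSession` and `make_feed` are hypothetical names. The stub's `run()` imitates the missing-input error quoted in the summary; raising it as `ValueError` is an assumption of this sketch. For brevity the helper checks the declared inputs on each call rather than once at init.

```python
import numpy as np


class RecordingSession:
    """Hypothetical stub: records the feed and rejects missing declared inputs."""

    def __init__(self, input_names):
        self.input_names = list(input_names)
        self.last_feed = None

    def run(self, output_names, input_feed):
        missing = [n for n in self.input_names if n not in input_feed]
        if missing:
            # Imitates the error text quoted in the summary (assumed ValueError).
            raise ValueError(f"Required inputs ({missing}) are missing")
        self.last_feed = input_feed
        batch, seq_len = input_feed["input_ids"].shape
        # Fake hidden states with an arbitrary hidden size of 4.
        return [np.zeros((batch, seq_len, 4), dtype=np.float32)]


def make_feed(declared, input_ids, attention_mask):
    """Hypothetical helper under test; the real provider code may differ."""
    feed = {"input_ids": input_ids, "attention_mask": attention_mask}
    if "token_type_ids" in declared:
        feed["token_type_ids"] = np.zeros_like(input_ids)
    return feed


def test_token_type_ids_fed_as_zeros_when_declared():
    session = RecordingSession(["input_ids", "attention_mask", "token_type_ids"])
    ids = np.array([[101, 2023, 102], [101, 102, 0]], dtype=np.int64)
    mask = (ids != 0).astype(np.int64)
    session.run(None, make_feed(session.input_names, ids, mask))
    fed = session.last_feed["token_type_ids"]
    assert fed.shape == ids.shape
    assert not fed.any()


def test_token_type_ids_omitted_when_not_declared():
    session = RecordingSession(["input_ids", "attention_mask"])
    ids = np.array([[0, 9, 2]], dtype=np.int64)
    mask = np.ones_like(ids)
    session.run(None, make_feed(session.input_names, ids, mask))
    assert "token_type_ids" not in session.last_feed
```

Because the stub never loads a model, tests like these run under `pytest` without network access.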