feat: auto-select ONNX execution providers with CPU-safe fallback by j-sperling · Pull Request #606 · zilliztech/memsearch

j-sperling · 2026-07-05T06:23:07Z

Summary

OnnxEmbedding builds its InferenceSession without a providers argument, pinning inference to CPU even on CUDA builds of onnxruntime. This PR selects providers explicitly: CUDA is auto-selected when available, CPU is always appended as the final fallback, and session creation is retried CPU-only if the accelerator session fails to build.
Selection is a deliberate allowlist rather than get_available_providers() order: the stock macOS wheel exposes AzureExecutionProvider (remote), which should never be picked implicitly for a local zero-config embedding path.
CoreML is opt-in only (providers= argument), based on measurement rather than assumption: on the default gpahal/bge-m3-onnx-int8 export, CoreML places only 1490/2384 graph nodes (62.5%) across 220 partitions, and the resulting CPU<->CoreML copy overhead made the run ~60x slower than plain CPU (407.7s vs 6.6s for 64 texts, Apple M-series). This PR is opened as a draft in part to surface this trade-off for discussion.
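The selection rules above can be sketched as a pure function. This is a minimal illustration, not the code in this PR: the names `select_providers` and `AUTO_ALLOWLIST` are hypothetical, and `available` stands in for whatever `onnxruntime.get_available_providers()` returns on the host.

```python
from __future__ import annotations

from typing import Optional, Sequence

CPU = "CPUExecutionProvider"

# Hypothetical allowlist: the only providers that may be picked without an
# explicit request. CoreML and Azure are deliberately absent.
AUTO_ALLOWLIST = ("CUDAExecutionProvider",)


def select_providers(
    available: Sequence[str],
    requested: Optional[Sequence[str]] = None,
) -> list[str]:
    """Return an ordered provider list that always ends with CPU."""
    if requested is None:
        # Auto mode: walk the allowlist instead of trusting the runtime's order.
        chosen = [p for p in AUTO_ALLOWLIST if p in available]
    else:
        # Explicit mode: keep the caller's order, drop unavailable providers.
        chosen = [p for p in requested if p in available and p != CPU]
    # Remove duplicates while preserving order, then append the CPU fallback.
    chosen = list(dict.fromkeys(chosen))
    return chosen + [CPU]
```

With a provider list like the one the summary describes for the stock macOS wheel, auto mode returns CPU only, while an explicit CoreML request is kept and still ends with CPU.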

Test plan

tests/test_embeddings_onnx_providers.py: pure-function coverage for auto-selection (CUDA picked, CoreML/Azure not auto-picked, CPU-only runtime, explicit-request filtering, CPU always appended)
Full suite: 244 passed, 7 skipped
ruff check / ruff format --check clean
Live session on the default model builds and embeds under both CPU-only and explicit-CoreML provider lists (dim 1024)
CUDA-build timing: not yet run (no NVIDIA hardware here; expected to benefit, needs a CUDA runner)
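For whoever picks up the timing comparison on a CUDA runner, a small harness along these lines might be enough. It is a sketch under assumptions: `time_embed` and `slowdown` are hypothetical helpers, and `embed` is any callable that embeds a batch of texts (for example, a bound method of an embedding object built with a given provider list).

```python
import time
from typing import Callable, Sequence


def time_embed(
    embed: Callable[[Sequence[str]], object],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        embed(texts)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(candidate_s: float, baseline_s: float) -> float:
    """How many times slower the candidate is than the baseline."""
    return candidate_s / baseline_s
```

Applied to the figures reported in the summary, `slowdown(407.7, 6.6)` gives roughly 61.8, which matches the quoted ~60x.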

InferenceSession was constructed without a providers argument, which pins inference to CPUExecutionProvider even on CUDA builds. Select providers explicitly: CUDA when available, CPU always appended as the final fallback, and retry CPU-only if accelerator session creation fails. Selection is an allowlist, not get_available_providers() order: the stock macOS wheel exposes AzureExecutionProvider, which must never be picked implicitly for a local zero-config path. CoreML is opt-in only via the providers argument -- on the default int8 bge-m3 export CoreML places 1490/2384 nodes across 220 partitions and measured ~60x slower than plain CPU from partition copy overhead.
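The CPU-only retry described here could look roughly like the following. It is a sketch, not the PR's implementation: `create_session_with_cpu_retry` is a hypothetical name, and the session factory is injected so the logic can be exercised without onnxruntime installed.

```python
from __future__ import annotations

from typing import Callable, Sequence, TypeVar

S = TypeVar("S")
CPU = "CPUExecutionProvider"


def create_session_with_cpu_retry(
    make_session: Callable[[list[str]], S],
    providers: Sequence[str],
) -> S:
    """Try the full provider list first, then fall back to CPU only."""
    provider_list = list(providers)
    try:
        return make_session(provider_list)
    except Exception:
        if provider_list == [CPU]:
            # Already CPU-only, so a retry would fail the same way.
            raise
        return make_session([CPU])


# Assumed real-world wiring (requires onnxruntime and a model file):
#   import onnxruntime as ort
#   session = create_session_with_cpu_retry(
#       lambda p: ort.InferenceSession("model.onnx", providers=p),
#       ["CUDAExecutionProvider", CPU],
#   )
```

Catching a broad `Exception` is a simplification in this sketch; narrowing it to the errors onnxruntime actually raises at session creation would be a reasonable refinement.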

Both fallback paths were silent: a requested-but-unavailable provider was filtered out, and a failed accelerator session quietly retried CPU-only. A user who thinks they are on CUDA could be on CPU with no signal.
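One way to make both paths visible is to log a warning at the point where a provider is dropped or the retry happens. The sketch below assumes the standard `logging` module; the logger name and both helper functions are hypothetical and do not come from the PR's diff.

```python
from __future__ import annotations

import logging
from typing import Sequence

logger = logging.getLogger("onnx_providers_sketch")


def filter_requested(requested: Sequence[str], available: Sequence[str]) -> list[str]:
    """Keep requested providers that exist, warning about any that do not."""
    kept = [p for p in requested if p in available]
    dropped = [p for p in requested if p not in available]
    if dropped:
        logger.warning(
            "Requested ONNX providers not available, ignoring: %s (available: %s)",
            ", ".join(dropped),
            ", ".join(available),
        )
    return kept


def warn_cpu_fallback(providers: Sequence[str], error: Exception) -> None:
    """Report that accelerator session creation failed and CPU is being used."""
    logger.warning(
        "Session creation failed with providers %s (%s); retrying CPU-only",
        list(providers),
        error,
    )
```

Using warnings rather than exceptions keeps the zero-config path working while still telling a user who expected CUDA that inference is running on CPU.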

j-sperling added 2 commits July 4, 2026 23:23

Warn on dropped requested providers and CPU-retry fallback

2872158

j-sperling wants to merge 2 commits into zilliztech:main from j-sperling:feat/onnx-execution-providers
