Skip to content

feat: auto-select ONNX execution providers with CPU-safe fallback#606

Draft
j-sperling wants to merge 2 commits into
zilliztech:mainfrom
j-sperling:feat/onnx-execution-providers
Draft

feat: auto-select ONNX execution providers with CPU-safe fallback#606
j-sperling wants to merge 2 commits into
zilliztech:mainfrom
j-sperling:feat/onnx-execution-providers

Conversation

@j-sperling

Copy link
Copy Markdown
Contributor

Summary

  • OnnxEmbedding builds its InferenceSession without a providers argument, pinning inference to CPU even on CUDA builds of onnxruntime. This selects providers explicitly: CUDA auto-selected when available, CPU always appended as the final fallback, plus a CPU-only retry if accelerator session creation fails.
  • Selection is a deliberate allowlist rather than get_available_providers() order: the stock macOS wheel exposes AzureExecutionProvider (remote), which should never be picked implicitly for a local zero-config embedding path.
  • CoreML is opt-in only (providers= argument), based on measurement rather than assumption: on the default gpahal/bge-m3-onnx-int8 export, CoreML places only 1490/2384 graph nodes across 220 partitions, and the CPU<->CoreML copy overhead measured ~60x slower than plain CPU (407.7s vs 6.6s for 64 texts, Apple M-series). Draft in part to surface this trade-off for discussion.

Test plan

  • tests/test_embeddings_onnx_providers.py: pure-function coverage for auto-selection (CUDA picked, CoreML/Azure not auto-picked, CPU-only runtime, explicit-request filtering, CPU always appended)
  • Full suite: 244 passed, 7 skipped
  • ruff check / ruff format --check clean
  • Live session on the default model builds and embeds under both CPU-only and explicit-CoreML provider lists (dim 1024)
  • CUDA-build timing (no NVIDIA hardware here; expected to benefit, needs a CUDA runner)

InferenceSession was constructed without a providers argument, which
pins inference to CPUExecutionProvider even on CUDA builds. Select
providers explicitly: CUDA when available, CPU always appended as the
final fallback, and retry CPU-only if accelerator session creation
fails.

Selection is an allowlist, not get_available_providers() order: the
stock macOS wheel exposes AzureExecutionProvider, which must never be
picked implicitly for a local zero-config path. CoreML is opt-in only
via the providers argument -- on the default int8 bge-m3 export CoreML
places 1490/2384 nodes across 220 partitions and measured ~60x slower
than plain CPU from partition copy overhead.
Both fallback paths were silent: a requested-but-unavailable provider was
filtered out, and a failed accelerator session quietly retried CPU-only. A
user who thinks they are on CUDA could be on CPU with no signal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant