
perf: raise ONNX default batch size to 64#608

Draft
j-sperling wants to merge 1 commit into
zilliztech:mainfrom
j-sperling:perf/onnx-default-batch-size

Conversation

@j-sperling

Contributor

Summary

  • Raises OnnxEmbedding._DEFAULT_BATCH_SIZE from 32 to 64. Measured on the default gpahal/bge-m3-onnx-int8 (CPU, Apple M-series, 128 texts x ~230 tokens): 14.1s @ batch 16, 13.2s @ 32, 11.7s @ 64, 10.6s @ 128, so 64 takes ~11% less wall-clock time than the current default of 32. On corpus-scale indexing (observed >4.5h for 186K chunks with this model), an 11% cut would save roughly half an hour, assuming the speedup carries over.
  • 128 is deliberately not the default: the tokenizer pads to the longest text in a batch with max_length=8192, so a worst-case batch of 128 long texts materializes multi-GB activation tensors. 64 halves that worst case while capturing most of the win (11.7s vs. 10.6s at 128); users can still set embedding.batch_size = 128 in config where memory allows.
  • embedding.batch_size config continues to override; 0 still means "provider default".
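
For reviewers, the override rule in the last bullet can be sketched as below. This is an illustration only, not the code in this PR: `resolve_batch_size` and `ONNX_DEFAULT_BATCH_SIZE` are hypothetical names, and the real logic may live elsewhere.

```python
# Hypothetical sketch of the "0 means provider default" rule.
# Names here are illustrative and are not taken from the codebase.

ONNX_DEFAULT_BATCH_SIZE = 64  # value proposed in this PR (was 32)


def resolve_batch_size(configured: int,
                       provider_default: int = ONNX_DEFAULT_BATCH_SIZE) -> int:
    """Return the batch size to use for embedding.

    `configured` stands in for the value of embedding.batch_size;
    0 means "not set, use the provider default".
    """
    if configured == 0:
        return provider_default
    return configured


print(resolve_batch_size(0))    # falls back to the provider default
print(resolve_batch_size(128))  # explicit config wins
```

With this rule, users who never set embedding.batch_size pick up the new default, while anyone who pinned a value keeps it.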

Test plan

  • Full suite passes (no test pins the ONNX default batch size)
  • ruff check / ruff format --check clean
  • Timing measurements above reproduce with a 4-text warmup + time.perf_counter around embed()
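
A minimal sketch of the timing procedure described above, for anyone who wants to reproduce it. `time_embed` is a hypothetical helper, and `embed` is assumed to be any callable that takes a list of texts (for example, a provider's embed method already configured with the batch size under test); the stand-in embedder at the bottom only makes the snippet runnable without the model.

```python
import time
from typing import Callable, List, Sequence


def time_embed(embed: Callable[[List[str]], object],
               texts: Sequence[str],
               warmup: int = 4) -> float:
    """Time one embed() call over `texts` after a small warmup.

    The warmup call keeps one-time costs (model load, first-run
    allocation) out of the measured interval.
    """
    embed(list(texts[:warmup]))   # warmup, result discarded
    start = time.perf_counter()
    embed(list(texts))            # measured call
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in embedder (assumption): replace with the real provider.
    def fake_embed(batch: List[str]) -> List[List[float]]:
        return [[0.0] for _ in batch]

    sample = ["some text"] * 128
    print(f"{time_embed(fake_embed, sample):.6f}s")
```

Running it once per batch-size setting and comparing the elapsed times gives numbers comparable to the ones in the summary, on the same hardware.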

Measured on the default int8 bge-m3 model (CPU, Apple M-series,
128 texts): 13.2s at batch 32, 11.7s at 64, 10.6s at 128. 64 cuts
indexing time by ~11% (about 13% more throughput) versus the old
default; 128 is left to explicit configuration because worst-case
padded batches of 8192-token inputs materialize multi-GB activation
tensors.
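
A back-of-the-envelope check of the "multi-GB" claim, as a rough sketch. It sizes a single padded hidden-state tensor only, and assumes a hidden size of 1024 (bge-m3 is built on an XLM-RoBERTa-large style encoder) and float32 activations even in the int8 model; both are assumptions, not measurements from this PR. Real peak memory also depends on attention buffers and the ONNX Runtime allocator, so treat these as lower bounds.

```python
# Rough lower bound on worst-case activation memory for one padded batch.
# hidden_size=1024 and 4-byte (float32) activations are assumptions.

GIB = 1024 ** 3


def padded_hidden_state_bytes(batch_size: int,
                              seq_len: int = 8192,
                              hidden_size: int = 1024,
                              bytes_per_value: int = 4) -> int:
    """Bytes for one [batch, seq_len, hidden] tensor when every text
    in the batch is padded to seq_len tokens."""
    return batch_size * seq_len * hidden_size * bytes_per_value


for batch in (32, 64, 128):
    size_gib = padded_hidden_state_bytes(batch) / GIB
    print(f"batch {batch:>3}: {size_gib:.1f} GiB per hidden-state tensor")
```

Under these assumptions a single such tensor is 2 GiB at batch 64 and 4 GiB at batch 128, which is consistent with leaving 128 as an opt-in setting.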