A single-binary C++ implementation of Qwen3-TTS running on top of ONNX Runtime.
```bash
leaxer-qwen3-tts -m <model_dir> -p "Hello world" -o output.wav

# With language hint
leaxer-qwen3-tts -m onnx_kv_06b -p "你好世界" --lang zh -o chinese.wav

# Sampling controls
leaxer-qwen3-tts -m onnx_kv_06b -p "Hello" --temp 0.7 --top-k 30 --top-p 0.9
```

| Flag | Description | Default |
|---|---|---|
| `-m, --model` | ONNX model directory | required |
| `-p, --prompt` | Text to synthesize | required |
| `-o, --output` | Output WAV path | `output.wav` |
| `--lang` | Language: `auto`, `en`, `zh`, `ja`, `ko` | `auto` |
| `--temp` | Sampling temperature | 0.8 |
| `--top-k` | Top-k sampling | 50 |
| `--top-p` | Top-p (nucleus) sampling | 0.95 |
| `--max-tokens` | Max generation tokens | 2048 |
- CMake 3.14+
- C++17 compiler
- ONNX Runtime (user-provided)
```bash
# Download ONNX Runtime from https://github.com/microsoft/onnxruntime/releases
# Extract to a directory, then:
cmake -B build -DONNXRUNTIME_DIR=/path/to/onnxruntime
cmake --build build -j
./build/leaxer-qwen3-tts --help
```

On macOS, Homebrew works too:

```bash
brew install onnxruntime cmake
cmake -B build -DONNXRUNTIME_DIR=$(brew --prefix onnxruntime)
cmake --build build -j
```

GPU builds are opt-in:

```bash
# CoreML (macOS, requires ONNX Runtime with CoreML support)
cmake -B build -DONNXRUNTIME_DIR=/path/to/onnxruntime -DLEAXER_COREML=ON

# CUDA (NVIDIA)
cmake -B build -DONNXRUNTIME_DIR=/path/to/onnxruntime -DLEAXER_CUDA=ON

# ROCm (AMD)
cmake -B build -DONNXRUNTIME_DIR=/path/to/onnxruntime -DLEAXER_ROCM=ON
```

Note: GPU acceleration requires ONNX Runtime built with the corresponding execution provider. The default Microsoft releases include CoreML support on macOS.
Download ONNX models from zukky/Qwen3-TTS-ONNX-DLL:

```bash
# Clone model repo (or download manually)
git lfs install
git clone https://huggingface.co/zukky/Qwen3-TTS-ONNX-DLL models

# Use the 0.6B model
./build/leaxer-qwen3-tts -m onnx/onnx_kv_06b -p "Hello"
```

The model directory contains:

- `text_project.onnx` — text token embeddings
- `codec_embed.onnx` — codec token embeddings
- `code_predictor_embed.onnx` — sub-codec embeddings
- `talker_prefill.onnx` — transformer prefill
- `talker_decode.onnx` — transformer decode (with KV cache)
- `code_predictor.onnx` — predicts codebooks 1-15
- `tokenizer12hz_decode.onnx` — vocoder (codes → audio)

Tokenizer files are also needed, in `../models/Qwen3-TTS-12Hz-0.6B-Base/`:

- `vocab.json`
- `merges.txt`
```
Text → BPE Tokenizer → Talker (prefill/decode) → Code Predictor → Vocoder → WAV
                            ↓                          ↓
                         KV Cache               Codebooks 1-15
```
The model generates 16 audio codebooks per frame at 12 Hz, and the vocoder then upsamples the codes to 24 kHz audio.
Clone any voice from a 3-second reference audio:

```bash
leaxer-qwen3-tts -m onnx/onnx_kv_06b -p "Hello world" --ref voice_sample.wav -o output.wav
```

The reference audio should be:

- WAV format (8/16/24/32-bit PCM or float)
- at least 3 seconds of clear speech

It is resampled to 24 kHz internally.
| Feature | Requires | Status |
|---|---|---|
| Voice clone (`--ref`) | 0.6B-Base | Done |
| GPU acceleration | ONNX Runtime + CoreML/CUDA | Done |
| Preset speakers (`--speaker`) | 0.6B-CustomVoice | Planned |
| Voice instructions (`--instruct`) | 1.7B-VoiceDesign | Planned |
| Single static binary | Static ONNX Runtime | Planned |
- Qwen3-TTS by Alibaba
- zukky/Qwen3-TTS-ONNX-DLL for the ONNX exports; big thanks to Mr. Daishi Suzuki (zukky)!
Apache 2.0 — see LICENSE