Efficient collaborative decoding between edge-deployed small models and cloud-based large language models.
This project implements Token-Level Routing, a collaborative inference system in which a small on-device model performs most of the decoding and selectively routes critical tokens to a powerful cloud-based LLM.
This approach significantly reduces latency and cost while retaining output quality — ideal for edge scenarios like mobile phones or IoT devices.
- ⚡ Efficient: >60% accuracy boost by routing only ~7% of tokens to the LLM.
- 🌐 Edge-Cloud Collaboration: Combines local lightweight models with cloud intelligence.
- 🧭 Token-Level Routing: Fine-grained, confidence-driven token control.
- 📱 Deployable: Lightweight ONNX runtime works on laptops and mobile devices.
- 🖥️ LLM Backend: Compatible with [SGLang] for LLM serving and kv-cache extension.
```
+------------+           +------------+           +--------+
| User Input |--Prompt-->| SLM (ONNX) |--Tokens-->| Router |
+------------+           +------------+           +--------+
                                                      |
                                      tokens with low confidence
                                                      |
                                                      v
                                           +-------------------+
                                           | LLM (server-side) |
                                           +-------------------+
```
See Guideline.md for setup and usage instructions.
- ✅ macOS (Apple M1/M2/M3) is already supported
- 🔧 Android support is under development!
For questions or collaborations, feel free to open an issue or email us.

