🔍 Issue Description
Add a native Rust-based `CandleProvider` implementing the `LLMProvider` trait to enable fully local transformer inference without HTTP or external runtime dependencies.
📌 Issue Type
Feature Request
📝 Description
MoFA already defines a canonical `LLMProvider` trait that allows different inference backends to integrate with the agent orchestration system. Currently supported providers typically rely on:
- External APIs (e.g., OpenAI)
- Local HTTP servers (e.g., Ollama, LM Studio, vLLM)
While these approaches work well, they introduce an additional network/runtime layer between the MoFA kernel and the model runtime.
What is happening?
Most LLM providers currently operate through:
- Remote API calls
- Local HTTP-based inference servers
This introduces extra latency and additional infrastructure dependencies.
What should happen instead?
Introduce a native Rust provider based on the Candle ML runtime that directly implements the `LLMProvider` trait and performs inference inside the MoFA runtime process.
Why is this needed?
A native Candle backend would provide:
- Fully local inference without HTTP hops
- Lower latency due to direct function calls
- Improved deployment flexibility for offline environments
- A Rust-native model runtime aligned with MoFA's architecture
This would also complement existing providers rather than replace them.
🎯 Proposed Solution (Optional but Encouraged)
High-level approach
Implement a new `CandleProvider` that conforms to the existing `LLMProvider` trait and uses the Candle runtime for local transformer inference.
The provider would:
- Load a local transformer model (e.g., LLaMA, Mistral, TinyLlama) using Candle.
- Convert `ChatCompletionRequest` into a prompt format.
- Tokenize the prompt.
- Run a generation loop using the Candle model.
- Decode tokens and construct a `ChatCompletionResponse`.
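The pipeline above can be sketched in plain Rust. This is a minimal, hypothetical illustration: the `LLMProvider`-style `chat()` method, the request/response types, and the stub tokenizer and model are illustrative stand-ins, not MoFA's or Candle's actual APIs; a real implementation would swap in Candle's model forward pass and a HuggingFace tokenizer.

```rust
// Hypothetical sketch of the CandleProvider flow: prompt construction,
// tokenization, an autoregressive generation loop, and decoding.
// StubTokenizer and StubModel stand in for Candle's real types.

struct ChatCompletionRequest {
    messages: Vec<(String, String)>, // (role, content)
}

struct ChatCompletionResponse {
    content: String,
}

// Stub tokenizer: maps whitespace-separated words to toy ids and back.
struct StubTokenizer;
impl StubTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace().map(|w| w.len() as u32).collect()
    }
    fn decode(&self, ids: &[u32]) -> String {
        ids.iter()
            .map(|id| format!("tok{id}"))
            .collect::<Vec<_>>()
            .join(" ")
    }
}

// Stub model: emits a deterministic "next token" until a stop condition,
// standing in for a Candle transformer forward pass plus sampling.
struct StubModel;
impl StubModel {
    fn next_token(&self, context: &[u32]) -> Option<u32> {
        if context.len() >= 8 { None } else { Some(context.len() as u32) }
    }
}

struct CandleProvider {
    model: StubModel,
    tokenizer: StubTokenizer,
}

impl CandleProvider {
    fn chat(&self, req: &ChatCompletionRequest) -> ChatCompletionResponse {
        // 1. Flatten chat messages into a single prompt string.
        let prompt = req
            .messages
            .iter()
            .map(|(role, content)| format!("{role}: {content}"))
            .collect::<Vec<_>>()
            .join("\n");
        // 2. Tokenize the prompt.
        let mut tokens = self.tokenizer.encode(&prompt);
        let prompt_len = tokens.len();
        // 3. Generation loop: feed the context, append each next token.
        while let Some(tok) = self.model.next_token(&tokens) {
            tokens.push(tok);
        }
        // 4. Decode only the newly generated tokens into the response.
        ChatCompletionResponse {
            content: self.tokenizer.decode(&tokens[prompt_len..]),
        }
    }
}

fn main() {
    let provider = CandleProvider { model: StubModel, tokenizer: StubTokenizer };
    let req = ChatCompletionRequest {
        messages: vec![("user".into(), "hello world".into())],
    };
    let resp = provider.chat(&req);
    println!("{}", resp.content);
}
```

The key structural point is that the whole loop runs as direct function calls inside one process, with no serialization or HTTP hop between request mapping, tokenization, and generation.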
Initial scope would focus on minimal functionality:
- `chat()` support for text generation
- basic model loading from a local path
- CPU/GPU inference via Candle
Streaming, embeddings, and tool-calling capabilities could be added incrementally later.
Relevant modules/files
Areas likely involved:
- `LLMProvider` trait implementation
- provider registry or initialization logic
- request/response type mapping
- model loading utilities
Potential edge cases
- Large model memory usage
- tokenizer/model compatibility
- generation parameter mapping
- model loading errors
- device availability (CPU vs GPU)
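For the device-availability edge case, one reasonable policy is to probe for a GPU at load time and fall back to CPU rather than failing outright. The sketch below uses a stub `Device` enum and probe function; with Candle itself, `candle_core::Device::cuda_if_available(0)` plays this role.

```rust
// Hypothetical device-selection sketch with CPU fallback.
// The enum and probe are stand-ins for Candle's Device API.

#[derive(Debug, PartialEq)]
enum Device {
    Cpu,
    Cuda(usize),
}

// Stand-in for a runtime GPU probe; real code would query the CUDA driver.
fn cuda_available(_ordinal: usize) -> bool {
    false // assume no GPU in this sketch
}

fn pick_device(preferred_gpu: usize) -> Device {
    if cuda_available(preferred_gpu) {
        Device::Cuda(preferred_gpu)
    } else {
        // Fall back to CPU instead of aborting model load.
        Device::Cpu
    }
}

fn main() {
    let device = pick_device(0);
    println!("selected device: {device:?}");
}
```

Surfacing the selected device in logs (rather than silently falling back) would make the large-model-on-CPU case easier to diagnose.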
📎 Additional Context
This idea aligns with the roadmap goal of adding more LLM provider integrations and complements ongoing discussions around enabling local-first inference backends.
Relevant technologies:
- Candle ML framework (Rust)
- HuggingFace safetensors models
- Local transformer inference
A native provider would coexist with existing API and HTTP-based providers.
🙋 Claiming This Issue
To avoid duplicated work:
🔔 Important
I’d like to work on this.