
Feature: Add Native Candle-based LLMProvider for Fully Local Transformer Inference #1166

@SanchitKS12

Description

🔍 Issue Description

Add a native Rust-based CandleProvider implementing the LLMProvider trait to enable fully local transformer inference without HTTP or external runtime dependencies.


📌 Issue Type

  • [ ] Bug
  • [x] Feature Request
  • [ ] Enhancement
  • [ ] Documentation
  • [ ] Refactor
  • [ ] Other (please specify)

📝 Description

MoFA already defines a canonical LLMProvider trait that allows different inference backends to integrate with the agent orchestration system. Currently supported providers typically rely on:

  • External APIs (e.g., OpenAI)
  • Local HTTP servers (e.g., Ollama, LM Studio, vLLM)

While these approaches work well, they introduce an additional network/runtime layer between the MoFA kernel and the model runtime.

What is happening?

Most LLM providers currently operate through:

  • Remote API calls
  • Local HTTP-based inference servers

Both paths add latency and extra infrastructure dependencies.

What should happen instead?

Introduce a native Rust provider based on the Candle ML runtime that directly implements the LLMProvider trait and performs inference inside the MoFA runtime process.

Why is this needed?

A native Candle backend would provide:

  • Fully local inference without HTTP hops
  • Lower latency due to direct function calls
  • Improved deployment flexibility for offline environments
  • A Rust-native model runtime aligned with MoFA's architecture

This would also complement existing providers rather than replace them.


🎯 Proposed Solution (Optional but Encouraged)

High-level approach

Implement a new CandleProvider that conforms to the existing LLMProvider trait and uses the Candle runtime for local transformer inference.

The provider would:

  1. Load a local transformer model (e.g., LLaMA, Mistral, TinyLlama) using Candle.
  2. Convert ChatCompletionRequest into a prompt format.
  3. Tokenize the prompt.
  4. Run a generation loop using the Candle model.
  5. Decode tokens and construct a ChatCompletionResponse.
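The pipeline above (steps 2 through 5) can be sketched in Rust. Everything here is illustrative: the type names (`ChatCompletionRequest`, `ChatCompletionResponse`) follow the issue's wording but the real signatures live in MoFA, and the tokenizer and model are stubs standing in for Candle components.

```rust
// Illustrative sketch of the CandleProvider generation pipeline.
// Types mirror the issue text; the tokenizer/model are stubs in
// place of real Candle components.

struct ChatMessage { role: String, content: String }
struct ChatCompletionRequest { messages: Vec<ChatMessage>, max_tokens: usize }
struct ChatCompletionResponse { content: String }

const EOS: u32 = 0; // stand-in end-of-sequence token id

// Step 2: flatten chat messages into a single prompt string.
// Real chat templates are model-specific; this tag format is made up.
fn format_prompt(req: &ChatCompletionRequest) -> String {
    let mut prompt = String::new();
    for m in &req.messages {
        prompt.push_str(&format!("<|{}|>\n{}\n", m.role, m.content));
    }
    prompt.push_str("<|assistant|>\n");
    prompt
}

// Stub for "model forward pass + sampling"; a real implementation
// would run the Candle model each iteration and pick the next token.
fn next_token(_tokens: &[u32]) -> u32 { EOS }

// Steps 3-5: tokenize, run a greedy generation loop, decode.
fn generate(prompt: &str, max_tokens: usize) -> String {
    let mut tokens: Vec<u32> = prompt.bytes().map(u32::from).collect(); // stub tokenizer
    for _ in 0..max_tokens {
        let next = next_token(&tokens);
        if next == EOS { break; }
        tokens.push(next);
    }
    tokens.iter().map(|&t| t as u8 as char).collect() // stub detokenizer
}

// The chat() entry point tying the steps together.
fn chat(req: &ChatCompletionRequest) -> ChatCompletionResponse {
    let prompt = format_prompt(req);              // step 2
    let text = generate(&prompt, req.max_tokens); // steps 3-5
    ChatCompletionResponse { content: text }
}
```

The point of the sketch is the control flow: request in, prompt built, token loop, response out, all inside the MoFA process with no HTTP hop.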

Initial scope would focus on minimal functionality:

  • chat() support for text generation
  • basic model loading from a local path
  • CPU/GPU inference via Candle

Streaming, embeddings, and tool-calling capabilities could be added incrementally.

Relevant modules/files

Areas likely involved:

  • LLMProvider trait implementation
  • provider registry or initialization logic
  • request/response type mapping
  • model loading utilities
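To make the registry point concrete, here is a minimal sketch of how a CandleProvider might be registered alongside other backends. The trait shape and registry type are assumptions for illustration; MoFA's actual LLMProvider trait and initialization logic will differ.

```rust
use std::collections::HashMap;

// Hypothetical minimal trait; MoFA's real LLMProvider trait is richer.
trait LLMProvider {
    fn name(&self) -> &str;
    fn chat(&self, prompt: &str) -> String;
}

struct CandleProvider { model_path: String }

impl LLMProvider for CandleProvider {
    fn name(&self) -> &str { "candle" }
    fn chat(&self, prompt: &str) -> String {
        // Real implementation: run the Candle model loaded from `model_path`.
        format!("[candle:{}] {}", self.model_path, prompt)
    }
}

// Illustrative registry: providers looked up by name at dispatch time.
struct ProviderRegistry { providers: HashMap<String, Box<dyn LLMProvider>> }

impl ProviderRegistry {
    fn new() -> Self { Self { providers: HashMap::new() } }
    fn register(&mut self, p: Box<dyn LLMProvider>) {
        self.providers.insert(p.name().to_string(), p);
    }
    fn get(&self, name: &str) -> Option<&dyn LLMProvider> {
        self.providers.get(name).map(|b| b.as_ref())
    }
}
```

Because the provider is just another trait object in the registry, it coexists with API- and HTTP-based providers without touching their code paths.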

Potential edge cases

  • Large model memory usage
  • tokenizer/model compatibility
  • generation parameter mapping
  • model loading errors
  • device availability (CPU vs GPU)
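The edge cases above suggest a dedicated error type so callers can distinguish, say, a missing GPU from a corrupt model file. The variant names below are hypothetical, not MoFA's actual API:

```rust
use std::fmt;

// Hypothetical error type covering the edge cases listed above.
#[derive(Debug)]
enum CandleProviderError {
    ModelLoad(String),          // bad path, corrupt safetensors, ...
    TokenizerMismatch(String),  // tokenizer/model vocabulary disagreement
    DeviceUnavailable(String),  // requested GPU not present, fall back to CPU?
    OutOfMemory { needed_bytes: u64 },
    InvalidParameter(String),   // e.g. generation parameter out of range
}

impl fmt::Display for CandleProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ModelLoad(m) => write!(f, "model load failed: {m}"),
            Self::TokenizerMismatch(m) => write!(f, "tokenizer mismatch: {m}"),
            Self::DeviceUnavailable(d) => write!(f, "device unavailable: {d}"),
            Self::OutOfMemory { needed_bytes } =>
                write!(f, "out of memory: needed {needed_bytes} bytes"),
            Self::InvalidParameter(m) => write!(f, "invalid parameter: {m}"),
        }
    }
}

impl std::error::Error for CandleProviderError {}
```

Structured variants like these let the provider registry decide per-case whether to retry on CPU, surface the error to the agent, or fail fast at load time.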

📎 Additional Context

This idea aligns with the roadmap goal of adding more LLM provider integrations and complements ongoing discussions around enabling local-first inference backends.

Relevant technologies:

  • Candle ML framework (Rust)
  • HuggingFace safetensors models
  • Local transformer inference

A native provider would coexist with existing API and HTTP-based providers.


🙋 Claiming This Issue

To avoid duplicated work:

  • I'm willing to solve this issue by myself and would like to work on it.

Labels

  • area/models: Model selection, behavior, responses
  • kind/feature: New capability
  • priority/p2: Medium priority
  • rust: Pull requests that update Rust code
