
Feature: Add Native Candle-based LLMProvider for Fully Local Transformer Inference #1166

@SanchitKS12

Description

🔍 Issue Description

Add a native Rust-based CandleProvider implementing the LLMProvider trait to enable fully local transformer inference without HTTP or external runtime dependencies.


📌 Issue Type

  • [ ] Bug
  • [x] Feature Request
  • [ ] Enhancement
  • [ ] Documentation
  • [ ] Refactor
  • [ ] Other (please specify)

📝 Description

MoFA already defines a canonical LLMProvider trait that allows different inference backends to integrate with the agent orchestration system. Currently supported providers typically rely on:

  • External APIs (e.g., OpenAI)
  • Local HTTP servers (e.g., Ollama, LM Studio, vLLM)

While these approaches work well, they introduce an additional network/runtime layer between the MoFA kernel and the model runtime.

What is happening?

Most LLM providers currently operate through:

  • Remote API calls
  • Local HTTP-based inference servers

Both paths add latency and extra infrastructure dependencies.

What should happen instead?

Introduce a native Rust provider based on the Candle ML runtime that directly implements the LLMProvider trait and performs inference inside the MoFA runtime process.

Why is this needed?

A native Candle backend would provide:

  • Fully local inference without HTTP hops
  • Lower latency due to direct function calls
  • Improved deployment flexibility for offline environments
  • A Rust-native model runtime aligned with MoFA's architecture

This would also complement existing providers rather than replace them.


🎯 Proposed Solution (Optional but Encouraged)

High-level approach

Implement a new CandleProvider that conforms to the existing LLMProvider trait and uses the Candle runtime for local transformer inference.

The provider would:

  1. Load a local transformer model (e.g., LLaMA, Mistral, TinyLlama) using Candle.
  2. Convert ChatCompletionRequest into a prompt format.
  3. Tokenize the prompt.
  4. Run a generation loop using the Candle model.
  5. Decode tokens and construct a ChatCompletionResponse.
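The pipeline above (steps 2 through 5) can be sketched in Rust. Everything here is illustrative: the type names (`ChatCompletionRequest`, `ChatCompletionResponse`) follow the issue's wording but the real signatures live in MoFA, and the tokenizer and model are stubs standing in for Candle components.

```rust
// Illustrative sketch of the CandleProvider generation pipeline.
// Types mirror the issue text; the tokenizer/model are stubs in
// place of real Candle components.

struct ChatMessage { role: String, content: String }
struct ChatCompletionRequest { messages: Vec<ChatMessage>, max_tokens: usize }
struct ChatCompletionResponse { content: String }

const EOS: u32 = 0; // stand-in end-of-sequence token id

// Step 2: flatten chat messages into a single prompt string.
// Real chat templates are model-specific; this tag format is made up.
fn format_prompt(req: &ChatCompletionRequest) -> String {
    let mut prompt = String::new();
    for m in &req.messages {
        prompt.push_str(&format!("<|{}|>\n{}\n", m.role, m.content));
    }
    prompt.push_str("<|assistant|>\n");
    prompt
}

// Stub for "model forward pass + sampling"; a real implementation
// would run the Candle model each iteration and pick the next token.
fn next_token(_tokens: &[u32]) -> u32 { EOS }

// Steps 3-5: tokenize, run a greedy generation loop, decode.
fn generate(prompt: &str, max_tokens: usize) -> String {
    let mut tokens: Vec<u32> = prompt.bytes().map(u32::from).collect(); // stub tokenizer
    for _ in 0..max_tokens {
        let next = next_token(&tokens);
        if next == EOS { break; }
        tokens.push(next);
    }
    tokens.iter().map(|&t| t as u8 as char).collect() // stub detokenizer
}

// The chat() entry point tying the steps together.
fn chat(req: &ChatCompletionRequest) -> ChatCompletionResponse {
    let prompt = format_prompt(req);              // step 2
    let text = generate(&prompt, req.max_tokens); // steps 3-5
    ChatCompletionResponse { content: text }
}
```

The point of the sketch is the control flow: request in, prompt built, token loop, response out, all inside the MoFA process with no HTTP hop.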

Initial scope would focus on minimal functionality:

  • chat() support for text generation
  • basic model loading from a local path
  • CPU/GPU inference via Candle

Streaming, embeddings, and tool-calling capabilities could be added incrementally.

Relevant modules/files

Areas likely involved:

  • LLMProvider trait implementation
  • provider registry or initialization logic
  • request/response type mapping
  • model loading utilities
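To make the registry point concrete, here is a minimal sketch of how a CandleProvider might be registered alongside other backends. The trait shape and registry type are assumptions for illustration; MoFA's actual LLMProvider trait and initialization logic will differ.

```rust
use std::collections::HashMap;

// Hypothetical minimal trait; MoFA's real LLMProvider trait is richer.
trait LLMProvider {
    fn name(&self) -> &str;
    fn chat(&self, prompt: &str) -> String;
}

struct CandleProvider { model_path: String }

impl LLMProvider for CandleProvider {
    fn name(&self) -> &str { "candle" }
    fn chat(&self, prompt: &str) -> String {
        // Real implementation: run the Candle model loaded from `model_path`.
        format!("[candle:{}] {}", self.model_path, prompt)
    }
}

// Illustrative registry: providers looked up by name at dispatch time.
struct ProviderRegistry { providers: HashMap<String, Box<dyn LLMProvider>> }

impl ProviderRegistry {
    fn new() -> Self { Self { providers: HashMap::new() } }
    fn register(&mut self, p: Box<dyn LLMProvider>) {
        self.providers.insert(p.name().to_string(), p);
    }
    fn get(&self, name: &str) -> Option<&dyn LLMProvider> {
        self.providers.get(name).map(|b| b.as_ref())
    }
}
```

Because the provider is just another trait object in the registry, it coexists with API- and HTTP-based providers without touching their code paths.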

Potential edge cases

  • Large model memory usage
  • tokenizer/model compatibility
  • generation parameter mapping
  • model loading errors
  • device availability (CPU vs GPU)
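The edge cases above suggest a dedicated error type so callers can distinguish, say, a missing GPU from a corrupt model file. The variant names below are hypothetical, not MoFA's actual API:

```rust
use std::fmt;

// Hypothetical error type covering the edge cases listed above.
#[derive(Debug)]
enum CandleProviderError {
    ModelLoad(String),          // bad path, corrupt safetensors, ...
    TokenizerMismatch(String),  // tokenizer/model vocabulary disagreement
    DeviceUnavailable(String),  // requested GPU not present, fall back to CPU?
    OutOfMemory { needed_bytes: u64 },
    InvalidParameter(String),   // e.g. generation parameter out of range
}

impl fmt::Display for CandleProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ModelLoad(m) => write!(f, "model load failed: {m}"),
            Self::TokenizerMismatch(m) => write!(f, "tokenizer mismatch: {m}"),
            Self::DeviceUnavailable(d) => write!(f, "device unavailable: {d}"),
            Self::OutOfMemory { needed_bytes } =>
                write!(f, "out of memory: needed {needed_bytes} bytes"),
            Self::InvalidParameter(m) => write!(f, "invalid parameter: {m}"),
        }
    }
}

impl std::error::Error for CandleProviderError {}
```

Structured variants like these let the provider registry decide per-case whether to retry on CPU, surface the error to the agent, or fail fast at load time.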

📎 Additional Context

This idea aligns with the roadmap goal of adding more LLM provider integrations and complements ongoing discussions around enabling local-first inference backends.

Relevant technologies:

  • Candle ML framework (Rust)
  • HuggingFace safetensors models
  • Local transformer inference

A native provider would coexist with existing API and HTTP-based providers.


🙋 Claiming This Issue

To avoid duplicated work:

  • I'm willing to solve this issue by myself and would like to work on it.

Labels

  • area/models: Model selection, behavior, responses
  • kind/feature: New capability
  • priority/p2: Medium priority
  • rust: Pull requests that update Rust code
