feat: add on-device ASR (speech-to-text) and TTS (text-to-speech)

## Summary

Add fully on-device **ASR** (speech-to-text input) and **TTS** (text-to-speech output) to the app, preserving the 100%-private mandate (no cloud STT/TTS). Both were removed in `b859fc2` ("remove transcription + audio"); this issue tracks a redesigned, on-device implementation built on the existing `llama.rn` stack.

## Current state

- STT previously used `whisper.rn` and was removed in `b859fc2`; no TTS has ever existed.
- `expo-audio@~55.0.8` is a dependency and **already exposes recording** (`useAudioRecorder`) — only playback is wired today.
- `llama.rn@^0.11.4` already exports the required primitives: mtmd `initMultimodal()`/`getMultimodalSupport()` + `input_audio` for ASR, and `initVocoder()`/`getFormattedAudioCompletion()`/`decodeAudioTokens()` for TTS.
- The `attachments` schema already supports `type: "audio"` + a `duration` field.
- The download pipeline already handles main + mmproj multi-file bundles.

## Recommended approach — two specialized models (no single model does both on stock `llama.rn`)

No model that stock `llama.rn` supports performs **both** ASR and TTS: Qwen2.5-Omni accepts audio input, but its speech-output side isn't supported upstream, and unified audio models such as LFM2.5-Audio depend on an unmerged llama.cpp PR. The cleanest path is two models, both first-class in `llama.rn` and requiring **no forks**:

| Capability | `llama.rn` path | Candidate models (validate exact repo/quant before integration) |
|---|---|---|
| **ASR** | mtmd `input_audio` via `initMultimodal()` | `Mungert/Qwen2.5-Omni-3B-GGUF` / `unsloth/Qwen2.5-Omni-3B-GGUF` (main + `mmproj`), or Ultravox v0.5, or Qwen2-Audio |
| **TTS** | `initVocoder()` → `getFormattedAudioCompletion()` → `decodeAudioTokens()` → PCM → `expo-audio` | `OuteAI/OuteTTS-0.3-1B-GGUF` + its WavTokenizer vocoder GGUF (matches `rn-tts.cpp`'s OuteTTS V0.3 handling) |

**ASR alternative (zero context contention):** re-add `whisper.rn` (the app's original STT binding). It runs in its own native context and never competes with the chat `LlamaContext`, at the cost of a second native dependency.

## Tasks

**Cross-cutting**
- [ ] Extend `MultimodalCapabilities` (or add an `AudioProvider`) in `src/ai-providers/`.
- [ ] Extend the download pipeline (main + mmproj) to **N-file** bundles (add vocoder + optional TTS speaker/tokenizer).
- [ ] Add `vocoderFilename/Url/Size` (+ optional `ttsSpeakerFile*`) columns to `hfModels` + a migration.
- [ ] Decide context strategy under the single-`LlamaContext` constraint: on-demand swap (release chat → load audio) vs. `whisper.rn` for ASR. Respect `src/memory/budget.ts`.
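
One possible shape for the N-file bundle, as a rough sketch only: `BundleRole`, `BundleFile`, `ModelBundle`, `totalBundleBytes`, and `missingRoles` are hypothetical names, not existing code in the download pipeline, and the real `hfModels` columns may end up looking different. The idea is that each file carries a role, so the pipeline can sum sizes for a memory/storage check and report which required files a bundle lacks.

```typescript
// Hypothetical types: not the app's actual schema or pipeline API.
type BundleRole = "main" | "mmproj" | "vocoder" | "ttsSpeaker";

interface BundleFile {
  role: BundleRole;
  filename: string;
  url: string;
  sizeBytes: number;
}

interface ModelBundle {
  id: string;
  files: BundleFile[];
}

// Total download size across every file in the bundle.
function totalBundleBytes(bundle: ModelBundle): number {
  return bundle.files.reduce((sum, file) => sum + file.sizeBytes, 0);
}

// Roles that are required for a capability but absent from the bundle.
function missingRoles(bundle: ModelBundle, required: BundleRole[]): BundleRole[] {
  const present = new Set(bundle.files.map((file) => file.role));
  return required.filter((role) => !present.has(role));
}

// Placeholder filenames, URLs, and sizes for illustration only.
const exampleTtsBundle: ModelBundle = {
  id: "example-tts",
  files: [
    { role: "main", filename: "tts-main.gguf", url: "https://example.invalid/tts-main.gguf", sizeBytes: 1000 },
    { role: "vocoder", filename: "vocoder.gguf", url: "https://example.invalid/vocoder.gguf", sizeBytes: 250 },
  ],
};
```

With this shape, a TTS bundle would require `["main", "vocoder"]` and an ASR bundle `["main", "mmproj"]`, so the existing main + mmproj case becomes one instance of the general one.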

**ASR**
- [ ] Re-add a `useAudioRecorder` hook on `expo-audio` (mic permission flow).
- [ ] Load an audio mmproj via `initMultimodal()`; feed WAV/MP3 as `input_audio` in `completion()`; stream transcription.
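
The `input_audio` step could be sketched roughly as below. This only builds the message payload; it does not call `llama.rn`. The types are declared locally, `buildTranscriptionMessages` is a hypothetical helper, and the field names inside `input_audio` (`format`, `data`) follow the OpenAI-style content-part convention, which is an assumption to verify against the installed `llama.rn` typings before use.

```typescript
// Locally declared types (assumed shape; check against llama.rn's own typings).
type AudioFormat = "wav" | "mp3";

type ContentPart =
  | { type: "text"; text: string }
  | { type: "input_audio"; input_audio: { format: AudioFormat; data: string } };

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: ContentPart[];
}

// Hypothetical helper: wraps a base64-encoded recording in a single user
// message that asks the model for a transcription.
function buildTranscriptionMessages(
  audioBase64: string,
  format: AudioFormat,
  instruction: string = "Transcribe the audio verbatim.",
): ChatMessage[] {
  if (audioBase64.length === 0) {
    throw new Error("audio payload is empty");
  }
  return [
    {
      role: "user",
      content: [
        { type: "text", text: instruction },
        { type: "input_audio", input_audio: { format, data: audioBase64 } },
      ],
    },
  ];
}
```

The resulting array would be passed as `messages` to `completion()` on a context where `initMultimodal()` has already loaded the audio mmproj; streaming the transcription is left to the existing token callback.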

**TTS**
- [ ] `initVocoder()` (WavTokenizer) → `getFormattedAudioCompletion()` → `decodeAudioTokens()` → play PCM via `expo-audio`.
- [ ] Validate OuteTTS 0.3 end-to-end; evaluate OuteTTS 1.0 quality once vocoder-enable ([mybigday/llama.rn#152](https://github.com/mybigday/llama.rn/issues/152)) is confirmed.
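
For the final "PCM → `expo-audio`" hop, one option is to wrap the decoded samples in a WAV container and play the resulting file. The sketch below assumes `decodeAudioTokens()` yields mono float samples in the range [-1, 1] and that the caller knows the vocoder's sample rate (24000 Hz is used in the usage note as an assumption for WavTokenizer; confirm it against the model card). `encodeWavPcm16` is a hypothetical helper, not part of `llama.rn` or `expo-audio`.

```typescript
// Hypothetical helper: encodes mono float PCM as a 16-bit little-endian WAV file.
function encodeWavPcm16(samples: ArrayLike<number>, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const headerSize = 44;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(headerSize + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // RIFF chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format: 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range samples do not wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(headerSize + i * bytesPerSample, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}
```

A call such as `encodeWavPcm16(pcm, 24000)` produces bytes that can be written to a cache file and handed to the existing `expo-audio` playback path. Writing a file keeps playback on the code path that is already wired, at the cost of one extra disk write per utterance.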

## Out of scope

- Cloud-based STT/TTS (violates the privacy mandate).
- Unified speech-in/speech-out omni models (e.g. LFM2.5-Audio, Qwen2.5-Omni Talker) until upstream `llama.cpp` support lands.

---

**Sources:**
- [llama.rn](https://github.com/mybigday/llama.rn) — mtmd audio input + OuteTTS vocoder API
- [Qwen2.5-Omni-3B-GGUF (Mungert)](https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF) / [(unsloth)](https://huggingface.co/unsloth/Qwen2.5-Omni-3B-GGUF)
- [OuteAI/Llama-OuteTTS-1.0-1B](https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B)
- [llama.cpp OuteTTS support PR #12794](https://github.com/ggml-org/llama.cpp/pull/12794)
- [llama.cpp Qwen2.5-Omni audio discussion #13949](https://github.com/ggml-org/llama.cpp/discussions/13949)
