Summary
Add fully on-device ASR (speech-to-text input) and TTS (text-to-speech output) to the app, preserving the 100%-private mandate (no cloud STT/TTS). STT was removed in b859fc2 ("remove transcription + audio") and TTS has never existed; this issue tracks a redesigned, on-device implementation built on the existing llama.rn stack.
Current state
- STT previously used whisper.rn and was removed in b859fc2; no TTS has ever existed.
- expo-audio@~55.0.8 is a dependency and already exposes recording (useAudioRecorder); only playback is wired today.
- llama.rn@^0.11.4 already exports the required primitives: mtmd initMultimodal()/getMultimodalSupport() + input_audio for ASR, and initVocoder()/getFormattedAudioCompletion()/decodeAudioTokens() for TTS.
- The attachments schema already supports type: "audio" + a duration field.
- The download pipeline already handles main + mmproj multi-file bundles.
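A minimal sketch of how the ASR primitives listed above might fit together. `AsrContext` is a hypothetical, narrowed stand-in for the llama.rn context, declared locally so the snippet is self-contained. The method names come from this issue; the option and field shapes (`path`, `audio`, `input_audio.data`, `format`, `text`) and the prompt wording are assumptions to check against the installed llama.rn typings.

```typescript
// Hypothetical, narrowed view of a llama.rn context. Method names are taken
// from the issue text; parameter and result shapes are assumptions.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "input_audio"; input_audio: { data: string; format: "wav" | "mp3" } };

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: ContentPart[];
}

interface AsrContext {
  initMultimodal(opts: { path: string }): Promise<boolean>;
  getMultimodalSupport(): Promise<{ vision: boolean; audio: boolean }>;
  completion(params: { messages: ChatMessage[] }): Promise<{ text: string }>;
}

// Build the user message that carries the recording as an input_audio part.
function buildTranscriptionMessage(
  base64Audio: string,
  format: "wav" | "mp3",
): ChatMessage {
  return {
    role: "user",
    content: [
      { type: "text", text: "Transcribe the audio verbatim." }, // illustrative prompt
      { type: "input_audio", input_audio: { data: base64Audio, format } },
    ],
  };
}

// Load the mmproj, confirm audio input is supported, then request a transcript.
async function transcribe(
  ctx: AsrContext,
  mmprojPath: string,
  base64Audio: string,
  format: "wav" | "mp3",
): Promise<string> {
  const loaded = await ctx.initMultimodal({ path: mmprojPath });
  if (!loaded) throw new Error("mmproj failed to load");

  const support = await ctx.getMultimodalSupport();
  if (!support.audio) throw new Error("loaded model does not accept audio input");

  const result = await ctx.completion({
    messages: [buildTranscriptionMessage(base64Audio, format)],
  });
  return result.text.trim();
}
```

Checking `getMultimodalSupport()` before sending audio lets the UI hide the microphone button when the selected model bundle has no audio projector, rather than failing at completion time.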
Recommended approach — two specialized models (no single model does both on stock llama.rn)
No upstream-supported llama.rn model performs both ASR and TTS (Qwen2.5-Omni understands audio-in but its speech-out side isn't supported upstream; unified audio models like LFM2.5-Audio need an unmerged llama.cpp PR). The cleanest path is two models, both first-class in llama.rn and requiring no forks:
| Capability | llama.rn path | Candidate models (validate exact repo/quant before integration) |
| --- | --- | --- |
| ASR | mtmd input_audio via initMultimodal() | Mungert/Qwen2.5-Omni-3B-GGUF / unsloth/Qwen2.5-Omni-3B-GGUF (main + mmproj), or Ultravox v0.5, or Qwen2-Audio |
| TTS | initVocoder() → getFormattedAudioCompletion() → decodeAudioTokens() → PCM → expo-audio | OuteAI/OuteTTS-0.3-1B-GGUF + its WavTokenizer vocoder GGUF (matches rn-tts.cpp's OuteTTS V0.3 handling) |
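The TTS row ends in "PCM → expo-audio", which glosses over one step: assuming the player wants a file or URI source rather than a raw sample buffer, the decoded samples need a container first. The sketch below wraps mono float samples in a 16-bit WAV. `TtsContext`, `synthesize`, and `encodeWav` are hypothetical names; the parameter and result shapes, the mono float range of [-1, 1], and the 24 kHz default are assumptions to verify against llama.rn and the chosen vocoder.

```typescript
// Hypothetical, narrowed view of a llama.rn context. Method names are taken
// from the issue text; parameter and result shapes are assumptions.
type FormattedAudio = { prompt: string; grammar?: string };

interface TtsContext {
  initVocoder(opts: { path: string }): Promise<boolean>;
  getFormattedAudioCompletion(speaker: object | null, text: string): Promise<FormattedAudio>;
  completion(params: FormattedAudio): Promise<{ audio_tokens?: number[] }>;
  decodeAudioTokens(tokens: number[]): Promise<number[]>;
}

// Wrap mono float PCM samples in a 16-bit little-endian WAV container.
function encodeWav(samples: ArrayLike<number>, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // RIFF chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling so out-of-range samples do not wrap.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}

// initVocoder -> getFormattedAudioCompletion -> completion -> decodeAudioTokens -> WAV bytes.
async function synthesize(
  ctx: TtsContext,
  vocoderPath: string,
  text: string,
  sampleRate = 24000, // assumed vocoder output rate
): Promise<Uint8Array> {
  if (!(await ctx.initVocoder({ path: vocoderPath }))) throw new Error("vocoder failed to load");
  const formatted = await ctx.getFormattedAudioCompletion(null, text);
  const result = await ctx.completion(formatted);
  const pcm = await ctx.decodeAudioTokens(result.audio_tokens ?? []);
  return encodeWav(pcm, sampleRate);
}
```

The returned bytes could be written to a cache file and handed to the player as a URI; streaming playback would need a different approach.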
ASR alternative (zero context contention): re-add whisper.rn (the app's original STT binding). It runs in its own native context and never competes with the chat LlamaContext, at the cost of a second native dependency.
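If whisper.rn is not re-added, the chat and audio models have to take turns in memory. The following is a minimal sketch of that on-demand swap, assuming each native context exposes an async `release()`. `ExclusiveContextSlot`, `Releasable`, and the loader map are hypothetical names introduced for illustration, not part of llama.rn or this codebase.

```typescript
// Anything that holds native memory and can free it.
interface Releasable {
  release(): Promise<void>;
}

// Keeps at most one context loaded; switching kinds releases the old one first.
class ExclusiveContextSlot<K extends string> {
  private readonly loaders: Record<K, () => Promise<Releasable>>;
  private current: { kind: K; ctx: Releasable } | null = null;
  private queue: Promise<unknown> = Promise.resolve();

  constructor(loaders: Record<K, () => Promise<Releasable>>) {
    this.loaders = loaders;
  }

  get loadedKind(): K | null {
    return this.current ? this.current.kind : null;
  }

  // Swaps are chained on a queue so two callers can never load concurrently.
  acquire(kind: K): Promise<Releasable> {
    const next = this.queue.then(async () => {
      const cur = this.current;
      if (cur && cur.kind === kind) return cur.ctx; // already loaded, reuse it
      if (cur) {
        await cur.ctx.release(); // free memory before loading the other model
        this.current = null;
      }
      const ctx = await this.loaders[kind]();
      this.current = { kind, ctx };
      return ctx;
    });
    // Keep the queue alive even if this swap fails.
    this.queue = next.catch(() => undefined);
    return next;
  }
}
```

Releasing before loading trades latency for peak memory: the user waits for a model reload on every chat/audio switch, but the two models never coexist in RAM. A memory-budget check could be added inside the loaders.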
Tasks
Cross-cutting
- […] MultimodalCapabilities (or add an AudioProvider) in src/ai-providers/.
- […] vocoderFilename/Url/Size (+ optional ttsSpeakerFile*) columns to hfModels + a migration.
- […] LlamaContext constraint: on-demand swap (release chat → load audio) vs. whisper.rn for ASR. Respect src/memory/budget.ts.
ASR
- […] useAudioRecorder hook on expo-audio (mic permission flow).
- […] initMultimodal(); feed WAV/MP3 as input_audio in completion(); stream transcription.
TTS
- […] initVocoder() (WavTokenizer) → getFormattedAudioCompletion() → decodeAudioTokens() → play PCM via expo-audio.
Out of scope
- Cloud-based STT/TTS (violates the privacy mandate).
- Unified speech-in/speech-out omni models (e.g. LFM2.5-Audio, Qwen2.5-Omni Talker) until upstream llama.cpp support lands.
Sources: