feat: add on-device ASR (speech-to-text) and TTS (text-to-speech) #217

Description

@Avtrkrb

Summary

Add fully on-device ASR (speech-to-text input) and TTS (text-to-speech output) to the app, preserving the 100%-private mandate (no cloud STT/TTS). Both were removed in b859fc2 ("remove transcription + audio"); this issue tracks a redesigned, on-device implementation built on the existing llama.rn stack.

Current state

  • STT previously used whisper.rn and was removed in b859fc2; no TTS has ever existed.
  • expo-audio@~55.0.8 is a dependency and already exposes recording (useAudioRecorder) — only playback is wired today.
  • llama.rn@^0.11.4 already exports the required primitives: mtmd initMultimodal()/getMultimodalSupport() + input_audio for ASR, and initVocoder()/getFormattedAudioCompletion()/decodeAudioTokens() for TTS.
  • The attachments schema already supports type: "audio" + a duration field.
  • The download pipeline already handles main + mmproj multi-file bundles.

Recommended approach — two specialized models (no single model does both on stock llama.rn)

No upstream-supported llama.rn model performs both ASR and TTS (Qwen2.5-Omni understands audio-in but its speech-out side isn't supported upstream; unified audio models like LFM2.5-Audio need an unmerged llama.cpp PR). The cleanest path is two models, both first-class in llama.rn and requiring no forks:

| Capability | llama.rn path | Candidate models (validate exact repo/quant before integration) |
| --- | --- | --- |
| ASR | mtmd input_audio via initMultimodal() | Mungert/Qwen2.5-Omni-3B-GGUF / unsloth/Qwen2.5-Omni-3B-GGUF (main + mmproj), or Ultravox v0.5, or Qwen2-Audio |
| TTS | initVocoder() → getFormattedAudioCompletion() → decodeAudioTokens() → PCM → expo-audio | OuteAI/OuteTTS-0.3-1B-GGUF + its WavTokenizer vocoder GGUF (matches rn-tts.cpp's OuteTTS V0.3 handling) |

ASR alternative (zero context contention): re-add whisper.rn (the app's original STT binding). It runs in its own native context and never competes with the chat LlamaContext, at the cost of a second native dependency.

Tasks

Cross-cutting

  • Extend MultimodalCapabilities (or add an AudioProvider) in src/ai-providers/.
  • Extend the download pipeline (main + mmproj) to N-file bundles (add vocoder + optional TTS speaker/tokenizer).
  • Add vocoderFilename/Url/Size (+ optional ttsSpeakerFile*) columns to hfModels + a migration.
  • Decide context strategy under the single-LlamaContext constraint: on-demand swap (release chat → load audio) vs. whisper.rn for ASR. Respect src/memory/budget.ts.
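
A minimal sketch of how the N-file bundle and the budget check might fit together. Only `vocoderFilename`, `vocoderUrl` and `vocoderSize` come from the proposal above; the main and mmproj column names, the `BundleFile` type, and the helper functions are hypothetical, and the real check belongs to whatever `src/memory/budget.ts` already exposes. The optional TTS speaker file is left out because its column names are not settled. Summed file size is only a rough proxy for resident memory.

```typescript
type BundleRole = "main" | "mmproj" | "vocoder";

interface BundleFile {
  role: BundleRole;
  filename: string;
  url: string;
  sizeBytes: number;
}

interface HfModelRow {
  // Hypothetical names for the existing main + mmproj columns.
  filename: string;
  url: string;
  size: number;
  mmprojFilename?: string;
  mmprojUrl?: string;
  mmprojSize?: number;
  // Columns proposed in this issue.
  vocoderFilename?: string;
  vocoderUrl?: string;
  vocoderSize?: number;
}

// Emit an optional file only when all three of its columns are populated.
function optionalFile(
  role: BundleRole,
  filename?: string,
  url?: string,
  size?: number
): BundleFile[] {
  if (filename === undefined || url === undefined || size === undefined) {
    return [];
  }
  return [{ role, filename, url, sizeBytes: size }];
}

// Expand one hfModels row into the list of files the downloader must fetch.
function listBundleFiles(row: HfModelRow): BundleFile[] {
  return [
    { role: "main", filename: row.filename, url: row.url, sizeBytes: row.size },
    ...optionalFile("mmproj", row.mmprojFilename, row.mmprojUrl, row.mmprojSize),
    ...optionalFile("vocoder", row.vocoderFilename, row.vocoderUrl, row.vocoderSize),
  ];
}

function totalBytes(files: BundleFile[]): number {
  return files.reduce((sum, file) => sum + file.sizeBytes, 0);
}

// Rough gate for the on-demand swap: does the audio bundle fit the budget
// once the chat context has been released?
function fitsBudget(files: BundleFile[], budgetBytes: number): boolean {
  return totalBytes(files) <= budgetBytes;
}
```

Keeping the downloader input as a flat list means a further optional file only adds one more `optionalFile` line rather than a new code path.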

ASR

  • Re-add a useAudioRecorder hook on expo-audio (mic permission flow).
  • Load an audio mmproj via initMultimodal(); feed WAV/MP3 as input_audio in completion(); stream transcription.
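
For the ASR request itself, the message payload could be built by a small pure helper like the one below. This is a sketch under assumptions: the `input_audio` part shape (`format` plus a `url`) is modelled on the OpenAI-style content parts that llama.rn's multimodal examples appear to use and should be checked against the installed version; the helper names and the default instruction text are hypothetical.

```typescript
type AudioFormat = "wav" | "mp3";

interface TextPart {
  type: "text";
  text: string;
}

interface InputAudioPart {
  type: "input_audio";
  // Assumed shape; verify against the llama.rn version in use.
  input_audio: { format: AudioFormat; url: string };
}

interface UserMessage {
  role: "user";
  content: Array<TextPart | InputAudioPart>;
}

// Derive the audio format from the file extension; only WAV and MP3 are
// accepted, matching the formats named in the ASR task.
function audioFormatFromPath(path: string): AudioFormat {
  const dot = path.lastIndexOf(".");
  if (dot < 0) {
    throw new Error(`No file extension in path: ${path}`);
  }
  const ext = path.slice(dot + 1).toLowerCase();
  if (ext === "wav" || ext === "mp3") {
    return ext;
  }
  throw new Error(`Unsupported audio format: ${ext}`);
}

// Build the messages array for a single transcription request.
function buildTranscriptionMessages(
  fileUri: string,
  instruction: string = "Transcribe the audio."
): UserMessage[] {
  return [
    {
      role: "user",
      content: [
        { type: "text", text: instruction },
        {
          type: "input_audio",
          input_audio: { format: audioFormatFromPath(fileUri), url: fileUri },
        },
      ],
    },
  ];
}
```

The result would be passed as `messages` to `completion()` once `initMultimodal()` has succeeded and `getMultimodalSupport()` reports audio support.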

TTS

  • initVocoder() (WavTokenizer) → getFormattedAudioCompletion() → decodeAudioTokens() → play PCM via expo-audio.
  • Validate OuteTTS 0.3 end-to-end; evaluate OuteTTS 1.0 quality once vocoder-enable (mybigday/llama.rn#152) is confirmed.
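
The last hop of the TTS pipeline (PCM → expo-audio) likely needs the decoded samples wrapped in a container the player can open. A minimal sketch, assuming `decodeAudioTokens()` yields mono floating-point samples in [-1, 1]; the function name is hypothetical, and the sample rate is left as a parameter because this issue does not state the vocoder's output rate.

```typescript
// Wrap mono float PCM in a 16-bit little-endian WAV (RIFF) container.
function encodeWavPcm16(samples: ArrayLike<number>, sampleRate: number): Uint8Array {
  const headerBytes = 44;
  const dataBytes = samples.length * 2;
  const buffer = new ArrayBuffer(headerBytes + dataBytes);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // remaining file size after this field
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk length
  view.setUint16(20, 1, true); // audio format: linear PCM
  view.setUint16(22, 1, true); // channels: mono (assumption)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * channels * 2
  view.setUint16(32, 2, true); // block align = channels * 2
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range samples saturate instead of wrapping.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(headerBytes + i * 2, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}
```

The resulting bytes could be written to a cache file and handed to the existing playback path; whether expo-audio can play from memory without a file round-trip is something to confirm during the OuteTTS 0.3 validation.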

Out of scope

  • Cloud-based STT/TTS (violates the privacy mandate).
  • Unified speech-in/speech-out omni models (e.g. LFM2.5-Audio, Qwen2.5-Omni Talker) until upstream llama.cpp support lands.
