
bug(tts): Hebrew word merging makes speech unintelligible #62

@shaypal5

Problem

Azure and Google TTS engines merge adjacent Hebrew words into single non-existent words, making most generated speech unintelligible to native speakers.

Source: Listening test, 2026-05-03 (IT + NEG clips).

Examples (from sp_it_a_0001_00)

| Expected | Rendered | Meaning lost |
| --- | --- | --- |
| "היי, חשבתי" → "hey, cha-shav-ti" | "ha-ya-cha-shav-ti" (one long word) | "hi" merged with "I thought" |
| "אגב ניסיתי" → "ah-gav nee-see-ti" | "ag-va-nee-see-tee" | "by the way" merged with "I tried" |
| "הייתי בדרך הביתה" → "ha-yee-ti ba-de-rech ha-bye-tah" | "kee-nee-tee ba-de-rech ha-bee-tah-leh" | Multiple words merged + mispronounced |

This happens throughout all generated clips, not just isolated cases.

Likely causes

  1. Missing or insufficient whitespace/pause hints in SSML. Azure he-IL may need explicit <break> tags between words to prevent liaison.
  2. Absence of nikud (vowel diacritics). Without nikud, the TTS engine has to guess word boundaries and vowelization, and often guesses incorrectly.
  3. Script generation producing unnormalized text. Stage 1b gender disambiguation may not be inserting enough structural cues.
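
To gauge how much of a script is exposed to cause 2, a quick check can list the words that carry no nikud at all. This is only a rough sketch: the names `has_nikud` and `unpointed_words` are hypothetical, and the set of code points treated as nikud (U+05B0 to U+05BC plus the shin/sin dots and qamats qatan) is an assumption about what the pipeline would count as vowel pointing.

```python
import string

# Hebrew consonants: alef (U+05D0) through tav (U+05EA).
HEBREW_LETTERS = range(0x05D0, 0x05EB)

# Assumed definition of "nikud": the point marks U+05B0..U+05BC,
# plus shin dot, sin dot, and qamats qatan.
NIKUD = set(range(0x05B0, 0x05BD)) | {0x05C1, 0x05C2, 0x05C7}


def has_hebrew(text: str) -> bool:
    """True if the text contains at least one Hebrew letter."""
    return any(ord(ch) in HEBREW_LETTERS for ch in text)


def has_nikud(text: str) -> bool:
    """True if the text contains at least one nikud mark."""
    return any(ord(ch) in NIKUD for ch in text)


def unpointed_words(text: str) -> list[str]:
    """Return the Hebrew words in `text` that carry no nikud."""
    result = []
    for token in text.split():
        # Drop ASCII punctuation stuck to the token, e.g. a trailing comma.
        word = token.strip(string.punctuation)
        if has_hebrew(word) and not has_nikud(word):
            result.append(word)
    return result


if __name__ == "__main__":
    # One of the phrases from the examples above.
    print(unpointed_words("היי, חשבתי"))
```

Running this over the Stage 1b output would show whether the text reaching the renderer is fully unpointed, which is what the examples above suggest.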

Possible mitigations

  • Insert <break time="50ms"/> between words in SSML (at the renderer level)
  • Add nikud to high-error words via the normalization lexicon (Stage 1b)
  • Investigate whether Google Chirp 3 HD handles word boundaries better than Azure
  • Test with explicit phoneme tags (<phoneme alphabet="ipa">) for problematic words
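
The first and last mitigations could be prototyped together at the renderer level. The sketch below is not the project's renderer: `build_ssml` and its parameters are hypothetical, the 50 ms default simply mirrors the value suggested above, and whether either engine honors such short breaks or IPA phoneme tags for he-IL voices still has to be confirmed by a listening test.

```python
from xml.sax.saxutils import escape, quoteattr

# Standard SSML namespace from the W3C specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, lang="he-IL", voice=None, break_ms=50, phonemes=None):
    """Wrap `text` in SSML with an explicit <break> between words.

    `phonemes` is an optional mapping from an exact whitespace-delimited
    token to an IPA string; matching tokens are wrapped in <phoneme>.
    `voice` is optional because some providers select the voice outside
    the SSML document.
    """
    phonemes = phonemes or {}
    parts = []
    for word in text.split():
        if word in phonemes:
            ph = quoteattr(phonemes[word])  # adds quotes and escapes
            parts.append(
                f'<phoneme alphabet="ipa" ph={ph}>{escape(word)}</phoneme>'
            )
        else:
            parts.append(escape(word))

    # Keep the whitespace and add the break, so both hints are present.
    brk = f'<break time="{int(break_ms)}ms"/>'
    body = f" {brk} ".join(parts)

    if voice is not None:
        body = f"<voice name={quoteattr(voice)}>{body}</voice>"

    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f"xml:lang={quoteattr(lang)}>{body}</speak>"
    )


if __name__ == "__main__":
    # One of the failing phrases from the examples above.
    print(build_ssml("אגב ניסיתי"))
```

Rendering the same clip with `break_ms` set to 0, 50, and a larger value would give a direct A/B comparison for the next listening test.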

Impact

P0 — Blocker. If the speech is unintelligible, no downstream processing (labels, augmentation, training) has value.

Metadata

Labels

  • bug: Something isn't working
  • comp: tts: TTS rendering, SSML, Azure/Google providers
  • type: fix: Bug fix
