Problem
Azure and Google TTS engines merge adjacent Hebrew words into single non-existent words, making most generated speech unintelligible to native speakers.
Source: Listening test, 2026-05-03 (IT + NEG clips).
Examples (from sp_it_a_0001_00)
| Expected | Rendered | Meaning lost |
|---|---|---|
| "היי, חשבתי" → "hey, cha-shav-ti" | "ha-ya-cha-shav-ti" (one long word) | "hi" merged with "I thought" |
| "אגב ניסיתי" → "ah-gav nee-see-ti" | "ag-va-nee-see-tee" | "by the way" merged with "I tried" |
| "הייתי בדרך הביתה" → "ha-yee-ti ba-de-rech ha-bye-tah" | "kee-nee-tee ba-de-rech ha-bee-tah-leh" | Multiple words merged + mispronounced |
This happens throughout all generated clips, not just isolated cases.
Likely causes
- Missing or insufficient whitespace/pause hints in SSML. Azure he-IL may need explicit `<break>` tags between words to prevent liaison.
- Absence of nikud (vowel diacritics). Without nikud, the TTS engine has to guess word boundaries and vowelization, and often guesses incorrectly.
- Script generation producing unnormalized text. Stage 1b gender disambiguation may not be inserting enough structural cues.
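To gauge how much of a script reaches the engine unvocalized, a small check like the one below could flag Hebrew words that carry no nikud. This is a minimal sketch, not part of the existing pipeline: the function names are hypothetical, and it assumes that any nonspacing mark in the range U+05B0 to U+05C7 counts as nikud.

```python
import unicodedata


def has_nikud(word: str) -> bool:
    """Return True if the word contains at least one Hebrew vowel point."""
    # Assumption: nikud = nonspacing marks (category Mn) in U+05B0..U+05C7.
    # The category check skips punctuation in that range, such as maqaf.
    return any(
        "\u05B0" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn"
        for ch in word
    )


def has_hebrew_letter(word: str) -> bool:
    """Return True if the word contains a Hebrew letter (alef..tav)."""
    return any("\u05D0" <= ch <= "\u05EA" for ch in word)


def words_without_nikud(text: str) -> list[str]:
    """List whitespace-separated Hebrew words that carry no nikud."""
    return [
        w for w in text.split()
        if has_hebrew_letter(w) and not has_nikud(w)
    ]
```

Words returned by `words_without_nikud` would be candidates for the normalization lexicon. Whether vocalizing them actually improves rendering still has to be confirmed by listening.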
Possible mitigations
- Insert `<break time="50ms"/>` between words in SSML (at the renderer level)
- Add nikud to high-error words via the normalization lexicon (Stage 1b)
- Investigate whether Google Chirp 3 HD handles word boundaries better than Azure
- Test with explicit phoneme tags (`<phoneme alphabet="ipa">`) for problematic words
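The first mitigation could be prototyped at the renderer level roughly as follows. This is a sketch under stated assumptions, not the current renderer: `insert_word_breaks` is a hypothetical helper, it splits on whitespace only, and it returns an SSML fragment that still has to be wrapped in whatever `<speak>` (and, for Azure, `<voice>`) envelope the renderer already emits.

```python
from xml.sax.saxutils import escape


def insert_word_breaks(text: str, pause_ms: int = 50) -> str:
    """Join whitespace-separated words with an explicit SSML <break> tag.

    Hypothetical helper: returns an SSML fragment, not a full document.
    """
    break_tag = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so the words stay valid XML character data.
    words = [escape(word) for word in text.split()]
    return f" {break_tag} ".join(words)


if __name__ == "__main__":
    # First example phrase from the listening test.
    print(insert_word_breaks("היי, חשבתי"))
```

The 50 ms default mirrors the value proposed above. Whether the engines honor a pause that short, and whether it removes the merging, is untested and needs a repeat listening test.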
Impact
P0 — Blocker. If the speech is unintelligible, no downstream processing (labels, augmentation, training) has value.