Skip to content

tts: aggregate Hebrew TTS naturalness backlog from 2026-05-06 listening test #92

@shaypal5

Description

@shaypal5

Source

Native-speaker listening test on sp_it_a_0001 (PR #90 main render and #89 approach-B render, 2026-05-06) during #89 evaluation. Both renders sounded the same — so this list is about the post-PR #90 baseline, not about anything #89 changed.

Issues flagged (out of scope for #89's cap layer)

  1. Pitch goes up in jumps, not continuously. M7 SpeakerState drift adds in 60% gap-closure increments per turn at intensity transitions; perceived as discrete step-changes rather than smooth escalation. Addressable in two ways: bound drift magnitude directly (originally enumerated as fix(tts): #87 follow-up — close residual sp_it WER gap (0.129) toward baseline (0.056) without sacrificing M15 calibration #89 approach C), or smooth drift with an EMA across multiple turns instead of a single-turn 60% closure.
  2. Robotic / emotion-less intonation. Hebrew Azure neural voices (he-IL-AvriNeural, he-IL-HilaNeural) are baseline-flat; <mstts:express-as> style tags barely move Hebrew prosody. Anger barely audible on AGG at I4/I5; VIC essentially monotone across the arc. Fundamental TTS limitation — would need either (a) a different voice family (Google Chirp 3 HD, ElevenLabs, XTTS) or (b) post-render prosody augmentation (F0 contour shaping, expressive RVC).
  3. Wrong gender inflections. Hebrew gender-marked verbs/adjectives in LLM output produced for the wrong speaker. fix(script): gender-aware bidirectional Hebrew disambiguation #69 added bidirectional disambiguation but the fix is clearly partial. Naturalness damage is high; does not directly hurt WER because the synthesized audio matches the (wrong-gender) LLM ref text.
  4. Speech too slow. Rate envelope [0.85, 1.20]. VIC at I5 floors at 0.85. Whisper-plausible WER driver — tracked separately in the issue spawned alongside this one (R: rate-floor lift).
  5. Breaks too long, intra- and inter-turn. Inter-turn pause_before_s (mixer config; default 0.3 s but cumulative on mid-sentence breaks can exceed 1 s). Intra-turn breaks possibly residual from fix(tts): insert inter-word <break> tags to prevent Hebrew word merging #70's inter-word break injection. Per CLAUDE.md "What NOT to do" the per-word <break> tags are locked by investigate(tts): Whisper WER regression on Tier A clips is NOT loudness-driven — bisect M15/#70/#71 prosody changes #83/fix(tts): #83 partial — revert inter-word breaks (low-intensity scenes only) #86 prompt rules — pause configuration would need to be addressed by mixer-side config, not SSML.
  6. Ulpan-tier phonetic errors (non-gender, non-merging). Hebrew Azure voice fundamental quality. Distinct from bug(tts): Hebrew word merging makes speech unintelligible #62 (word merging). Not addressable without changing voice family.

Disposition

  • (4) → spawned issue R.
  • (1) → revisit if a future approach C (drift bounds) is justified by separate evidence.
  • (2), (6) → blocked on alternative Hebrew TTS family. Out of scope until then.
  • (3) → consider follow-up to fix(script): gender-aware bidirectional Hebrew disambiguation #69 if naturalness damage from gender errors blocks downstream eval.
  • (5) → mixer-side investigation, separate from cap layer. Open a follow-up if a downstream user complains.

Note for future work

Native-speaker listening tests are expensive on Shay's time. Aggregate findings under one issue per session rather than fragmenting per symptom; only fork into a per-symptom issue when a specific fix is queued.

Metadata

Metadata

Assignees

No one assigned

    Labels

    comp: ttsTTS rendering, SSML, Azure/Google providersplanning

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions