tts: aggregate Hebrew TTS naturalness backlog from 2026-05-06 listening test #92

## Source
Native-speaker listening test on `sp_it_a_0001` (PR #90 main render and #89 approach-B render, 2026-05-06) during #89 evaluation. Both renders sounded the same, so this list describes the post-PR #90 baseline, not anything #89 changed.

## Issues flagged (out of scope for #89's cap layer)

1. **Pitch goes up in jumps, not continuously.** M7 `SpeakerState` drift applies a 60% gap-closure increment per turn at intensity transitions, which is perceived as discrete step changes rather than smooth escalation. Two possible fixes: bound drift magnitude directly (originally enumerated as #89 approach C), or smooth drift with an EMA across multiple turns instead of a single-turn 60% closure.
2. **Robotic / emotion-less intonation.** Hebrew Azure neural voices (`he-IL-AvriNeural`, `he-IL-HilaNeural`) are baseline-flat; `<mstts:express-as>` style tags barely move Hebrew prosody. Anger is barely audible on AGG at I4/I5; VIC is essentially monotone across the arc. This is a fundamental TTS limitation: fixing it would need either (a) a different voice family (Google Chirp 3 HD, ElevenLabs, XTTS) or (b) post-render prosody augmentation (F0 contour shaping, expressive RVC).
3. **Wrong gender inflections.** The LLM output contains Hebrew gender-marked verbs/adjectives inflected for the wrong speaker. #69 added bidirectional disambiguation, but the fix is clearly partial. Naturalness damage is high; it does **not** directly hurt WER because the synthesized audio matches the (wrong-gender) LLM ref text.
4. **Speech too slow.** The rate envelope is `[0.85, 1.20]`, and VIC at I5 sits at the 0.85 floor. Plausible driver of Whisper WER; tracked separately in the issue spawned alongside this one (R: rate-floor lift).
5. **Breaks too long, intra- and inter-turn.** Inter-turn gaps come from `pause_before_s` (mixer config, default 0.3 s), but cumulative silence once mid-sentence breaks are included can exceed 1 s. Intra-turn breaks are possibly residual from #70's inter-word break injection. Per CLAUDE.md "What NOT to do", the per-word `<break>` tags are locked by #83/#86 prompt rules, so pause configuration would need to be addressed by mixer-side config, not SSML.
6. **Ulpan-tier phonetic errors** (non-gender, non-merging). Hebrew Azure voice fundamental quality. Distinct from #62 (word merging). Not addressable without changing voice family.
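
The two options in (1) can be compared with a small sketch. It assumes drift follows a simple gap-closure recurrence, where each turn moves a fraction of the remaining distance toward a target value; `drift_path`, `alpha`, and `max_step` are hypothetical names for illustration and not the actual `SpeakerState` API. Under that assumption, a single-turn 60% closure is the same recurrence as an EMA with coefficient 0.6, so smoothing amounts to choosing a smaller coefficient, and approach C amounts to clamping the per-turn step.

```python
from typing import List, Optional


def drift_path(
    start: float,
    target: float,
    turns: int,
    alpha: float = 0.6,
    max_step: Optional[float] = None,
) -> List[float]:
    """Simulate per-turn drift toward `target` (hypothetical model).

    alpha    -- fraction of the remaining gap closed each turn
                (0.6 mirrors the 60% closure described in the issue).
    max_step -- optional bound on the per-turn change (approach C).
    """
    values = []
    current = start
    for _ in range(turns):
        step = alpha * (target - current)
        if max_step is not None:
            # Clamp the step magnitude without changing its direction.
            step = max(-max_step, min(max_step, step))
        current += step
        values.append(current)
    return values


# Arbitrary units; the numbers are illustrative, not measured values.
single_turn = drift_path(0.0, 10.0, turns=5, alpha=0.6)
smoothed = drift_path(0.0, 10.0, turns=5, alpha=0.25)
bounded = drift_path(0.0, 10.0, turns=5, alpha=0.6, max_step=2.0)
```

With these illustrative inputs, the 0.6 coefficient covers most of the gap in the first turn, while the smaller coefficient or the step bound spreads the same change over several turns. Which of the two better matches perceived smooth escalation would still need a listening check.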

## Disposition
- (4) → spawned issue R.
- (1) → revisit if a future approach C (drift bounds) is justified by separate evidence.
- (2), (6) → blocked on alternative Hebrew TTS family. Out of scope until then.
- (3) → consider follow-up to #69 if naturalness damage from gender errors blocks downstream eval.
- (5) → mixer-side investigation, separate from cap layer. Open a follow-up if a downstream user complains.
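
For the mixer-side investigation in (5), a rough way to see how inter-turn and intra-turn silence add up is sketched below. It assumes intra-turn pauses appear as SSML `<break time="..."/>` elements with values in `ms` or `s`; `total_silence_s` is a hypothetical helper, and breaks expressed with `strength` instead of `time` are ignored. It only measures silence and does not modify the locked `<break>` tags.

```python
import re

# Matches <break time="250ms"/> or <break time="0.5s"/>.
BREAK_RE = re.compile(r'<break\s+time="(\d+(?:\.\d+)?)(ms|s)"\s*/>')


def total_silence_s(ssml: str, pause_before_s: float = 0.3) -> float:
    """Sum the inter-turn pause and all timed intra-turn breaks, in seconds."""
    total = pause_before_s
    for value, unit in BREAK_RE.findall(ssml):
        seconds = float(value) / 1000.0 if unit == "ms" else float(value)
        total += seconds
    return total


# Illustrative turn; the break durations are made up for the example.
example_turn = 'word <break time="250ms"/> word <break time="0.5s"/> word'
silence = total_silence_s(example_turn)
```

Running this over rendered turns would show which turns exceed a chosen silence budget before any mixer config is changed.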

## Note for future work
Native-speaker listening tests are expensive on Shay's time. Aggregate findings under one issue per session rather than fragmenting per symptom; only fork into a per-symptom issue when a specific fix is queued.
