Source
Native-speaker listening test on sp_it_a_0001 (PR #90 main render and #89 approach-B render, 2026-05-06), run during the #89 evaluation. Both renders sounded the same, so this list describes the post-PR #90 baseline, not anything #89 changed.
Issues flagged (out of scope for #89's cap layer)
(1) Pitch goes up in jumps, not continuously. M7 SpeakerState drift is applied in 60% gap-closure increments per turn at intensity transitions, which is perceived as discrete step changes rather than smooth escalation. There are two ways to address it: bound the drift magnitude directly (originally enumerated as approach C of #89, fix(tts): #87 follow-up: close residual sp_it WER gap (0.129) toward baseline (0.056) without sacrificing M15 calibration), or smooth the drift with an EMA across multiple turns instead of a single-turn 60% closure.
(2) Robotic / emotion-less intonation. The Hebrew Azure neural voices (he-IL-AvriNeural, he-IL-HilaNeural) are flat at baseline, and <mstts:express-as> style tags barely move Hebrew prosody. Anger is barely audible on AGG at I4/I5; VIC is essentially monotone across the arc. This is a fundamental TTS limitation; fixing it would need either (a) a different voice family (Google Chirp 3 HD, ElevenLabs, XTTS) or (b) post-render prosody augmentation (F0 contour shaping, expressive RVC).
(3) Wrong gender inflections. Hebrew gender-marked verbs/adjectives in the LLM output are inflected for the wrong speaker. #69 (fix(script): gender-aware bidirectional Hebrew disambiguation) added bidirectional disambiguation, but the fix is clearly partial. Naturalness damage is high, but this does not directly hurt WER because the synthesized audio matches the (wrong-gender) LLM reference text.
(4) Speech too slow. The rate envelope is [0.85, 1.20], and VIC at I5 floors at 0.85. This is a plausible driver of Whisper WER; it is tracked separately in the issue spawned alongside this one (R: rate-floor lift).
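The two remedies proposed for the stepwise pitch drift (bounding the per-turn step, or smoothing over several turns) can be compared in a small sketch. This is not the project's SpeakerState code: the function names, the 0.0 to 10.0 transition in arbitrary units, the `max_step` of 2.0, and the smoothing factor of 0.2 are all assumptions chosen for illustration. A 60% gap closure per turn is itself an exponential update with a factor of 0.6, so "EMA smoothing" is modeled here simply as the same rule with a smaller factor.

```python
from functools import partial
from typing import Callable, List


def close_gap(current: float, target: float, fraction: float = 0.6) -> float:
    """Move `fraction` of the remaining gap toward `target` in one turn."""
    return current + fraction * (target - current)


def bounded_close_gap(current: float, target: float,
                      fraction: float = 0.6, max_step: float = 2.0) -> float:
    """Same closure rule, but the per-turn change is clamped to +/- max_step."""
    step = fraction * (target - current)
    step = max(-max_step, min(max_step, step))
    return current + step


def trajectory(update: Callable[[float, float], float],
               start: float, target: float, turns: int) -> List[float]:
    """Apply `update` once per turn and record every intermediate value."""
    values = [start]
    for _ in range(turns):
        values.append(update(values[-1], target))
    return values


def largest_jump(values: List[float]) -> float:
    """Largest turn-to-turn change, a rough proxy for an audible step."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


# Hypothetical intensity transition: offset 0.0 -> 10.0 over five turns.
baseline = trajectory(close_gap, 0.0, 10.0, 5)                          # 60% closure
bounded = trajectory(bounded_close_gap, 0.0, 10.0, 5)                   # clamped step
smoothed = trajectory(partial(close_gap, fraction=0.2), 0.0, 10.0, 5)   # smaller factor

for name, values in [("60% closure", baseline),
                     ("bounded", bounded),
                     ("factor 0.2", smoothed)]:
    print(f"{name:12s} largest jump = {largest_jump(values):.2f}, "
          f"final = {values[-1]:.2f}")
```

Both alternatives shrink the largest single-turn jump relative to the 60% rule, at the cost of reaching the target more slowly; which trade-off sounds better would still need a listening test.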
Disposition
(5) → mixer-side investigation, separate from the cap layer. Open a follow-up if a downstream user complains.
(5) Pauses: pause_before_s (mixer config; default 0.3 s, but cumulative on mid-sentence breaks can exceed 1 s). Intra-turn breaks are possibly residual from #70's inter-word break injection (fix(tts): insert inter-word <break> tags to prevent Hebrew word merging). Per CLAUDE.md "What NOT to do", the per-word <break> tags are locked by the prompt rules from #83 (investigate(tts): Whisper WER regression on Tier A clips is NOT loudness-driven; bisect M15/#70/#71 prosody changes) and #86 (fix(tts): #83 partial, revert inter-word breaks, low-intensity scenes only), so pause configuration would need to be addressed by mixer-side config, not SSML.
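The cumulative-pause concern (a 0.3 s default adding up past 1 s across mid-sentence breaks) can be illustrated with a small sketch. The `Segment` data model, the `mid_sentence` flag, and the 1.0 s budget are hypothetical; only the 0.3 s default for `pause_before_s` comes from the issue text, and the real mixer may accumulate pauses differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """Hypothetical rendered segment of a turn (not the project's real type)."""
    text: str
    mid_sentence: bool = False      # assumed flag: segment starts mid-sentence
    pause_before_s: float = 0.3     # default quoted in the issue


def mid_sentence_pause_s(segments: List[Segment]) -> float:
    """Total silence inserted before mid-sentence segments of one turn."""
    return sum(s.pause_before_s for s in segments if s.mid_sentence)


def turns_over_budget(turns: Dict[str, List[Segment]],
                      budget_s: float = 1.0) -> List[str]:
    """Ids of turns whose mid-sentence pauses add up to more than budget_s."""
    return [turn_id for turn_id, segments in turns.items()
            if mid_sentence_pause_s(segments) > budget_s]


# Placeholder data: one turn with a single break, one with four.
turns = {
    "turn_01": [Segment("seg"), Segment("seg", mid_sentence=True)],
    "turn_02": [Segment("seg")]
               + [Segment("seg", mid_sentence=True) for _ in range(4)],
}

print(turns_over_budget(turns))
```

A check of this kind would sit on the mixer side, which is consistent with the constraint that the SSML break tags themselves stay untouched.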
Note for future work
Native-speaker listening tests are expensive on Shay's time. Aggregate findings under one issue per session rather than fragmenting per symptom; only fork into a per-symptom issue when a specific fix is queued.