
investigate(tts): #83 residual — Whisper WER regression on high-intensity (I3+) Tier A clips #87

@shaypal5

Spun out of PR #86 after the bisect that closed #83 turned out to be only a partial fix.

Background

PR #86 reverted PR #70 (inter-word `<break time="50ms"/>` tags) and verified the revert resolves the Whisper WER regression on `sp_neu_a_0001` (intensity arc `[1,1,1,2,1]` — almost all I1):

| variant | duration | WER | length-ratio |
|---|---|---|---|
| sp_neu 04-15 ref | 121.0 s | 0.079 | 1.000 |
| sp_neu current main | 197.8 s | 0.286 | 0.753 |
| sp_neu PR #86 (revert #70+#71) | 122.9 s | 0.048 | 0.995 |

But verifying the same revert on a high-intensity scene (`sp_it_a_0001`, intensity arc `[1,2,3,4,5,4,3]`) shows the regression persists:

| variant | duration | WER | length-ratio |
|---|---|---|---|
| sp_it 04-15 ref | 155.9 s | 0.056 | 1.009 |
| sp_it PR #86 (revert #70+#71) | 146.6 s | 0.322 | 0.709 |

The Hebrew text is byte-identical between the 04-15 reference and the revert branch (verified by `diff` — empty). So the residual sp_it WER gap is purely audio rendering, not text content. Length-ratio 0.709 is the same Whisper-silence-detector fingerprint as #83 (~28% of words missing).
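As a rough aid for reading the length-ratio column, the sketch below converts a ratio into an approximate share of missing words. It assumes length-ratio means hypothesis word count divided by reference word count, which this issue does not state explicitly; `length_ratio` and `missing_fraction` are hypothetical helper names, not functions from the harness.

```python
def length_ratio(reference: str, hypothesis: str) -> float:
    """Hypothesis word count divided by reference word count (assumed definition)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return len(hypothesis.split()) / len(ref_words)


def missing_fraction(ratio: float) -> float:
    """Approximate share of reference words absent from the hypothesis, floored at 0."""
    return max(0.0, 1.0 - ratio)


# Ratios copied from the tables above.
for label, ratio in [("sp_neu current main", 0.753), ("sp_it PR #86", 0.709)]:
    print(f"{label}: {missing_fraction(ratio):.1%} of words missing (approx.)")
```

A ratio above 1.0 (such as the 1.009 of the sp_it reference) is clamped to zero missing words, since in that case the hypothesis is slightly longer than the reference.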

What this means

There is a second TTS-side change in the post-2026-04-15 window that fires at I3+. PR #86's central claim of "sole material cause" was wrong; #86 lands as a partial fix that closes nothing automatically. This issue tracks the actual residual cause.

Suspects (sorted by prior, re-derived)

This list is broader than the original #83 list because the original list missed a candidate (#74).

  1. #74, "fix(mixer): #65 Lombard spectral tilt at I4–I5" (May 4, `mixer.py`). Modifies the spectral envelope of high-intensity audio. Strong prior: it acts at the exact intensities where the residual regression manifests, and it changes spectral content (which is what Whisper's encoder consumes). This was missed in the original #83 suspect list.
  2. #51 (M15), "feat(m15): SSML prosody tuning with research-validated Hebrew parameters". SSML prosody tuning at I3+. At I3/I4/I5 the AGG rate multipliers are 1.06×/1.10×/1.14×. Was previously dismissed as "negligible" based on sp_neu's I1-only arc; that dismissal is invalid for sp_it.
  3. #68, "fix(config): halve pitch escalation at I4–I5 to eliminate helium effect". Pitch caps at I4/I5. Was dismissed for sp_neu (which never reaches I4); back in scope for sp_it.
  4. #75, "fix(mixer): #66 BARGE_IN audible-overlap crossfade past TTS trailing silence". BARGE_IN crossfade. Lower prior; only relevant if sp_it has barge-ins.
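The scoping argument in the list above can be sketched as a small lookup: a suspect is in scope for a scene only if the scene's intensity arc reaches the level at which that change is described as acting. The thresholds below are read off the suspect descriptions and are an interpretation, not values taken from the codebase; `Suspect` and `suspects_in_scope` are hypothetical names. #75 is left out because it depends on barge-ins rather than intensity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suspect:
    pr: int
    summary: str
    min_intensity: int  # lowest intensity at which the change is described as acting


# Assumed thresholds, taken from the wording of the suspect list.
SUSPECTS = [
    Suspect(74, "Lombard spectral tilt", 4),
    Suspect(51, "M15 SSML prosody tuning", 3),
    Suspect(68, "pitch caps", 4),
]

# Intensity arcs as quoted in the Background section.
ARCS = {
    "sp_neu_a_0001": [1, 1, 1, 2, 1],
    "sp_it_a_0001": [1, 2, 3, 4, 5, 4, 3],
}


def suspects_in_scope(arc):
    """Return PR numbers whose intensity threshold is reached by the arc's peak."""
    peak = max(arc)
    return [s.pr for s in SUSPECTS if peak >= s.min_intensity]


for scene, arc in ARCS.items():
    print(scene, suspects_in_scope(arc))
```

Under these assumed thresholds, sp_neu never reaches any of the three intensity-gated suspects, while sp_it reaches all of them, which is why a clean sp_neu result says nothing about them.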

Suggested investigation path

  1. Render `sp_it_a_0001` on current main (~17 Azure calls) to establish the actual "bad" baseline number we're missing. PR #86 only has the partial-fix WER, not the un-reverted WER.
  2. Bisect by additionally reverting one suspect at a time on top of the PR #86 branch (or its merged equivalent), in the order above. Render `sp_it_a_0001` with the same `random_seed` at each step, run Whisper, and log WER + length-ratio.
  3. Identify the dominant contributor: the PR whose additional revert drops sp_it WER below ~0.10 (the 04-15 baseline range).
  4. Propose remediation to Shay before writing it. Same constraint as #83: do not change SSML/prosody defaults from the M15 listening-test calibration without sign-off.
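Steps 2 and 3 above can be sketched as a small record-and-select helper. This is only an illustration of the bookkeeping: `BisectStep` and `dominant_contributor` are hypothetical names, the render and Whisper runs happen elsewhere, and the 0.10 threshold is the approximate figure from step 3.

```python
from dataclasses import dataclass
from typing import List, Optional

WER_TARGET = 0.10  # approximate upper edge of the 04-15 baseline range (step 3)


@dataclass(frozen=True)
class BisectStep:
    reverted_pr: int     # suspect reverted on top of the PR #86 branch
    wer: float           # Whisper WER measured on sp_it_a_0001 for this step
    length_ratio: float  # length-ratio logged alongside the WER


def dominant_contributor(
    steps: List[BisectStep], target: float = WER_TARGET
) -> Optional[BisectStep]:
    """Return the step with the lowest WER among those below target, or None."""
    below = [s for s in steps if s.wer < target]
    return min(below, key=lambda s: s.wer) if below else None
```

If `dominant_contributor` returns `None`, no single additional revert restores the baseline, and the next thing to try would be reverting suspects in combination.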

Reproduction (same harness as #83)

.venv/bin/python -c "
from pathlib import Path
import soundfile as sf, torch, sys
from jiwer import wer
from transformers import pipeline
sys.path.insert(0, 'scripts')
from m17_phase_a_validation import normalize_for_wer

# Whisper large-v3 on Apple MPS when available, otherwise CPU.
asr = pipeline('automatic-speech-recognition', model='openai/whisper-large-v3',
               device=torch.device('mps' if torch.backends.mps.is_available() else 'cpu'),
               torch_dtype=torch.float32, chunk_length_s=30)

# Add one (label, path) entry per render to compare.
for label, path in [
    ('04-15 ref', 'data/m2a_wettest/agg_m_30-45_001/sp_it_a_0001_00.wav'),
]:
    wav, sr = sf.read(path, dtype='float32')
    if wav.ndim > 1: wav = wav.mean(axis=1)  # downmix to mono
    txt = Path(path).with_suffix('.txt').read_text(encoding='utf-8')
    # Reference text: drop blank lines and lines starting with a bracket.
    ref = '\n'.join(l for l in txt.splitlines() if l and not l.startswith('[')).strip()
    # Greedy decoding (one beam, no sampling) keeps the transcript deterministic.
    out = asr({'raw': wav.copy(), 'sampling_rate': sr},
              generate_kwargs={'language': 'he', 'task': 'transcribe',
                               'num_beams': 1, 'do_sample': False})
    print(f'{label}: WER={wer(normalize_for_wer(ref), normalize_for_wer(out[\"text\"])):.3f}')
"

The 04-15 reference WAV exists at `data/m2a_wettest/agg_m_30-45_001/sp_it_a_0001_00.wav` (155.9 s, WER 0.056).

Things NOT to do

References
