
spike(eval): M17 Phase C — multimodal LLM judge gates (script + report stub)#113

Merged
shaypal5 merged 3 commits into main from spike/m17-phase-c-llm-judge on May 13, 2026

Conversation

@shaypal5
Member

Problem

Issues #92 and #97 are both blocked on the same input: a native-speaker listening test. The May-3 and May-6 sessions already consumed Shay's listening time on overlapping symptoms and we still don't have a closed loop on the I3-I5 distress-cue question or the broader naturalness backlog. Every TTS-side lever (mstts:express-as styles, disfluency density, per-phrase prosody jitter, Google Chirp VIC, …) currently requires a fresh listening test to validate — which means most of them stay unvalidated.

The eval-loop gap is well-scoped in docs/automated_eval_design.md §E5 (Multimodal LLM Judge). Phase A landed E1 (ASR) and blocked E2 (UTMOS) in docs/m17_phase_a_validation_report.md. E5 has been done once, manually, on a single clip (docs/debug_run_1/llm_feedbacks.md, 2026-04-14) — never as a repeatable workflow.

What this PR does

Lands the Phase C spike artifacts ready for execution. Same shape as the Phase A spike that produced docs/m17_phase_a_validation_report.md:

  • scripts/m17_phase_c_validation.py — the spike runner. 6 clips × 2 models (Gemini 2.5 Pro + GPT-4o-audio-preview) × 2 reruns. Structured JSON output via a fixed 8-dimension schema. Four independent gates per model: refusal, discrimination, rerun variance, and Shay-correlation (see the sketch after this list; full definitions are in the commit message below).
  • docs/m17_phase_c_validation_report.md — narrative report stub. All numeric cells marked TBD; recommendation matrix pre-filled per gate-outcome scenario so the go/no-go is mechanical once the script runs.
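
For orientation, a minimal sketch of how those four gate checks could be computed from per-run scores. Names such as `mean_scores`, `rerun_stds` and `expected_rank` are illustrative; only the thresholds and clip labels come from the spike design:

```python
# Illustrative sketch of the four gate checks; not the script's actual code.
import statistics
from scipy.stats import spearmanr


def evaluate_gates(mean_scores, rerun_stds, n_scored_clips, expected_rank):
    """mean_scores: clip label -> mean judge score across dimensions and reruns.
    rerun_stds: per-dimension std-devs across the 2 reruns.
    expected_rank: clip label -> encoded expected quality ordering."""
    corpus = [v for k, v in mean_scores.items() if "__wn_snr" not in k]
    severe = mean_scores["sp_sv_a_0001_00__wn_snr_-10db"]
    labels = sorted(mean_scores)
    rho, _ = spearmanr([mean_scores[l] for l in labels],
                       [expected_rank[l] for l in labels])
    return {
        "refusal_gate": n_scored_clips >= 5,                             # >= 5/6 clips scored
        "discrimination_gate": statistics.mean(corpus) - severe >= 0.5,  # corpus vs severe noise
        "variance_gate": max(rerun_stds) <= 0.5,                         # rerun stability
        "shay_correlation_gate": rho >= 0.3,                             # Spearman rho vs expected ranking
    }
```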

Clip set

| Label | Source | Kind | Why |
| --- | --- | --- | --- |
| sp_sv_a_0001_00 | corpus | SV | The canonical SV scene Shay + Gemini + Claude + ChatGPT reviewed manually in April. Provides ground-truth listening data to calibrate against. |
| sp_it_a_0001_00 | corpus | IT | The I3-I5 distress-absent clip from issue #97. |
| sp_neg_a_0001_00 | corpus | NEG | Hard-negative class: intense but has_violence: false. |
| sp_neu_a_0001_00 | corpus | NEU | Neutral baseline — expected highest perceived quality. |
| sp_sv_a_0001_00__wn_snr_+10db | degraded | SV | Mild noise — mid-anchor. |
| sp_sv_a_0001_00__wn_snr_-10db | degraded | SV | Severe noise — discrimination negative anchor. |

Degradations use the same apply_white_noise + rms_normalize_to_match helpers as Phase A so the spectral content (not loudness) is what the judge has to discriminate.
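
For intuition, a standalone sketch of an RMS-matched white-noise degradation at a target SNR. The repo's apply_white_noise / rms_normalize_to_match helpers are the source of truth; this version only illustrates the idea:

```python
# Illustrative only; the Phase A helpers in the repo are the real implementation.
import numpy as np


def degrade_with_white_noise(clean: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """Add white noise at the requested SNR, then RMS-match back to the clean clip
    so loudness is not what the judge ends up discriminating."""
    rms = float(np.sqrt(np.mean(clean ** 2)))
    noise_rms = rms / (10.0 ** (snr_db / 20.0))  # SNR(dB) = 20*log10(signal_rms / noise_rms)
    noise = np.random.default_rng(seed).normal(0.0, noise_rms, clean.shape)
    noisy = clean + noise
    return noisy * (rms / float(np.sqrt(np.mean(noisy ** 2))))
```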

Pre-flight cost

estimated cost: $3.05  (25.8 min audio × 2 reruns × 2 models)
  gemini: $0.35
  openai: $2.70

Hard cap: SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD = $5 (env var, configurable). The script aborts pre-flight if the estimate exceeds the cap, and aborts mid-run if the cumulative actual spend exceeds it.
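
Roughly, the cap behaves like the sketch below (illustrative; estimate_call_usd and run_one_call stand in for the script's own helpers):

```python
# Sketch of the budget-cap behaviour; the helper callables are assumed, not the script's API.
import os


def run_with_budget(call_plan, estimate_call_usd, run_one_call):
    budget_cap = float(os.environ.get("SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD", "5.0"))

    estimated = sum(estimate_call_usd(call) for call in call_plan)
    if estimated > budget_cap:  # pre-flight abort
        raise SystemExit(f"estimate ${estimated:.2f} exceeds cap ${budget_cap:.2f}")

    cumulative = 0.0
    for call in call_plan:
        cumulative += run_one_call(call)  # each call returns its metered/estimated cost
        if cumulative >= budget_cap:      # mid-run abort
            print(f"ABORT: cumulative ${cumulative:.2f} >= cap ${budget_cap:.2f}")
            break
```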

What happens after this lands

  1. Wire keys + run the spike. GEMINI_API_KEY + OPENAI_API_KEY in .envrc. Single execution, ~$3 of spend, ~30 min wall-clock. Fills in the TBD cells in docs/m17_phase_c_validation_report.md and writes state/spikes/m17_phase_c/results.json.
  2. Make the go/no-go call from the gate table. The report's Recommendation matrix is pre-filled per outcome scenario, so this is mechanical.
  3. If GO → open follow-up issues for the Phase C MVP: synthbanshee/eval/llm_judge.py module, anchor-set regression CI workflow, calibration against Shay's listening tests.
  4. If NO-GO → record the failure mode and pivot the eval roadmap (paid linguistic raters, crowd eval, or wait for next-generation multimodal models).

Test plan

  • --dry-run mode prints the prompt + JSON schema without calling any API — verified locally; output reviewed for clarity and content-safety framing
  • Cost estimate matches expected pricing for Gemini 2.5 Pro audio + GPT-4o-audio-preview at the 6-clip × 2-rerun × 2-model fan-out
  • Degraded variant materialisation writes state/spikes/m17_phase_c/sp_sv_a_0001_00__wn_snr_*.wav under the gitignored spike dir
  • ruff check + ruff format --check pass on scripts/m17_phase_c_validation.py
  • mypy scripts/m17_phase_c_validation.py passes
  • Deferred until keys are wired — single end-to-end run on real APIs; fill TBD cells in the report; make go/no-go call

Not in scope for this PR

  • Wiring GEMINI_API_KEY / OPENAI_API_KEY to .envrc — defer until you've decided to run the spend
  • Production synthbanshee/eval/llm_judge.py module — Phase C MVP, conditional on this spike passing
  • Anchor-set regression CI workflow — same, MVP-phase
  • Updating docs/implementation_plan.md to reflect Phase C kickoff — handle separately after the spike runs

spike(eval): M17 Phase C — multimodal LLM judge gates (script + report stub)

Mirrors the Phase A spike protocol (scripts/m17_phase_a_validation.py +
docs/m17_phase_a_validation_report.md) for E5 — the Multimodal LLM Judge
evaluator from docs/automated_eval_design.md §E5. This unlocks the eval
loop that issues #92 and #97 are currently blocked on (both require
native-speaker listening which doesn't scale).

Script (scripts/m17_phase_c_validation.py):

- 6-clip set: 4 corpus typology spans (SV/IT/NEG/NEU, all M2a-wettest
  agg_m_30-45_001) + 2 degraded variants of sp_sv_a_0001_00 (white noise
  at +10 dB and -10 dB SNR, RMS-matched to the clean source so the
  judge is scoring spectral content, not amplitude).
- Two models: gemini-2.5-pro (audio input) and gpt-4o-audio-preview.
- Two reruns per (clip, model) for within-model variance measurement.
- Structured JSON output via a fixed schema (8 quality dimensions on a
  1-5 scale + artifacts_detected + confidence_in_assessment + summary).
- Four gate outcomes per model:
    refusal_gate         ≥ 5/6 clips scored under DV-research framing
    discrimination_gate  mean(corpus) − mean(severe-degraded) ≥ 0.5
    variance_gate        per-dim std across 2 reruns ≤ 0.5
    shay_correlation_gate Spearman ρ ≥ 0.3 vs encoded expected ranking
- Gemini BLOCK_NONE safety per design doc §E5 content-sensitivity
  guidance (research framing only — DV-content tolerance is required to
  avoid refusals on the metadata; the audio is entirely synthetic and
  contains no real persons).
- Hard budget cap via SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD (default $5);
  aborts pre-flight on estimate, aborts mid-run on cumulative actual.
- --dry-run flag prints the prompt + schema without calling any API,
  for pre-flight prompt audit.
- Lazy imports for google-genai / openai so the module is cheap to
  import even without the deps installed.

Report stub (docs/m17_phase_c_validation_report.md):

- Same structure as the Phase A report (TL;DR gate table, Reproduce,
  Clip-set manifest, Gate definitions, Per-model results, Failure-mode
  notes, Recommendation matrix, Limitations, Cost report).
- All numeric cells marked TBD pending real-run results.
- Recommendation matrix is pre-filled per gate-outcome scenario so the
  go/no-go decision is mechanical once the script runs.

Cost estimate at default settings: ~$3.05 total ($0.35 Gemini + $2.70
GPT-4o), well within the $5 cap. Dry-run verified end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 20:28
@shaypal5 shaypal5 added planning comp: tts TTS rendering, SSML, Azure/Google providers labels May 13, 2026

Copilot AI left a comment


Pull request overview

Adds the Phase C “multimodal LLM judge” spike artifacts for M17 E5, enabling a repeatable evaluation run (Gemini + OpenAI audio models) with per-model gate outcomes and a report template to unblock issues #92/#97 without additional immediate listening tests.

Changes:

  • Introduces scripts/m17_phase_c_validation.py to run a 6-clip × 2-model × 2-rerun structured-scoring experiment and compute four acceptance gates.
  • Adds docs/m17_phase_c_validation_report.md as the narrative report stub, meant to be filled from spike outputs in state/spikes/m17_phase_c/.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| scripts/m17_phase_c_validation.py | New Phase C spike runner: clip prep (incl. degradations), model calls, cost/budget handling, gate evaluation, and auto-report generation. |
| docs/m17_phase_c_validation_report.md | New report stub describing gates, clip set, reproduction steps, and how to interpret outcomes. |
Comments suppressed due to low confidence (2)

scripts/m17_phase_c_validation.py:223

  • The docstring and inline docs refer to “structured JSON responses against a Pydantic schema” and include pydantic in the install command, but the script never imports/uses Pydantic; it builds a plain JSON Schema dict. Please either switch to an actual Pydantic model (and derive JSON Schema from it) or update the docs/install instructions to remove the Pydantic reference so the setup instructions match reality.
# --- Pydantic schema ---------------------------------------------------------
def make_response_schema() -> dict:
    """Build the JSON Schema used to constrain both Gemini and OpenAI output."""
    return {
        "type": "object",
        "properties": {
            **{d: {"type": "integer", "minimum": 1, "maximum": 5} for d in DIMENSIONS},
            "artifacts_detected": {"type": "boolean"},
            "artifact_notes": {"type": "string"},
            "confidence_in_assessment": {"type": "integer", "minimum": 1, "maximum": 5},
            "summary": {"type": "string"},
        },
        "required": [
            *DIMENSIONS,
            "artifacts_detected",
            "artifact_notes",
            "confidence_in_assessment",
            "summary",
        ],
        "additionalProperties": False,
    }
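
If the Pydantic route were chosen instead, a sketch of an equivalent model (assumes pydantic v2; DIMENSIONS is the script's existing list of the 8 dimension names):

```python
# Sketch of the Pydantic alternative the comment suggests; not what the script does today.
from pydantic import ConfigDict, Field, create_model

JudgeResponse = create_model(
    "JudgeResponse",
    __config__=ConfigDict(extra="forbid"),  # mirrors "additionalProperties": False
    **{d: (int, Field(ge=1, le=5)) for d in DIMENSIONS},
    artifacts_detected=(bool, ...),
    artifact_notes=(str, ...),
    confidence_in_assessment=(int, Field(ge=1, le=5)),
    summary=(str, ...),
)

response_schema = JudgeResponse.model_json_schema()      # JSON Schema for the API request
# parsed = JudgeResponse.model_validate_json(raw_reply)  # strict parse of a model reply
```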

scripts/m17_phase_c_validation.py:544

  • Same as Gemini path: if JSON decoding fails, the run is marked refused but error is left None. Recording the JSONDecodeError (or at least a sentinel like "invalid_json") would make it much easier to triage schema/prompt issues from true policy refusals.
    raw = resp.choices[0].message.content or ""
    try:
        parsed = json.loads(raw)
        refused = False
    except json.JSONDecodeError:
        parsed = None
        refused = True
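
A minimal variant of that block along the lines the comment suggests, recording the decode failure in the result's error field (sketch; surrounding names follow the snippet above):

```python
raw = resp.choices[0].message.content or ""
error = None
try:
    parsed = json.loads(raw)
    refused = False
except json.JSONDecodeError as exc:
    parsed = None
    refused = True
    error = f"invalid_json: {exc}"  # separates malformed JSON from a true policy refusal
```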



Comment thread scripts/m17_phase_c_validation.py Outdated
Comment on lines +845 to +852
if cumulative_usd >= budget_cap:
    print(
        f"ABORT: cumulative ${cumulative_usd:.2f} ≥ budget ${budget_cap:.2f}. "
        "Partial results will still be written.",
        flush=True,
    )
    break
if model == "gemini":
Comment on lines +331 to +339
# Per Anthropic/Google/OpenAI pricing snapshots as of January 2026. Update if
# the spike is rerun against newer model versions. These are upper-bound
# estimates: actual cost is metered per-token by each provider and the
# script's running total uses that, not these.
GEMINI_AUDIO_USD_PER_MIN = 0.0125 # gemini-2.5-pro audio input
GEMINI_OUTPUT_USD_PER_KTOK = 0.0050
OPENAI_AUDIO_INPUT_USD_PER_MIN = 0.10 # gpt-4o-audio-preview
OPENAI_OUTPUT_USD_PER_KTOK = 0.020

Comment on lines +720 to +721
for model, g in payload["gates"].items():
    ref = f"{g['refusal_gate']['scored_clips']}/{g['refusal_gate']['min_required']}"
.venv/bin/python scripts/m17_phase_c_validation.py
.venv/bin/python scripts/m17_phase_c_validation.py --dry-run # prompts only, no API calls

Cost ceiling (rough): ~$0.50–1.50 total across both models at 6 clips × 2 reruns.
Comment on lines +31 to +49
```bash
uv pip install --python .venv/bin/python \
google-genai openai pydantic soundfile numpy scipy

export GEMINI_API_KEY=...
export OPENAI_API_KEY=...
export SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD=5.0 # hard cap

# Optional dry run — prints the prompt + JSON schema, no API calls, no spend.
.venv/bin/python scripts/m17_phase_c_validation.py --dry-run

# Full run.
.venv/bin/python scripts/m17_phase_c_validation.py
```

The script prepares 6 clip records (4 corpus typology spans + 2 degraded
variants of `sp_sv_a_0001_00`), sends each to each model twice, parses
structured JSON responses against a Pydantic schema, and writes gate
outcomes to `results.json` + an auto-generated markdown summary.
Comment on lines +457 to +464
raw = resp.text or ""
try:
    parsed = json.loads(raw)
    refused = False
except json.JSONDecodeError:
    parsed = None
    refused = True

Comment on lines +86 to +88
# ordinal: 4 = best perceived, 1 = worst. Ties are deliberate where evidence
# doesn't separate clips. Used only for the Spearman gate; if you don't have
# a strong prior for a clip, leave it tied with neighbours rather than guess.
…ore running

Critical:
- Replace wn_snr_+10db mid-anchor with synth_rate_slow_0.7x — tests synthesis
  defect detection (unnatural tempo + pitch-shift + truncated arc) instead of
  measuring only whether the model can hear noise
- Add --probe-metadata-bias flag: runs a no_arc prompt variant on corpus clips
  to measure whether emotional_expression / escalation_arc are scored from audio
  or inferred from the intensity-arc label; delta is reported per-clip per-dim
  in both results.json and report_auto.md

High:
- Add failure_reason field to JudgeResult ("ok" / "content_refusal" /
  "json_parse_error" / "api_error") so only content refusals count against the
  refusal gate; api_error and json_parse_error are retried, not penalised
- Incremental write: results_partial.jsonl is appended after each call;
  --resume loads prior results and skips completed (clip, model, run, variant)
  tuples — an API failure or budget-cap abort no longer loses all spend
- Drop shay_correlation and variance from overall_pass (advisory only);
  add notes: variance is trivially PASS at TEMPERATURE=0 greedy decoding,
  Shay rho is not statistically significant at n=4
- Discrimination gate now tests two independent arms — noise_corruption and
  synth_failure — reporting both separations; passing either arm clears the gate

Low:
- Fix report stub: gate table columns aligned with report_auto.md output,
  hand-written per-run tables removed (replaced with pointers to auto-report),
  limitations section updated to reflect all of the above

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
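
A minimal sketch of the incremental-write / --resume behaviour described in the High items above (the results_partial.jsonl path and the (clip, model, run, variant) key come from the commit message; function names are illustrative):

```python
# Illustrative sketch, not the script's actual implementation.
import json
from pathlib import Path

PARTIAL = Path("state/spikes/m17_phase_c/results_partial.jsonl")


def completed_keys() -> set:
    """(clip, model, run, variant) tuples already present in the partial file."""
    done = set()
    if PARTIAL.exists():
        for line in PARTIAL.read_text().splitlines():
            rec = json.loads(line)
            done.add((rec["clip"], rec["model"], rec["run"], rec["variant"]))
    return done


def append_result(rec: dict) -> None:
    """Append one judge result so an API failure or budget-cap abort loses nothing."""
    PARTIAL.parent.mkdir(parents=True, exist_ok=True)
    with PARTIAL.open("a") as fh:
        fh.write(json.dumps(rec) + "\n")
```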

@shaypal5 shaypal5 added this to the M17 milestone May 13, 2026
@shaypal5 shaypal5 added comp: eval Automated evaluation, LLM judges, MOS/UTMOS, ASR metrics and removed comp: tts TTS rendering, SSML, Azure/Google providers labels May 13, 2026
…imination gate

Update the gate bullet-list in the module docstring to match what the code
actually does after the review-fixes commit: two discrimination arms
(noise_corruption comparability + synth_failure real target), advisory notes
on variance and Shay-correlation, failure_reason semantics for refusal gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 20:58
@github-actions

pr-agent-context report:

This run includes unresolved review comments on PR #113 in repository https://github.com/DataHackIL/SynthBanshee

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: scripts/m17_phase_c_validation.py
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267579
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    Mid-run budget abort only breaks out of the innermost rerun loop, so the script will continue iterating over subsequent clips/models and repeatedly print the ABORT message without actually stopping the run. Consider breaking out of all loops (e.g., return immediately, raise SystemExit, or use a flag checked at each nesting level) so the budget cap reliably halts further API calls and reduces confusing output.
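
For illustration, one way the accepted fix could look, raising a dedicated exception so the cap halts every loop level (sketch; loop and helper names are assumed):

```python
# Sketch only; the real fix may use a flag or an early return instead.
class BudgetExceeded(Exception):
    pass


def run_all(models, clips, n_reruns, run_one_call, budget_cap):
    cumulative_usd = 0.0
    try:
        for model in models:
            for clip in clips:
                for rerun in range(n_reruns):
                    cumulative_usd += run_one_call(clip, model, rerun)
                    if cumulative_usd >= budget_cap:
                        raise BudgetExceeded(
                            f"cumulative ${cumulative_usd:.2f} >= budget ${budget_cap:.2f}"
                        )
    except BudgetExceeded as exc:
        print(f"ABORT: {exc}. Partial results will still be written.", flush=True)
```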

## COPILOT-2
Location: scripts/m17_phase_c_validation.py:429
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267639
Root author: copilot-pull-request-reviewer

Comment:
    The comments claim the “running total uses” provider-metered costs, but usd_cost is computed from hard-coded price constants plus audio duration + output tokens only. This can undercount (e.g., ignores text prompt tokens and any provider-specific audio token accounting), making the mid-run budget cap unreliable. Either (a) compute cost from provider usage fields that correspond to billable units (if available) or (b) clearly label usd_cost/cumulative_usd as an estimate and enforce the budget conservatively (e.g., include prompt tokens, add a safety margin, or stop based on estimated remaining spend).

## COPILOT-3
Location: scripts/m17_phase_c_validation.py:957
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267673
Root author: copilot-pull-request-reviewer

Comment:
    Refusal gate display uses `{scored_clips}/{min_required}` (e.g., `6/5`), which is confusing because the criterion is “≥ 5/6 clips scored”. It would be clearer to display `{scored_clips}/{len(clips)}` (and optionally keep the threshold separately) so the table communicates the actual denominator.
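
The clearer display could be as small as the line below (sketch; names follow the gate snippet quoted earlier in the review):

```python
ref = (
    f"{g['refusal_gate']['scored_clips']}/{len(clips)}"
    f" (threshold {g['refusal_gate']['min_required']})"
)
```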

## COPILOT-4
Location: scripts/m17_phase_c_validation.py:49
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267704
Root author: copilot-pull-request-reviewer

Comment:
    The top-level docstring’s cost estimate (“~$0.50–1.50 total across both models”) appears inconsistent with the script’s own pricing constants (OpenAI audio input alone is $0.10/min, so ~25+ minutes of audio × reruns can exceed that). Please update the stated range (or remove it) so it doesn’t mislead someone about expected spend.

    This issue also appears on line 202 of the same file.

## COPILOT-5
Location: docs/m17_phase_c_validation_report.md:57
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267738
Root author: copilot-pull-request-reviewer

Comment:
    This report (and the reproduce install command) says the script “parses structured JSON responses against a Pydantic schema” and lists `pydantic` as a dependency, but `scripts/m17_phase_c_validation.py` currently uses a hand-built JSON Schema dict and does not import Pydantic. Update the report to match the implementation (or update the script to actually use Pydantic-derived schema) so the reproduction steps are accurate.

## COPILOT-6
Location: scripts/m17_phase_c_validation.py:610
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267759
Root author: copilot-pull-request-reviewer

Comment:
    On JSON parse failure, the result is marked refused but `error` remains None, which makes it harder to debug whether the model refused vs. returned malformed JSON vs. a truncation issue. Consider setting `error` (e.g., to the JSONDecodeError message) when decoding fails so partial runs have actionable diagnostics.

    This issue also appears on line 537 of the same file.

## COPILOT-7
Location: scripts/m17_phase_c_validation.py:100
URL: https://github.com/DataHackIL/SynthBanshee/pull/113#discussion_r3237267787
Root author: copilot-pull-request-reviewer

Comment:
    The comment says `expected_quality_rank` is an ordinal where 4=best and 1=worst, but the degraded severe clip is assigned rank 0. Either adjust the encoding comment (e.g., allow 0 as “worse than worst corpus”) or keep the ranks within the documented 1–4 range to avoid confusion when interpreting `results.json` / manifest tables.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25826048300 attempt 1
Comment timestamp: 2026-05-13T20:59:57.420454+00:00
PR head commit: 7598cb08412d8e4ee826eddf9d654c0029a34fbe


Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

Comment on lines +49 to +50
Cost ceiling (rough): ~$0.50–1.50 total across both models at 6 clips × 2 reruns.
The hard cap defaults to $5; raise/lower via SYNTHBANSHEE_LLM_SPIKE_BUDGET_USD.
Comment on lines +611 to +617
usage = getattr(resp, "usage_metadata", None)
in_tok = getattr(usage, "prompt_token_count", None) or 0
out_tok = getattr(usage, "candidates_token_count", None) or 0
cost = (clip.duration_s / 60.0) * GEMINI_AUDIO_USD_PER_MIN + (
    out_tok / 1000.0
) * GEMINI_OUTPUT_USD_PER_KTOK

Comment on lines +706 to +708
cost = (clip.duration_s / 60.0) * OPENAI_AUDIO_INPUT_USD_PER_MIN + (
    out_tok / 1000.0
) * OPENAI_OUTPUT_USD_PER_KTOK
Comment on lines +784 to +788
scored_clips = {
    r.clip_label for r in runs if r.failure_reason not in ("content_refusal",) and r.parsed
}
content_refusals = [r for r in runs if r.failure_reason == "content_refusal"]
refusal_pass = len(scored_clips) >= REFUSAL_GATE_MIN_SCORED
]
print(f"call plan: {len(call_plan)} total, {len(pending)} pending", flush=True)

# --- Run -----------------------------------------------------------------
Setup:

uv pip install --python .venv/bin/python \\
google-genai openai pydantic soundfile numpy scipy

```bash
uv pip install --python .venv/bin/python \
  google-genai openai pydantic soundfile numpy scipy
```
@shaypal5 shaypal5 merged commit fc65a83 into main May 13, 2026
7 checks passed
@shaypal5 shaypal5 deleted the spike/m17-phase-c-llm-judge branch May 13, 2026 21:12

Labels

comp: eval Automated evaluation, LLM judges, MOS/UTMOS, ASR metrics planning



2 participants