FidelityBench: introduce LLM Judge, v2 #1
The clean Alex judge stored priorOutcomeOrigin via findFirstUserTurnMatching
but never used it; the comment "currently unused but kept for future"
shipped with the code. As a result, the scenario claimed a 7-fact
relational constellation but scored only 6 of those facts.
Adds the 9th intent dimension: uses_prior_outcome. Detects whether the
draft mirrors the user's prior successful pushback pattern with Alex
("named the risk early" — explicitly on record as having worked).
Detection is lexical-floor only via PRIOR_OUTCOME_PATTERNS (8 positives /
0 false positives on hand-test fixtures). A future LLM-judge augmentation
can catch paraphrases that don't trip the regex.
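A minimal sketch of what lexical-floor detection means here. The two patterns are hypothetical stand-ins, since the real PRIOR_OUTCOME_PATTERNS are not reproduced in this description; only the quoted phrase "named the risk early" comes from the text above.

```typescript
// Hypothetical patterns for illustration only; the real PRIOR_OUTCOME_PATTERNS
// are not shown in this PR description.
const PRIOR_OUTCOME_PATTERNS_SKETCH: RegExp[] = [
  // Explicit echo of the recorded success: "named the risk early".
  /\bnam(?:e|ed|ing)\s+the\s+risk\s+early\b/i,
  // Intent-to-name verb anchored to risk/concern/timeline plus an "early" cue.
  /\b(?:flag|raise|name)\s+(?:the\s+|this\s+)?(?:risk|concern|timeline)\s+(?:early|up\s+front)\b/i,
]

// Lexical floor: the dimension is earned only if at least one pattern matches.
function usesPriorOutcome(draft: string): boolean {
  return PRIOR_OUTCOME_PATTERNS_SKETCH.some((pattern) => pattern.test(draft))
}

// A literal echo trips the regex; a paraphrase of the same idea does not.
const literalEcho = usesPriorOutcome("Last time I named the risk early and it landed well.")
const paraphrased = usesPriorOutcome("I'd rather surface my worry sooner this time.")
```

The paraphrase miss is the gap a later LLM-judge augmentation would be expected to cover.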
No memory-laundering guard: there is no corresponding RecallBurdenCategory
for "prior outcome" yet. Documented inline so a future PR can add
prior_outcome to PATTERNS.
Rubric: maxScore 100->105 (+5 from new dim), maxIntentFidelity 40->45.
Architecture-discriminating overflow variant of alex_pushback_001.
Composition:
- 7 load-bearing facts (6 originals + supersession event)
- 80 noise turns from a Gemini-authored corpus (authorship separation
per spec v2 §3.4; Gemini received pool definitions only, never the
load-bearing facts themselves)
- octile placement so windowed-transcript(K=20) misses 5 of 7
load-bearing facts but still sees the recent constraint + supersession
Configuration:
FIDELITYBENCH_OVERFLOW_N=80 total noise turns (default 80)
FIDELITYBENCH_OVERFLOW_SEED=42 deterministic noise sampling + placement
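As an illustration of how these two variables could drive deterministic sampling, here is a sketch under stated assumptions: the environment is passed in as a plain object, 42 is treated as the seed default, and a Lehmer (Park-Miller) generator stands in for whatever PRNG the scenario actually uses. The names readOverflowConfig and seededShuffle are hypothetical.

```typescript
type OverflowConfig = { n: number; seed: number }

// `env` is a parameter (not process.env) so the sketch stays self-contained.
function readOverflowConfig(env: Record<string, string | undefined>): OverflowConfig {
  const n = parseInt(env["FIDELITYBENCH_OVERFLOW_N"] ?? "80", 10)
  const seed = parseInt(env["FIDELITYBENCH_OVERFLOW_SEED"] ?? "42", 10) // default assumed
  if (!isFinite(n) || n < 0) throw new Error("FIDELITYBENCH_OVERFLOW_N must be a non-negative integer")
  if (!isFinite(seed)) throw new Error("FIDELITYBENCH_OVERFLOW_SEED must be an integer")
  return { n, seed }
}

// Lehmer / Park-Miller generator: an assumption, not the scenario's real PRNG.
function lehmer(seed: number): () => number {
  const M = 2147483647
  let state = (Math.abs(Math.floor(seed)) % (M - 1)) + 1 // keep state in [1, M-1]
  return () => {
    state = (state * 48271) % M
    return (state - 1) / (M - 1) // value in [0, 1)
  }
}

// Fisher-Yates shuffle on a copy; the same seed always yields the same order.
function seededShuffle<T>(pool: T[], seed: number): T[] {
  const rng = lehmer(seed)
  const copy = pool.slice()
  for (let i = copy.length - 1; i > 0; i -= 1) {
    const j = Math.floor(rng() * (i + 1))
    const tmp = copy[i] as T
    copy[i] = copy[j] as T
    copy[j] = tmp
  }
  return copy
}
```

Taking the first `n` entries of `seededShuffle(pool, seed)` would then give a reproducible noise sample for a fixed seed and pool.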
New judge dimensions on top of clean Alex's 9:
- attribution_fidelity (5pt): no Maya/Sarah/etc confusion with Alex
- honors_latest_intent (5pt): reflects the post-launch fix-window update
instead of zombie-intent Friday-risky framing
Rubric: 30 task + 55 intent (11 × 5) + 15 recall + 10 clar + 5 tools = 115.
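The rubric arithmetic can be checked with a short sketch. The component names and the helper maxScoreFor are hypothetical; the point values are the ones stated above, and the clean scenario is assumed to share the same non-intent components, which is consistent with its stated ceiling of 105.

```typescript
const POINTS_PER_INTENT_DIM = 5

// Sum the rubric ceilings for a scenario with `intentDims` intent dimensions.
function maxScoreFor(intentDims: number): { maxIntentFidelity: number; maxScore: number } {
  const task = 30
  const recall = 15
  const clarification = 10
  const tools = 5
  const maxIntentFidelity = intentDims * POINTS_PER_INTENT_DIM
  return {
    maxIntentFidelity,
    maxScore: task + maxIntentFidelity + recall + clarification + tools,
  }
}

const cleanAlexCeiling = maxScoreFor(9) // clean Alex: 9 intent dimensions
const overflowCeiling = maxScoreFor(11) // overflow: 9 + 2 new dimensions
```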
Noise corpus audit (PASS):
- 0 forbidden phrases
- 0 (Alex, Friday/scope/launch/risk/timeline/unreliable) co-occurrences
- 53% near-distractor density (target ≥25%)
- 121 turns total across 9 pools
Spec reference: /tmp/alex_pushback_overflow_001.spec.v2.md
Adversarial review by codex CLI flagged three HIGH-severity scoring bugs
in v0.2 overflow + one MEDIUM in shared prior-outcome detection. This
commit applies all four; the v0.2 pilot already ran with the unfixed
scoring (results retained for comparison; see overnight status report
for caveats).

H1 + H2: honors_latest_intent now requires absence of zombie patterns
Before: a draft saying "Friday is risky, push to Tuesday, mention a
post-launch fix window" tripped both LATEST_INTENT_HONOR_PATTERNS and
ZOMBIE_INTENT_PATTERNS but still received the dim because zombie was
only noted when !honorsLatest. This rewarded mixed old/new framing.
Now: honorsLatestHonored = honorsLatest && !zombieIntent &&
!askedConstraint. A new "MIXED INTENT" judge note is emitted for the
both-tripped case.

H3: task_success now requires honors_latest_intent for the 30-bracket
Before: an agent that ignored the supersession entirely could score
30/30 on task by presenting only the superseded "scope-cut + Tuesday"
framing, losing only 5 points on the latest-intent dim.
Now: 30 requires honorsLatestHonored. Stale tradeoff drops to 20.
Generic engagement drops to 15. No-engagement still 10.

M3: PRIOR_OUTCOME_PATTERNS tightened in BOTH clean and overflow files
Three overly-generic patterns dropped: "before (we) commit", "better to
flag/raise/...", "worth flagging". A normal pushback draft was earning
the dim without any memory of Alex's prior good response. Remaining
patterns require explicit early/risk-naming framing OR an
intent-to-name verb anchored to risk/concern/timeline.

Spec ref: /tmp/codex-findings-v0.2.md
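The H1 to H3 rules can be sketched as two small pure functions. The boolean names follow the formula quoted in the commit message; the Engagement type and the mapping of "stale tradeoff", "generic engagement" and "no-engagement" onto it are assumptions made for illustration, and the score for a generic draft that also honors the latest intent is not specified in the source.

```typescript
type IntentSignals = { honorsLatest: boolean; zombieIntent: boolean; askedConstraint: boolean }

// Hypothetical engagement levels; the real judge derives these from the draft.
type Engagement = "tradeoff" | "generic" | "none"

// H1 + H2: the dim is honored only if no zombie pattern (or constraint question) fired.
function honorsLatestHonored(s: IntentSignals): boolean {
  return s.honorsLatest && !s.zombieIntent && !s.askedConstraint
}

// Audit note for the both-tripped case.
function mixedIntentNote(s: IntentSignals): string | null {
  return s.honorsLatest && s.zombieIntent ? "MIXED INTENT" : null
}

// H3: the 30-bracket requires the honored dim; a stale tradeoff is capped at 20.
function taskSuccessScore(engagement: Engagement, honored: boolean): number {
  if (engagement === "none") return 10
  if (engagement === "generic") return 15 // assumed regardless of `honored`
  return honored ? 30 : 20
}

// The "Friday is risky ... post-launch fix window" draft trips both pattern sets.
const mixedDraft: IntentSignals = { honorsLatest: true, zombieIntent: true, askedConstraint: false }
const mixedHonored = honorsLatestHonored(mixedDraft)
const mixedTaskScore = taskSuccessScore("tradeoff", mixedHonored)
```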
Codex's adversarial review identified that honors_latest_intent is
trivially gameable by lexical regex: a draft saying "I'm concerned
about the post-launch fix window approach" passes the dim because the
phrase "post-launch fix window" appears, even though the draft is
explicitly rejecting the supersession.
The pilot confirmed this. The hybrid agent's draft scored 110/115
under v0.2 lexical scoring, but the actual draft says "I cannot commit
to Friday. Safer path: Tuesday gives us full scope" — clearly NOT
honoring the supersession. The lexical regex couldn't tell.
This commit adds a strict-by-design LLM judge augmentation:
1. New ScenarioAsyncJudge type in src/types.ts. Optional async
post-processing of a sync judge's result. Safety property: the
async judge can DOWNGRADE a regex-honored dim to fail, but cannot
upgrade a regex-failed dim to honored. That keeps paraphrase-
tolerance from laundering credit.
2. New module src/judges/honorsLatestLLMJudge.ts. Calls the configured
LLM provider (Bedrock Sonnet 4.5 by default). System prompt
includes 5 hand-labeled examples (2 honored, 3 not honored)
covering the lexical-pass-but-semantically-zombie case. Returns
{honors: bool, evidence: string}.
3. Runner awaits asyncJudge if defined. Errors gracefully fall back
to the sync result with a "[asyncJudge skipped]" note.
4. Overflow scenario exports asyncJudge that re-evaluates only the
honors_latest_intent dim, only when lexical regex already passed.
On LLM downgrade: dim flips to false, intent_fidelity recomputed,
task_success recomputed under v0.2-rc rules (30-bracket requires
honors_latest), total recomputed. Adds "LLM JUDGE DOWNGRADE" note
for audit.
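The downgrade path in item 4 can be sketched as a pure function. The Evaluation shape, the field names, and the assumption that a downgraded 30 falls to the 20 "stale tradeoff" bracket are illustrative, not taken from the actual scenario file.

```typescript
type Verdict = { honors: boolean; evidence: string }

// Hypothetical result shape; the real judge result type lives in src/types.ts.
type Evaluation = {
  intentDims: Record<string, boolean>
  taskSuccess: number
  intentFidelity: number
  otherPoints: number // recall + clarification + tools, untouched by the LLM judge
  total: number
  notes: string[]
}

const HONORS_DIM = "honors_latest_intent"

function applyVerdict(result: Evaluation, verdict: Verdict): Evaluation {
  // Downgrade-only: act only when the regex passed and the LLM disagrees.
  if (result.intentDims[HONORS_DIM] !== true || verdict.honors) return result

  const intentDims: Record<string, boolean> = { ...result.intentDims, [HONORS_DIM]: false }
  const honoredCount = Object.keys(intentDims).filter((k) => intentDims[k] === true).length
  const intentFidelity = honoredCount * 5
  // Assumption: losing the dim caps task_success at the 20 bracket.
  const taskSuccess = Math.min(result.taskSuccess, 20)

  return {
    ...result,
    intentDims,
    intentFidelity,
    taskSuccess,
    total: taskSuccess + intentFidelity + result.otherPoints,
    notes: result.notes.concat("LLM JUDGE DOWNGRADE: " + verdict.evidence),
  }
}
```

Because the function returns the input unchanged whenever the regex already failed, an LLM verdict of `honors: true` can never add credit.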
Validation (3 hand-test calls):
- Hybrid agent's actual pilot draft → DOWNGRADED ✓ ("recommends Tuesday
despite the superseding update")
- Positive control (genuine fix-window endorsement) → HONORED ✓
- Strong negative control (Tuesday safer / Friday risky) → NOT HONORED ✓
Disable with FIDELITYBENCH_DISABLE_LLM_JUDGE=1 if you want to compare
lexical-only vs LLM-augmented scoring on the same trials.
Cost per trial: ~$0.005 (one judge call when lexical passed).
Pull request overview
This PR introduces an optional “async judge” layer to augment existing regex-based judges with an LLM-based verification step (downgrade-only), adds LLM usage/cost tracking + budgeting, and ships a new Alex overflow scenario plus related artifacts/docs.
Changes:
- Add ScenarioAsyncJudge + runner support to await best-effort async judge augmentation with downgrade-only enforcement.
- Add LLM usage recording, cost estimation, and a max-cost budget guard for LLM calls; emit usage in CLI output / JSON artifacts.
- Add an LLM judge for honors_latest_intent and a new alex_pushback_overflow_001 scenario using it, plus golden checks and exploratory result artifacts/docs.
Reviewed changes
Copilot reviewed 24 out of 26 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/types.ts | Adds ScenarioAsyncJudge and asyncJudgeResults to support async/LLM judge augmentation and auditing. |
| src/runner.ts | Awaits optional asyncJudge and enforces “downgrade-only” semantics (with fallback-on-error behavior). |
| src/llm/usage.ts | New usage/cost tracking module, including budget cap enforcement. |
| src/llm/client.ts | Adds labeling + usage recording hooks and budget check before calls. |
| src/judges/honorsLatestLLMJudge.ts | New LLM-based verifier + parser for honors_latest_intent semantic validation. |
| src/index.ts | Resets usage per run; prints and emits llmUsage; adds optional loading of overflow scenario. |
| src/golden.ts | Adds lightweight golden checks for verdict parsing and the overflow judge. |
| src/agents/WindowedTranscriptLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/TranscriptLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/StatelessLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/GraphMemoryLLMAgent.ts | Adds per-call labels for usage attribution (extract/respond). |
| src/agents/FileMemoryLLMAgent.ts | Adds per-call labels for usage attribution (memory_update/respond). |
| src/agents/BlockMemoryLLMAgent.ts | Adds per-call labels for usage attribution (extract/respond). |
| scenarios/data/alex_pushback_overflow_001.noise.json | Adds the noise corpus used to construct the overflow timeline. |
| scenarios/alex_pushback_overflow_001.ts | New overflow scenario, judge, and downgrade-only async LLM augmentation for honors_latest_intent. |
| scenarios/alex_pushback_001.ts | Adds a new uses_prior_outcome dimension and updates scoring ceilings. |
| results/v0.3-exploratory-N20-windowed.json | Adds exploratory run artifact including llmUsage records. |
| results/v0.3-exploratory-N20-hybrid.json | Adds exploratory run artifact including llmUsage records. |
| results/v0.3-exploratory-N0-windowed.json | Adds exploratory run artifact including llmUsage records. |
| results/v0.3-exploratory-N0-hybrid.json | Adds exploratory run artifact including llmUsage records. |
| package.json | Adds npm run golden script. |
| docs/scorecards/v0.3-exploratory-alex-overflow.md | Documents exploratory smoke results, costs, and commands. |
| docs/prereg/alex-overflow-v0.3.md | Adds prereg candidate protocol including LLM judge + budgeting requirements. |
      ).map((entry) => ({
        ...entry,
        estimatedCostUsd: Math.round(entry.estimatedCostUsd * 10000) / 10000,
      })),
      records,
    }
getLlmUsageSummary() returns the module-level records array directly. Callers can mutate it (e.g., summary.records.push(...) or summary.records.length = 0) which would corrupt usage tracking and budget checks. Consider returning a shallow copy (and/or frozen copy) of records instead.
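One way to address this, sketched under assumptions: the record shape, the `usageRecords` array, and the function names below are hypothetical stand-ins for the module in src/llm/usage.ts. The summary hands out copies, so caller-side mutation cannot reach the tracked state.

```typescript
// Hypothetical record shape; the real one may carry more fields.
type LlmUsageRecord = { label: string; estimatedCostUsd: number }

// Module-level state, as in the reviewed code.
const usageRecords: LlmUsageRecord[] = []

function recordUsage(record: LlmUsageRecord): void {
  usageRecords.push(record)
}

function getUsageSummarySketch(): { estimatedCostUsd: number; records: LlmUsageRecord[] } {
  return {
    estimatedCostUsd: usageRecords.reduce((sum, r) => sum + r.estimatedCostUsd, 0),
    // New array of new objects: neither the list nor its entries are shared.
    records: usageRecords.map((r) => ({ ...r })),
  }
}
```

Freezing the copies with `Object.freeze` would additionally make accidental writes fail loudly in strict mode, at the cost of a slightly stricter return type.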
    export type LlmCallOptions = {
      messages: LlmMessage[]
      responseFormat?: "text" | "json_object"
      temperature?: number
      maxTokens?: number
      label?: string
    }
LlmCallOptions includes responseFormat, but callLlm/callAnthropic/callOpenAi/callBedrock never use it when calling generateText. With the new LLM judge relying on strict JSON, this option being a no-op is easy to misinterpret. Either plumb responseFormat through to the underlying SDK (if supported) or remove/rename it to avoid a misleading API surface.
    let augmented = result
    if (asyncJudge) {
      try {
        augmented = enforceAsyncJudgeDowngradeOnly(
          result,
          await asyncJudge(result),
        )
asyncJudge is invoked with the same result object that is later used as the original baseline in enforceAsyncJudgeDowngradeOnly. If an async judge mutates the passed-in object (even accidentally) and returns it, the baseline will already be modified and the downgrade-only enforcement can be bypassed. Consider passing a deep-cloned (or frozen) copy into asyncJudge, and comparing against an immutable snapshot of the original result.
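A sketch of the snapshot idea, with assumptions stated up front: the Scores shape and all function names are hypothetical, the judge is modelled as a synchronous function to keep the example short, and a shallow copy is used because the example object is flat (a nested result would need a deep clone).

```typescript
// Hypothetical flat score shape for illustration.
type Scores = { taskSuccess: number; intentFidelity: number; totalScore: number }

const SCORE_FIELDS: (keyof Scores)[] = ["taskSuccess", "intentFidelity", "totalScore"]

// Immutable baseline that the judge never sees.
function snapshotScores(scores: Scores): Readonly<Scores> {
  return Object.freeze({ ...scores })
}

function enforceDowngradeOnly(original: Readonly<Scores>, augmented: Scores): Scores {
  for (const field of SCORE_FIELDS) {
    if (augmented[field] > original[field]) {
      throw new Error(`judge attempted to increase ${field}: ${original[field]} -> ${augmented[field]}`)
    }
  }
  return augmented
}

// The judge receives its own copy, so in-place mutation cannot move the baseline.
function augmentSafely(result: Scores, judge: (r: Scores) => Scores): Scores {
  const baseline = snapshotScores(result)
  const judged = judge({ ...result })
  return enforceDowngradeOnly(baseline, judged)
}
```

In the real runner the judge call would be awaited, for example `await asyncJudge(copyOfResult)`, with the same baseline comparison afterwards.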
    if (augmented[field] > original[field]) {
      throw new Error(
        `asyncJudge attempted to increase ${field}: ${original[field]} -> ${augmented[field]}`,
      )
    }
enforceAsyncJudgeDowngradeOnly only checks augmented[field] > original[field]. If an async judge returns NaN (or a non-finite number) for any score field, the comparison will be false and the invalid result will slip through, potentially breaking reporting/aggregation. Consider validating that both original and augmented score fields are finite numbers (and throwing if not) before comparing.
Suggested change:

    - if (augmented[field] > original[field]) {
    -   throw new Error(
    -     `asyncJudge attempted to increase ${field}: ${original[field]} -> ${augmented[field]}`,
    -   )
    - }
    + const originalValue = original[field]
    + const augmentedValue = augmented[field]
    + if (!Number.isFinite(originalValue)) {
    +   throw new Error(
    +     `original evaluation has invalid ${field}: ${String(originalValue)}`,
    +   )
    + }
    + if (!Number.isFinite(augmentedValue)) {
    +   throw new Error(
    +     `asyncJudge produced invalid ${field}: ${String(augmentedValue)}`,
    +   )
    + }
    + if (augmentedValue > originalValue) {
    +   throw new Error(
    +     `asyncJudge attempted to increase ${field}: ${originalValue} -> ${augmentedValue}`,
    +   )
    + }
    export function assertLlmBudgetRemaining() {
      const cap = process.env.FIDELITYBENCH_MAX_COST_USD
      if (!cap) return
      const max = Number.parseFloat(cap)
      if (!Number.isFinite(max) || max <= 0) return
      const spent = getLlmUsageSummary().estimatedCostUsd
      if (spent >= max) {
        throw new Error(
          `FIDELITYBENCH_MAX_COST_USD exceeded before LLM call: $${spent.toFixed(4)} >= $${max.toFixed(2)}`,
        )
      }
    }
assertLlmBudgetRemaining() uses getLlmUsageSummary().estimatedCostUsd, but getLlmUsageSummary rounds estimatedCostUsd to 4 decimals. This can undercount spent and allow the process to exceed FIDELITYBENCH_MAX_COST_USD by a non-trivial amount over many calls. For budget enforcement, compute spent from the raw records sum (unrounded), and only round for display/reporting.
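A small sketch of the distinction between the raw sum and the display value. The record shape and function names are hypothetical; the rounding expression mirrors the `Math.round(x * 10000) / 10000` seen in the reviewed snippet.

```typescript
type UsageRecord = { estimatedCostUsd: number }

// Unrounded spend, used for enforcement.
function rawSpendUsd(records: UsageRecord[]): number {
  return records.reduce((sum, r) => sum + r.estimatedCostUsd, 0)
}

// Four-decimal rounding, used only for reporting.
function roundForDisplay(usd: number): number {
  return Math.round(usd * 10000) / 10000
}

// Budget check against the raw sum rather than the rounded display value.
function isBudgetExhausted(records: UsageRecord[], capUsd: number): boolean {
  return rawSpendUsd(records) >= capUsd
}

// With a cap equal to the raw spend, the rounded value falls just below the cap.
const exampleRecords: UsageRecord[] = [{ estimatedCostUsd: 0.01004 }]
const exhaustedByRaw = isBudgetExhausted(exampleRecords, 0.01004)
const exhaustedByRounded = roundForDisplay(rawSpendUsd(exampleRecords)) >= 0.01004
```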
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 188b60dbc8
    const spent = getLlmUsageSummary().estimatedCostUsd
    if (spent >= max) {
      throw new Error(
        `FIDELITYBENCH_MAX_COST_USD exceeded before LLM call: $${spent.toFixed(4)} >= $${max.toFixed(2)}`,
      )
Enforce budget after recording each LLM call
assertLlmBudgetRemaining only checks spend before a call, so a run can overshoot FIDELITYBENCH_MAX_COST_USD on its final request and still be treated as valid because there is no post-call guard. This breaks the expected hard-cap behavior (and your prereg invalid-run rule) in cases where spent < cap before the call but spent + call_cost > cap afterward.
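One possible shape for a guard that checks on both sides of a call, as a sketch only: BudgetTracker and its method names are hypothetical and do not exist in the PR. The pre-call check keeps the `>=` comparison from the reviewed code; the post-call check uses `>` so that a run landing exactly on the cap is still valid.

```typescript
// Hypothetical tracker; the PR keeps this state in src/llm/usage.ts instead.
class BudgetTracker {
  private spentUsd = 0

  constructor(private readonly capUsd: number) {}

  // Pre-call guard: refuse to start a call once the cap is reached.
  assertBeforeCall(): void {
    if (this.spentUsd >= this.capUsd) {
      throw new Error(`budget exhausted before call: ${this.spentUsd} >= ${this.capUsd}`)
    }
  }

  // Post-call guard: record the cost, then fail if this call crossed the cap.
  recordAndAssert(callCostUsd: number): void {
    this.spentUsd += callCostUsd
    if (this.spentUsd > this.capUsd) {
      throw new Error(`budget exceeded after call: ${this.spentUsd} > ${this.capUsd}`)
    }
  }

  get spent(): number {
    return this.spentUsd
  }
}
```

A runner that catches the post-call error could mark the run invalid, which is what a hard-cap rule in a preregistration would need.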
    for (let i = 0; i < Math.min(count, indices.length); i += 1) {
      const turn = available[indices[i]!]
      if (turn) selected.push(turn)
Fail fast when overflow sampling cannot supply N turns
The sampler silently truncates each pool with Math.min(count, indices.length), so for larger FIDELITYBENCH_OVERFLOW_N values it can return fewer than the requested N noise turns. buildTimeline then quietly emits a shorter timeline than configured, which changes scenario difficulty while still reporting it as “after N turns of noise,” making cross-run comparisons invalid for those settings.
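A fail-fast variant of the loop might look like the sketch below. The function name takeExactly and the `poolName` parameter are hypothetical; `available`, `indices`, and `count` follow the reviewed snippet.

```typescript
// Select exactly `count` turns or throw; never return a shorter list silently.
function takeExactly<T>(available: T[], indices: number[], count: number, poolName: string): T[] {
  if (count > indices.length) {
    throw new Error(
      `pool "${poolName}" cannot supply ${count} turns (only ${indices.length} available)`,
    )
  }
  const selected: T[] = []
  for (let i = 0; i < count; i += 1) {
    const turn = available[indices[i] as number]
    if (turn === undefined) {
      throw new Error(`pool "${poolName}" has no turn at sampled index ${indices[i]}`)
    }
    selected.push(turn)
  }
  return selected
}
```

With this shape, an oversized FIDELITYBENCH_OVERFLOW_N would abort timeline construction instead of producing a run that is labelled with N noise turns but contains fewer.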