FidelityBench: introduce LLM Judge, v2 by avastu · Pull Request #1 · avastu/FidelityBench

avastu · 2026-04-28T07:02:28Z

No description provided.

The clean Alex judge stored priorOutcomeOrigin via findFirstUserTurnMatching but never used it; the comment "currently unused but kept for future" was shipped. As a result the scenario claimed a 7-fact relational constellation but actually scored 6.

Adds the 9th intent dimension: uses_prior_outcome. It detects whether the draft mirrors the user's prior successful pushback pattern with Alex ("named the risk early", explicitly on record as having worked).

Detection is lexical-floor only via PRIOR_OUTCOME_PATTERNS (8 positives / 0 false positives on hand-test fixtures). A future LLM-judge augmentation can catch paraphrases that don't trip the regex.

No memory-laundering guard: there is no corresponding RecallBurdenCategory for "prior outcome" yet. Documented inline so a future PR can add prior_outcome to PATTERNS.

Rubric: maxScore 100->105 (+5 from new dim), maxIntentFidelity 40->45.
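As a rough illustration of what a lexical-floor dimension like this looks like, here is a minimal sketch. The two regexes are hypothetical stand-ins (the real PRIOR_OUTCOME_PATTERNS are not shown on this page), and `usesPriorOutcome` and `intentFidelity` are illustrative names, not the repo's functions.

```typescript
// Hypothetical patterns for illustration only; the real
// PRIOR_OUTCOME_PATTERNS list is not reproduced in this PR text.
const PRIOR_OUTCOME_PATTERNS_SKETCH: RegExp[] = [
  /\bnam(?:e|ed|ing)\s+the\s+risk\s+early\b/i,
  /\b(?:flag|raise|name)\s+(?:this|the)\s+(?:risk|concern)\s+(?:early|up\s*front)\b/i,
];

// Lexical floor: the dimension is earned only if some pattern matches.
// No `g` flag is used, so RegExp.test stays stateless across calls.
function usesPriorOutcome(draft: string): boolean {
  return PRIOR_OUTCOME_PATTERNS_SKETCH.some((pattern) => pattern.test(draft));
}

// Each honored intent dimension is worth a fixed number of points,
// so 9 dimensions at 5 points each give the new ceiling of 45.
function intentFidelity(
  dims: Record<string, boolean>,
  pointsPerDim = 5,
): number {
  return Object.keys(dims).filter((k) => dims[k] === true).length * pointsPerDim;
}
```

A paraphrase such as "I brought up my worry at the start" would not match either pattern, which is the gap the commit message leaves to a future LLM judge.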

Architecture-discriminating overflow variant of alex_pushback_001.

Composition:
- 7 load-bearing facts (6 originals + supersession event)
- 80 noise turns from a Gemini-authored corpus (authorship separation per spec v2 §3.4; Gemini received pool definitions only, never the load-bearing facts themselves)
- octile placement so windowed-transcript(K=20) misses 5 of 7 load-bearing facts but still sees the recent constraint + supersession

Configuration:
  FIDELITYBENCH_OVERFLOW_N=80      total noise turns (default 80)
  FIDELITYBENCH_OVERFLOW_SEED=42   deterministic noise sampling + placement

New judge dimensions on top of clean Alex's 9:
- attribution_fidelity (5pt): no Maya/Sarah/etc confusion with Alex
- honors_latest_intent (5pt): reflects the post-launch fix-window update instead of zombie-intent Friday-risky framing

Rubric: 30 task + 55 intent (11 × 5) + 15 recall + 10 clar + 5 tools = 115.

Noise corpus audit (PASS):
- 0 forbidden phrases
- 0 (Alex, Friday/scope/launch/risk/timeline/unreliable) co-occurrences
- 53% near-distractor density (target ≥25%)
- 121 turns total across 9 pools

Spec reference: /tmp/alex_pushback_overflow_001.spec.v2.md
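The rubric total and the windowing claim can be sanity-checked with a small sketch. The fact positions in the usage note are made up for illustration (the actual octile placement is not given here), and the sketch assumes one turn per load-bearing fact; `rubricMax` and `visibleInWindow` are hypothetical helper names.

```typescript
// Rubric components as stated in the commit message.
const OVERFLOW_RUBRIC = {
  task: 30,
  intent: 11 * 5, // 11 intent dimensions at 5 points each
  recall: 15,
  clarification: 10,
  tools: 5,
};

// Sum the components of a rubric.
function rubricMax(rubric: Record<string, number>): number {
  return Object.keys(rubric).reduce((sum, k) => sum + (rubric[k] ?? 0), 0);
}

// Count how many fact turns fall inside the last `k` turns of a timeline,
// which is all a windowed-transcript agent would see.
function visibleInWindow(
  factPositions: readonly number[],
  timelineLength: number,
  k: number,
): number {
  const firstVisible = Math.max(0, timelineLength - k);
  return factPositions.filter((p) => p >= firstVisible).length;
}
```

With hypothetical positions `[0, 11, 22, 33, 44, 70, 85]` in an 87-turn timeline, `visibleInWindow(positions, 87, 20)` counts 2 visible facts, i.e. 5 of 7 missed, which is the shape of outcome the placement is designed to produce.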

Adversarial review by codex CLI flagged three HIGH-severity scoring bugs in v0.2 overflow + one MEDIUM in shared prior-outcome detection. This commit applies all four; the v0.2 pilot already ran with the unfixed scoring (results retained for comparison; see overnight status report for caveats).

H1 + H2: honors_latest_intent now requires absence of zombie patterns

Before: a draft saying "Friday is risky, push to Tuesday, mention a post-launch fix window" tripped both LATEST_INTENT_HONOR_PATTERNS and ZOMBIE_INTENT_PATTERNS but still received the dim because zombie was only noted when !honorsLatest. This rewarded mixed old/new framing.

Now: honorsLatestHonored = honorsLatest && !zombieIntent && !askedConstraint. A new "MIXED INTENT" judge note is emitted for the both-tripped case.

H3: task_success now requires honors_latest_intent for the 30-bracket

Before: an agent that ignored the supersession entirely could score 30/30 on task by presenting only the superseded "scope-cut + Tuesday" framing, losing only 5 points on the latest-intent dim.

Now: 30 requires honorsLatestHonored. Stale tradeoff drops to 20. Generic engagement drops to 15. No-engagement still 10.

M3: PRIOR_OUTCOME_PATTERNS tightened in BOTH clean and overflow files

Three overly-generic patterns dropped: "before (we) commit", "better to flag/raise/...", "worth flagging". A normal pushback draft was earning the dim without any memory of Alex's prior good response. Remaining patterns require explicit early/risk-naming framing OR an intent-to-name verb anchored to risk/concern/timeline.

Spec ref: /tmp/codex-findings-v0.2.md
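The H1 to H3 rules can be read as a small decision table. The sketch below is one possible encoding, not the scenario's actual code: the `Engagement` categories and the function names are assumptions, and it assumes generic engagement scores 15 regardless of the latest-intent dim, which the commit message does not spell out.

```typescript
type IntentSignals = {
  honorsLatest: boolean; // a latest-intent pattern matched
  zombieIntent: boolean; // a superseded-intent pattern matched
  askedConstraint: boolean; // the draft asked for the constraint instead of using it
};

// H1 + H2: the dim is earned only when the latest intent is honored
// and nothing contradicts it.
function honorsLatestHonored(s: IntentSignals): boolean {
  return s.honorsLatest && !s.zombieIntent && !s.askedConstraint;
}

// The both-tripped case gets an audit note instead of credit.
function mixedIntentNote(s: IntentSignals): string | null {
  return s.honorsLatest && s.zombieIntent ? "MIXED INTENT" : null;
}

// Hypothetical engagement categories for the task_success brackets.
type Engagement = "tradeoff" | "generic" | "none";

// H3: the 30-bracket requires the latest-intent dim; a stale tradeoff gets 20.
function taskSuccess(engagement: Engagement, honored: boolean): number {
  if (engagement === "none") return 10;
  if (engagement === "generic") return 15;
  return honored ? 30 : 20;
}
```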

Codex's adversarial review identified that honors_latest_intent is trivially gameable by lexical regex: a draft saying "I'm concerned about the post-launch fix window approach" passes the dim because the phrase "post-launch fix window" appears, even though the draft is explicitly rejecting the supersession.

The pilot confirmed this. The hybrid agent's draft scored 110/115 under v0.2 lexical scoring, but the actual draft says "I cannot commit to Friday. Safer path: Tuesday gives us full scope", which is clearly NOT honoring the supersession. The lexical regex couldn't tell.

This commit adds a strict-by-design LLM judge augmentation:

1. New ScenarioAsyncJudge type in src/types.ts. Optional async post-processing of a sync judge's result. Safety property: the async judge can DOWNGRADE a regex-honored dim to fail, but cannot upgrade a regex-failed dim to honored. That keeps paraphrase-tolerance from laundering credit.
2. New module src/judges/honorsLatestLLMJudge.ts. Calls the configured LLM provider (Bedrock Sonnet 4.5 by default). System prompt includes 5 hand-labeled examples (2 honored, 3 not honored) covering the lexical-pass-but-semantically-zombie case. Returns {honors: bool, evidence: string}.
3. Runner awaits asyncJudge if defined. Errors gracefully fall back to the sync result with a "[asyncJudge skipped]" note.
4. Overflow scenario exports asyncJudge that re-evaluates only the honors_latest_intent dim, only when lexical regex already passed. On LLM downgrade: dim flips to false, intent_fidelity recomputed, task_success recomputed under v0.2-rc rules (30-bracket requires honors_latest), total recomputed. Adds "LLM JUDGE DOWNGRADE" note for audit.
Validation (3 hand-test calls):
- Hybrid agent's actual pilot draft → DOWNGRADED ✓ ("recommends Tuesday despite the superseding update")
- Positive control (genuine fix-window endorsement) → HONORED ✓
- Strong negative control (Tuesday safer / Friday risky) → NOT HONORED ✓

Disable with FIDELITYBENCH_DISABLE_LLM_JUDGE=1 if you want to compare lexical-only vs LLM-augmented scoring on the same trials.

Cost per trial: ~$0.005 (one judge call when lexical passed).
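To make the downgrade-only property concrete, here is a minimal sketch of parsing a `{honors, evidence}` verdict and combining it with the lexical result. `parseVerdict` and `applyVerdict` are hypothetical names and the parsing strategy is an assumption; the real module's parser may differ.

```typescript
type Verdict = { honors: boolean; evidence: string };

// Extract the first {...} span from the model output and validate its shape.
// Returns null on anything unexpected so the caller can fall back.
function parseVerdict(raw: string): Verdict | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start < 0 || end <= start) return null;
  try {
    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    if (typeof parsed !== "object" || parsed === null) return null;
    const { honors, evidence } = parsed as Record<string, unknown>;
    if (typeof honors !== "boolean" || typeof evidence !== "string") return null;
    return { honors, evidence };
  } catch {
    return null;
  }
}

// Downgrade-only combination: a regex-failed dim can never become honored,
// and an unusable verdict leaves the lexical result untouched.
function applyVerdict(lexicalHonored: boolean, verdict: Verdict | null): boolean {
  if (!lexicalHonored) return false;
  if (verdict === null) return lexicalHonored;
  return verdict.honors;
}
```

Because `applyVerdict` returns `false` before it ever looks at the verdict when the lexical check failed, a permissive judge cannot launder credit for paraphrases the regex missed.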

Copilot

Pull request overview

This PR introduces an optional “async judge” layer to augment existing regex-based judges with an LLM-based verification step (downgrade-only), adds LLM usage/cost tracking + budgeting, and ships a new Alex overflow scenario plus related artifacts/docs.

Changes:

- Add ScenarioAsyncJudge + runner support to await best-effort async judge augmentation with downgrade-only enforcement.
- Add LLM usage recording, cost estimation, and a max-cost budget guard for LLM calls; emit usage in CLI output / JSON artifacts.
- Add an LLM judge for honors_latest_intent and a new alex_pushback_overflow_001 scenario using it, plus golden checks and exploratory result artifacts/docs.

Reviewed changes

Copilot reviewed 24 out of 26 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/types.ts | Adds `ScenarioAsyncJudge` and `asyncJudgeResults` to support async/LLM judge augmentation and auditing. |
| src/runner.ts | Awaits optional `asyncJudge` and enforces “downgrade-only” semantics (with fallback-on-error behavior). |
| src/llm/usage.ts | New usage/cost tracking module, including budget cap enforcement. |
| src/llm/client.ts | Adds labeling + usage recording hooks and budget check before calls. |
| src/judges/honorsLatestLLMJudge.ts | New LLM-based verifier + parser for `honors_latest_intent` semantic validation. |
| src/index.ts | Resets usage per run; prints and emits `llmUsage`; adds optional loading of overflow scenario. |
| src/golden.ts | Adds lightweight golden checks for verdict parsing and the overflow judge. |
| src/agents/WindowedTranscriptLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/TranscriptLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/StatelessLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/GraphMemoryLLMAgent.ts | Adds per-call labels for usage attribution (extract/respond). |
| src/agents/FileMemoryLLMAgent.ts | Adds per-call labels for usage attribution (memory_update/respond). |
| src/agents/BlockMemoryLLMAgent.ts | Adds per-call labels for usage attribution (extract/respond). |
| scenarios/data/alex_pushback_overflow_001.noise.json | Adds the noise corpus used to construct the overflow timeline. |
| scenarios/alex_pushback_overflow_001.ts | New overflow scenario, judge, and downgrade-only async LLM augmentation for `honors_latest_intent`. |
| scenarios/alex_pushback_001.ts | Adds a new `uses_prior_outcome` dimension and updates scoring ceilings. |
| results/v0.3-exploratory-N20-windowed.json | Adds exploratory run artifact including `llmUsage` records. |
| results/v0.3-exploratory-N20-hybrid.json | Adds exploratory run artifact including `llmUsage` records. |
| results/v0.3-exploratory-N0-windowed.json | Adds exploratory run artifact including `llmUsage` records. |
| results/v0.3-exploratory-N0-hybrid.json | Adds exploratory run artifact including `llmUsage` records. |
| package.json | Adds `npm run golden` script. |
| docs/scorecards/v0.3-exploratory-alex-overflow.md | Documents exploratory smoke results, costs, and commands. |
| docs/prereg/alex-overflow-v0.3.md | Adds prereg candidate protocol including LLM judge + budgeting requirements. |

Copilot · 2026-04-28T07:06:40Z

+    ).map((entry) => ({
+      ...entry,
+      estimatedCostUsd: Math.round(entry.estimatedCostUsd * 10000) / 10000,
+    })),
+    records,
+  }


getLlmUsageSummary() returns the module-level records array directly. Callers can mutate it (e.g., summary.records.push(...) or summary.records.length = 0) which would corrupt usage tracking and budget checks. Consider returning a shallow copy (and/or frozen copy) of records instead.
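One way to read this suggestion, sketched with a simplified record shape: the `UsageRecord` fields and the function names below are assumptions for illustration, not the actual contents of src/llm/usage.ts.

```typescript
// Simplified, hypothetical record shape.
type UsageRecord = { label: string; estimatedCostUsd: number };

// Module-level ledger, private to this module.
const usageRecords: UsageRecord[] = [];

function recordUsage(record: UsageRecord): void {
  usageRecords.push(record);
}

// Returns a summary whose `records` array (and each entry) is a copy,
// so callers cannot truncate or edit the internal ledger.
function getUsageSummary(): { estimatedCostUsd: number; records: UsageRecord[] } {
  const total = usageRecords.reduce((sum, r) => sum + r.estimatedCostUsd, 0);
  return {
    estimatedCostUsd: Math.round(total * 10000) / 10000, // rounded for display only
    records: usageRecords.map((r) => ({ ...r })),
  };
}
```

Typing the returned array as `ReadonlyArray<UsageRecord>` would additionally turn accidental mutation into a compile-time error, at the cost of being easy to cast away.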

Copilot · 2026-04-28T07:06:41Z

 export type LlmCallOptions = {
  messages: LlmMessage[]
  responseFormat?: "text" | "json_object"
  temperature?: number
  maxTokens?: number
+  label?: string
 }


LlmCallOptions includes responseFormat, but callLlm/callAnthropic/callOpenAi/callBedrock never use it when calling generateText. With the new LLM judge relying on strict JSON, this option being a no-op is easy to misinterpret. Either plumb responseFormat through to the underlying SDK (if supported) or remove/rename it to avoid a misleading API surface.

Copilot · 2026-04-28T07:06:41Z

+  let augmented = result
+  if (asyncJudge) {
+    try {
+      augmented = enforceAsyncJudgeDowngradeOnly(
+        result,
+        await asyncJudge(result),
+      )


asyncJudge is invoked with the same result object that is later used as the original baseline in enforceAsyncJudgeDowngradeOnly. If an async judge mutates the passed-in object (even accidentally) and returns it, the baseline will already be modified and the downgrade-only enforcement can be bypassed. Consider passing a deep-cloned (or frozen) copy into asyncJudge, and comparing against an immutable snapshot of the original result.
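A minimal sketch of the isolation idea follows, assuming evaluation results are plain JSON-serializable data. `EvalResult`, `snapshot`, and `runAsyncJudgeIsolated` are hypothetical names, and the `enforce` parameter stands in for enforceAsyncJudgeDowngradeOnly.

```typescript
// Hypothetical, simplified result shape.
type EvalResult = {
  totalScore: number;
  intentDimensions: Record<string, boolean>;
};

// Deep copy via a JSON round-trip. Assumes no functions, Dates, or cycles.
function snapshot<T>(value: T): T {
  return JSON.parse(JSON.stringify(value)) as T;
}

// The judge receives its own copy, and the baseline used for the
// downgrade-only comparison is a separate snapshot the judge never sees.
async function runAsyncJudgeIsolated(
  result: EvalResult,
  asyncJudge: (r: EvalResult) => Promise<EvalResult>,
  enforce: (original: EvalResult, augmented: EvalResult) => EvalResult,
): Promise<EvalResult> {
  const baseline = snapshot(result);
  const augmented = await asyncJudge(snapshot(result));
  return enforce(baseline, augmented);
}
```

With this shape, a judge that mutates its input and returns it can only alter its private copy, so the comparison against `baseline` still catches any score increase.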

Copilot · 2026-04-28T07:06:41Z

+    if (augmented[field] > original[field]) {
+      throw new Error(
+        `asyncJudge attempted to increase ${field}: ${original[field]} -> ${augmented[field]}`,
+      )
+    }


enforceAsyncJudgeDowngradeOnly only checks augmented[field] > original[field]. If an async judge returns NaN (or a non-finite number) for any score field, the comparison will be false and the invalid result will slip through, potentially breaking reporting/aggregation. Consider validating that both original and augmented score fields are finite numbers (and throwing if not) before comparing.

Suggested change

-    if (augmented[field] > original[field]) {
-      throw new Error(
-        `asyncJudge attempted to increase ${field}: ${original[field]} -> ${augmented[field]}`,
-      )
-    }
+    const originalValue = original[field]
+    const augmentedValue = augmented[field]
+    if (!Number.isFinite(originalValue)) {
+      throw new Error(
+        `original evaluation has invalid ${field}: ${String(originalValue)}`,
+      )
+    }
+    if (!Number.isFinite(augmentedValue)) {
+      throw new Error(
+        `asyncJudge produced invalid ${field}: ${String(augmentedValue)}`,
+      )
+    }
+    if (augmentedValue > originalValue) {
+      throw new Error(
+        `asyncJudge attempted to increase ${field}: ${originalValue} -> ${augmentedValue}`,
+      )
+    }

Copilot · 2026-04-28T07:06:41Z

+export function assertLlmBudgetRemaining() {
+  const cap = process.env.FIDELITYBENCH_MAX_COST_USD
+  if (!cap) return
+  const max = Number.parseFloat(cap)
+  if (!Number.isFinite(max) || max <= 0) return
+  const spent = getLlmUsageSummary().estimatedCostUsd
+  if (spent >= max) {
+    throw new Error(
+      `FIDELITYBENCH_MAX_COST_USD exceeded before LLM call: $${spent.toFixed(4)} >= $${max.toFixed(2)}`,
+    )
+  }
+}


assertLlmBudgetRemaining() uses getLlmUsageSummary().estimatedCostUsd, but getLlmUsageSummary rounds estimatedCostUsd to 4 decimals. This can undercount spent and allow the process to exceed FIDELITYBENCH_MAX_COST_USD by a non-trivial amount over many calls. For budget enforcement, compute spent from the raw records sum (unrounded), and only round for display/reporting.
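The separation between enforcement and display could look roughly like the sketch below. It takes the records and cap as parameters instead of reading module state or `process.env`, which is a simplification for illustration; all names are hypothetical.

```typescript
type CostRecord = { estimatedCostUsd: number };

// Unrounded total, used for enforcement.
function rawSpentUsd(records: readonly CostRecord[]): number {
  return records.reduce((sum, r) => sum + r.estimatedCostUsd, 0);
}

// Four-decimal rounding, used only when printing or serializing.
function roundForDisplay(usd: number): number {
  return Math.round(usd * 10000) / 10000;
}

// Pre-call guard that compares the raw total against the cap.
function assertBudgetRemaining(
  records: readonly CostRecord[],
  capUsd: number | undefined,
): void {
  if (capUsd === undefined || !Number.isFinite(capUsd) || capUsd <= 0) return;
  const spent = rawSpentUsd(records);
  if (spent >= capUsd) {
    throw new Error(
      `budget exceeded before LLM call: $${spent.toFixed(6)} >= $${capUsd.toFixed(6)}`,
    );
  }
}
```

The difference only shows at small magnitudes: a raw spend of 0.00004 rounds to 0 for display, so a guard built on the rounded figure would let a call through that the raw comparison rejects.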

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 188b60dbc8

chatgpt-codex-connector · 2026-04-28T07:08:03Z

+  const spent = getLlmUsageSummary().estimatedCostUsd
+  if (spent >= max) {
+    throw new Error(
+      `FIDELITYBENCH_MAX_COST_USD exceeded before LLM call: $${spent.toFixed(4)} >= $${max.toFixed(2)}`,
+    )


Enforce budget after recording each LLM call

assertLlmBudgetRemaining only checks spend before a call, so a run can overshoot FIDELITYBENCH_MAX_COST_USD on its final request and still be treated as valid because there is no post-call guard. This breaks the expected hard-cap behavior (and your prereg invalid-run rule) in cases where spent < cap before the call but spent + call_cost > cap afterward.
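A post-call guard along these lines might be sketched as follows. The ledger is a plain array of per-call costs and `recordCallAndEnforce` is a hypothetical name; whether an overshoot should abort the run or only mark it invalid is a policy choice this sketch does not settle.

```typescript
// Record the cost of a finished call, then check the cap against the new total.
// The call is recorded first so the ledger stays accurate even when the guard trips.
function recordCallAndEnforce(
  ledger: number[],
  callCostUsd: number,
  capUsd: number,
): void {
  ledger.push(callCostUsd);
  const spent = ledger.reduce((sum, cost) => sum + cost, 0);
  if (spent > capUsd) {
    throw new Error(
      `budget exceeded after LLM call: $${spent.toFixed(4)} > $${capUsd.toFixed(2)}`,
    );
  }
}
```

For example, with 0.009 already spent and a cap of 0.01, a pre-call check passes, but recording a 0.005 call brings the total to 0.014 and the post-call guard throws.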

chatgpt-codex-connector · 2026-04-28T07:08:03Z

+    for (let i = 0; i < Math.min(count, indices.length); i += 1) {
+      const turn = available[indices[i]!]
+      if (turn) selected.push(turn)


Fail fast when overflow sampling cannot supply N turns

The sampler silently truncates each pool with Math.min(count, indices.length), so for larger FIDELITYBENCH_OVERFLOW_N values it can return fewer than the requested N noise turns. buildTimeline then quietly emits a shorter timeline than configured, which changes scenario difficulty while still reporting it as “after N turns of noise,” making cross-run comparisons invalid for those settings.
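A fail-fast variant of the sampling loop could look like this sketch. The signature and the `poolName` parameter are assumptions for illustration; the real sampler's inputs are only partially visible in the diff above.

```typescript
// Select `count` turns from a pool using a pre-shuffled index order.
// Throws instead of silently returning fewer turns than requested.
function samplePool<T>(
  available: readonly T[],
  indices: readonly number[],
  count: number,
  poolName: string,
): T[] {
  if (count > indices.length) {
    throw new Error(
      `pool "${poolName}" cannot supply ${count} turns (only ${indices.length} available)`,
    );
  }
  const selected: T[] = [];
  for (let i = 0; i < count; i += 1) {
    const index = indices[i];
    const turn = index === undefined ? undefined : available[index];
    if (turn === undefined) {
      throw new Error(`pool "${poolName}" has no turn at sampled position ${i}`);
    }
    selected.push(turn);
  }
  return selected;
}
```

Failing at construction time means a run with an oversized FIDELITYBENCH_OVERFLOW_N stops immediately, so it can never be reported under a noise count it did not actually use.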

avastu added 5 commits April 26, 2026 23:36

Harden Alex overflow eval and add v0.3 exploratory scorecard

188b60d

Copilot AI review requested due to automatic review settings April 28, 2026 07:02

Copilot started reviewing on behalf of avastu April 28, 2026 07:02

Document v0.3 eval hardening in README

353c263

Copilot AI reviewed Apr 28, 2026

chatgpt-codex-connector Bot reviewed Apr 28, 2026

Address PR review feedback on eval guards

5909180

avastu merged commit 5a082c2 into main Apr 28, 2026
2 checks passed

avastu merged 7 commits into main from overflow-v0.2
