FidelityBench: introduce LLM Judge, v2#1

Merged
avastu merged 7 commits into main from overflow-v0.2 on Apr 28, 2026

@avastu (Owner) commented Apr 28, 2026

No description provided.

avastu added 5 commits April 26, 2026 23:36
The clean Alex judge stored priorOutcomeOrigin via findFirstUserTurnMatching
but never used it — the comment "currently unused but kept for future" was
shipped. As a result the scenario claimed a 7-fact relational constellation
but actually scored 6.

Adds the 9th intent dimension: uses_prior_outcome. Detects whether the
draft mirrors the user's prior successful pushback pattern with Alex
("named the risk early" — explicitly on record as having worked).

Detection is lexical-floor only via PRIOR_OUTCOME_PATTERNS (8 positives /
0 false positives on hand-test fixtures). A future LLM-judge augmentation
can catch paraphrases that don't trip the regex.
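
A minimal sketch of what lexical-floor detection could look like. The two patterns below are hypothetical stand-ins chosen for illustration; the actual contents of PRIOR_OUTCOME_PATTERNS are not shown in this description.

```typescript
// Hypothetical stand-ins: the real PRIOR_OUTCOME_PATTERNS are not shown here.
const EXAMPLE_PRIOR_OUTCOME_PATTERNS: RegExp[] = [
  /\bnam(?:e|ed|ing)\s+the\s+risk\s+early\b/i,
  /\b(?:flagged|raised)\s+the\s+(?:risk|concern)\s+early\b/i,
]

// Lexical floor: the dimension is earned only if at least one pattern matches.
// No "g" flag is used, so RegExp.test() stays stateless across calls.
function usesPriorOutcomeLexical(draft: string): boolean {
  return EXAMPLE_PRIOR_OUTCOME_PATTERNS.some((pattern) => pattern.test(draft))
}
```

A paraphrase such as "I brought up my worry at the start" matches neither pattern, which is the gap the commit message leaves to a future LLM judge.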

No memory-laundering guard: there is no corresponding RecallBurdenCategory
for "prior outcome" yet. Documented inline so a future PR can add
prior_outcome to PATTERNS.

Rubric: maxScore 100->105 (+5 from new dim), maxIntentFidelity 40->45.

Architecture-discriminating overflow variant of alex_pushback_001.

Composition:
  - 7 load-bearing facts (6 originals + supersession event)
  - 80 noise turns from a Gemini-authored corpus (authorship separation
    per spec v2 §3.4; Gemini received pool definitions only, never the
    load-bearing facts themselves)
  - octile placement so windowed-transcript(K=20) misses 5 of 7
    load-bearing facts but still sees the recent constraint + supersession

Configuration:
  FIDELITYBENCH_OVERFLOW_N=80      total noise turns (default 80)
  FIDELITYBENCH_OVERFLOW_SEED=42   deterministic noise sampling + placement

New judge dimensions on top of clean Alex's 9:
  - attribution_fidelity (5pt): no Maya/Sarah/etc confusion with Alex
  - honors_latest_intent (5pt): reflects the post-launch fix-window update
                                 instead of zombie-intent Friday-risky framing

Rubric: 30 task + 55 intent (11 × 5) + 15 recall + 10 clar + 5 tools = 115.
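
The rubric arithmetic above can be checked with a short sketch. The object and constant names are illustrative assumptions; only the point values come from the commit message.

```typescript
// Overflow rubric ceilings as stated in the commit message.
// Names are illustrative; only the numbers come from the source.
const INTENT_DIMENSIONS = 11 // clean Alex's 9 + attribution_fidelity + honors_latest_intent
const POINTS_PER_INTENT_DIMENSION = 5

const overflowCeilings = {
  task: 30,
  intent: INTENT_DIMENSIONS * POINTS_PER_INTENT_DIMENSION,
  recall: 15,
  clar: 10,
  tools: 5,
}

// Sum of all category ceilings gives the scenario's maximum score.
const overflowMaxScore =
  overflowCeilings.task +
  overflowCeilings.intent +
  overflowCeilings.recall +
  overflowCeilings.clar +
  overflowCeilings.tools
```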

Noise corpus audit (PASS):
  - 0 forbidden phrases
  - 0 (Alex, Friday/scope/launch/risk/timeline/unreliable) co-occurrences
  - 53% near-distractor density (target ≥25%)
  - 121 turns total across 9 pools

Spec reference: /tmp/alex_pushback_overflow_001.spec.v2.md

Adversarial review by codex CLI flagged three HIGH-severity scoring
bugs in v0.2 overflow + one MEDIUM in shared prior-outcome detection.
This commit applies all four; the v0.2 pilot already ran with the
unfixed scoring (results retained for comparison; see overnight status
report for caveats).

H1 + H2: honors_latest_intent now requires absence of zombie patterns
  Before: a draft saying "Friday is risky, push to Tuesday, mention a
  post-launch fix window" tripped both LATEST_INTENT_HONOR_PATTERNS and
  ZOMBIE_INTENT_PATTERNS but still received the dim because zombie was
  only noted when !honorsLatest. This rewarded mixed old/new framing.
  Now: honorsLatestHonored = honorsLatest && !zombieIntent && !askedConstraint.
  A new "MIXED INTENT" judge note is emitted for the both-tripped case.
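
The H1 + H2 gate can be sketched as follows. The `IntentSignals` type and the function name are assumptions for illustration; the boolean expression and the note text are taken from the description above.

```typescript
// Hypothetical container for the three regex-derived signals.
type IntentSignals = {
  honorsLatest: boolean // tripped LATEST_INTENT_HONOR_PATTERNS
  zombieIntent: boolean // tripped ZOMBIE_INTENT_PATTERNS
  askedConstraint: boolean
}

function evaluateLatestIntent(signals: IntentSignals): { honored: boolean; notes: string[] } {
  // The dim is earned only when the latest intent is honored AND no stale framing remains.
  const honored = signals.honorsLatest && !signals.zombieIntent && !signals.askedConstraint
  const notes: string[] = []
  // Both pattern families tripped: record the mixed old/new framing for audit.
  if (signals.honorsLatest && signals.zombieIntent) notes.push("MIXED INTENT")
  return { honored, notes }
}
```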

H3: task_success now requires honors_latest_intent for the 30-bracket
  Before: an agent that ignored the supersession entirely could score
  30/30 on task by presenting only the superseded "scope-cut + Tuesday"
  framing, losing only 5 points on the latest-intent dim.
  Now: 30 requires honorsLatestHonored. Stale tradeoff drops to 20.
  Generic engagement drops to 15. No-engagement still 10.
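
A sketch of the H3 bracket logic under stated assumptions. The bracket values (30, 20, 15, 10) come from the text above; the `presentsTradeoff` and `engaged` signals are hypothetical names, since how the real judge distinguishes a stale tradeoff from generic engagement is not shown here.

```typescript
// Hypothetical signals; only the bracket values are from the commit message.
type TaskSignals = {
  honorsLatestHonored: boolean
  presentsTradeoff: boolean
  engaged: boolean
}

function taskSuccessScore(signals: TaskSignals): number {
  // Top bracket now requires honoring the supersession.
  if (signals.presentsTradeoff && signals.honorsLatestHonored) return 30
  // Stale tradeoff: a tradeoff is presented but built on the superseded framing.
  if (signals.presentsTradeoff) return 20
  // Generic engagement without a concrete tradeoff.
  if (signals.engaged) return 15
  // No engagement.
  return 10
}
```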

M3: PRIOR_OUTCOME_PATTERNS tightened in BOTH clean and overflow files
  Three overly-generic patterns dropped: "before (we) commit",
  "better to flag/raise/...", "worth flagging". A normal pushback draft
  was earning the dim without any memory of Alex's prior good response.
  Remaining patterns require explicit early/risk-naming framing OR an
  intent-to-name verb anchored to risk/concern/timeline.

Spec ref: /tmp/codex-findings-v0.2.md

Codex's adversarial review identified that honors_latest_intent is
trivially gameable by lexical regex: a draft saying "I'm concerned
about the post-launch fix window approach" passes the dim because the
phrase "post-launch fix window" appears, even though the draft is
explicitly rejecting the supersession.

The pilot confirmed this. The hybrid agent's draft scored 110/115
under v0.2 lexical scoring, but the actual draft says "I cannot commit
to Friday. Safer path: Tuesday gives us full scope" — clearly NOT
honoring the supersession. The lexical regex couldn't tell.

This commit adds a strict-by-design LLM judge augmentation:

1. New ScenarioAsyncJudge type in src/types.ts. Optional async
   post-processing of a sync judge's result. Safety property: the
   async judge can DOWNGRADE a regex-honored dim to fail, but cannot
   upgrade a regex-failed dim to honored. That keeps paraphrase-
   tolerance from laundering credit.

2. New module src/judges/honorsLatestLLMJudge.ts. Calls the configured
   LLM provider (Bedrock Sonnet 4.5 by default). System prompt
   includes 5 hand-labeled examples (2 honored, 3 not honored)
   covering the lexical-pass-but-semantically-zombie case. Returns
   {honors: bool, evidence: string}.

3. Runner awaits asyncJudge if defined. Errors gracefully fall back
   to the sync result with a "[asyncJudge skipped]" note.

4. Overflow scenario exports asyncJudge that re-evaluates only the
   honors_latest_intent dim, only when lexical regex already passed.
   On LLM downgrade: dim flips to false, intent_fidelity recomputed,
   task_success recomputed under v0.2-rc rules (30-bracket requires
   honors_latest), total recomputed. Adds "LLM JUDGE DOWNGRADE" note
   for audit.
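
The four steps above can be sketched roughly as below. The `Evaluation` and `Verdict` shapes, the constants, and both function names are assumptions; the real ScenarioAsyncJudge type in src/types.ts is not shown in this description. The sketch also assumes a downgraded draft lands in the 20-point "stale tradeoff" bracket.

```typescript
// Hypothetical shapes for illustration only.
type Evaluation = {
  honorsLatestIntent: boolean
  intentFidelity: number
  taskSuccess: number
  total: number
  notes: string[]
}
type Verdict = { honors: boolean; evidence: string }

const DIM_POINTS = 5
const STALE_TRADEOFF_SCORE = 20

// Pure downgrade step: may flip an honored dim to false, never the reverse.
function applyVerdict(base: Evaluation, verdict: Verdict): Evaluation {
  if (!base.honorsLatestIntent || verdict.honors) return base
  const taskSuccess = Math.min(base.taskSuccess, STALE_TRADEOFF_SCORE)
  return {
    honorsLatestIntent: false,
    intentFidelity: base.intentFidelity - DIM_POINTS,
    taskSuccess,
    total: base.total - DIM_POINTS - (base.taskSuccess - taskSuccess),
    notes: [...base.notes, `LLM JUDGE DOWNGRADE: ${verdict.evidence}`],
  }
}

// Runner-side wrapper: skip the LLM when regex already failed, fall back on errors.
async function augmentWithLlmJudge(
  base: Evaluation,
  getVerdict: () => Promise<Verdict>,
): Promise<Evaluation> {
  if (!base.honorsLatestIntent) return base
  try {
    return applyVerdict(base, await getVerdict())
  } catch {
    return { ...base, notes: [...base.notes, "[asyncJudge skipped]"] }
  }
}
```

Keeping the downgrade as a pure function makes the safety property easy to test without any LLM call: a regex-failed evaluation is returned unchanged regardless of the verdict.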

Validation (3 hand-test calls):
- Hybrid agent's actual pilot draft → DOWNGRADED ✓ ("recommends Tuesday
  despite the superseding update")
- Positive control (genuine fix-window endorsement) → HONORED ✓
- Strong negative control (Tuesday safer / Friday risky) → NOT HONORED ✓

Disable with FIDELITYBENCH_DISABLE_LLM_JUDGE=1 if you want to compare
lexical-only vs LLM-augmented scoring on the same trials.

Cost per trial: ~$0.005 (one judge call when lexical passed).
Copilot AI review requested due to automatic review settings April 28, 2026 07:02

Copilot AI left a comment

Pull request overview

This PR introduces an optional “async judge” layer to augment existing regex-based judges with an LLM-based verification step (downgrade-only), adds LLM usage/cost tracking + budgeting, and ships a new Alex overflow scenario plus related artifacts/docs.

Changes:

  • Add ScenarioAsyncJudge + runner support to await best-effort async judge augmentation with downgrade-only enforcement.
  • Add LLM usage recording, cost estimation, and a max-cost budget guard for LLM calls; emit usage in CLI output / JSON artifacts.
  • Add an LLM judge for honors_latest_intent and a new alex_pushback_overflow_001 scenario using it, plus golden checks and exploratory result artifacts/docs.

Reviewed changes

Copilot reviewed 24 out of 26 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/types.ts | Adds ScenarioAsyncJudge and asyncJudgeResults to support async/LLM judge augmentation and auditing. |
| src/runner.ts | Awaits optional asyncJudge and enforces “downgrade-only” semantics (with fallback-on-error behavior). |
| src/llm/usage.ts | New usage/cost tracking module, including budget cap enforcement. |
| src/llm/client.ts | Adds labeling + usage recording hooks and budget check before calls. |
| src/judges/honorsLatestLLMJudge.ts | New LLM-based verifier + parser for honors_latest_intent semantic validation. |
| src/index.ts | Resets usage per run; prints and emits llmUsage; adds optional loading of overflow scenario. |
| src/golden.ts | Adds lightweight golden checks for verdict parsing and the overflow judge. |
| src/agents/WindowedTranscriptLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/TranscriptLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/StatelessLLMAgent.ts | Adds per-call labels for usage attribution. |
| src/agents/GraphMemoryLLMAgent.ts | Adds per-call labels for usage attribution (extract/respond). |
| src/agents/FileMemoryLLMAgent.ts | Adds per-call labels for usage attribution (memory_update/respond). |
| src/agents/BlockMemoryLLMAgent.ts | Adds per-call labels for usage attribution (extract/respond). |
| scenarios/data/alex_pushback_overflow_001.noise.json | Adds the noise corpus used to construct the overflow timeline. |
| scenarios/alex_pushback_overflow_001.ts | New overflow scenario, judge, and downgrade-only async LLM augmentation for honors_latest_intent. |
| scenarios/alex_pushback_001.ts | Adds a new uses_prior_outcome dimension and updates scoring ceilings. |
| results/v0.3-exploratory-N20-windowed.json | Adds exploratory run artifact including llmUsage records. |
| results/v0.3-exploratory-N20-hybrid.json | Adds exploratory run artifact including llmUsage records. |
| results/v0.3-exploratory-N0-windowed.json | Adds exploratory run artifact including llmUsage records. |
| results/v0.3-exploratory-N0-hybrid.json | Adds exploratory run artifact including llmUsage records. |
| package.json | Adds npm run golden script. |
| docs/scorecards/v0.3-exploratory-alex-overflow.md | Documents exploratory smoke results, costs, and commands. |
| docs/prereg/alex-overflow-v0.3.md | Adds prereg candidate protocol including LLM judge + budgeting requirements. |

Comment thread src/llm/usage.ts
Comment on lines +102 to +107
      ).map((entry) => ({
        ...entry,
        estimatedCostUsd: Math.round(entry.estimatedCostUsd * 10000) / 10000,
      })),
      records,
    }

Copilot AI Apr 28, 2026

getLlmUsageSummary() returns the module-level records array directly. Callers can mutate it (e.g., summary.records.push(...) or summary.records.length = 0) which would corrupt usage tracking and budget checks. Consider returning a shallow copy (and/or frozen copy) of records instead.
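
One way to act on this suggestion, sketched with hypothetical names (`UsageRecord`, `usageLog`, `recordUsage`, `getUsageRecordsSnapshot`); the real module's types are not shown in the diff excerpt.

```typescript
// Hypothetical record shape and module-level store.
type UsageRecord = { label: string; estimatedCostUsd: number }

const usageLog: UsageRecord[] = []

function recordUsage(record: UsageRecord): void {
  usageLog.push(record)
}

// Return copies of both the array and each entry, so callers cannot
// corrupt the internal log by mutating what they receive.
function getUsageRecordsSnapshot(): readonly UsageRecord[] {
  return usageLog.map((record) => ({ ...record }))
}
```

A shallow per-entry copy is enough here because the assumed record holds only primitives; nested fields would need a deeper copy.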

Comment thread src/llm/client.ts
Comment on lines 15 to 21
    export type LlmCallOptions = {
      messages: LlmMessage[]
      responseFormat?: "text" | "json_object"
      temperature?: number
      maxTokens?: number
      label?: string
    }

Copilot AI Apr 28, 2026

LlmCallOptions includes responseFormat, but callLlm/callAnthropic/callOpenAi/callBedrock never use it when calling generateText. With the new LLM judge relying on strict JSON, this option being a no-op is easy to misinterpret. Either plumb responseFormat through to the underlying SDK (if supported) or remove/rename it to avoid a misleading API surface.

Comment thread src/runner.ts
Comment on lines +149 to +155
    let augmented = result
    if (asyncJudge) {
      try {
        augmented = enforceAsyncJudgeDowngradeOnly(
          result,
          await asyncJudge(result),
        )

Copilot AI Apr 28, 2026

asyncJudge is invoked with the same result object that is later used as the original baseline in enforceAsyncJudgeDowngradeOnly. If an async judge mutates the passed-in object (even accidentally) and returns it, the baseline will already be modified and the downgrade-only enforcement can be bypassed. Consider passing a deep-cloned (or frozen) copy into asyncJudge, and comparing against an immutable snapshot of the original result.
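
A minimal sketch of the snapshot idea, assuming score objects are plain JSON-serializable data. `ScoreFields`, `snapshotScores`, and `assertDowngradeOnly` are hypothetical names, not the runner's actual helpers. The baseline is captured before the judge runs, so a judge that mutates its input cannot move the comparison target.

```typescript
// Hypothetical subset of the evaluation's numeric fields.
type ScoreFields = { taskSuccess: number; intentFidelity: number; total: number }

// JSON round-trip gives an independent copy of plain data; freeze guards the baseline.
function snapshotScores(result: ScoreFields): Readonly<ScoreFields> {
  return Object.freeze(JSON.parse(JSON.stringify(result)) as ScoreFields)
}

// Compare the judge's output against the immutable baseline, not the live object.
function assertDowngradeOnly(baseline: Readonly<ScoreFields>, augmented: ScoreFields): void {
  const fields: (keyof ScoreFields)[] = ["taskSuccess", "intentFidelity", "total"]
  for (const field of fields) {
    if (augmented[field] > baseline[field]) {
      throw new Error(
        `asyncJudge attempted to increase ${field}: ${baseline[field]} -> ${augmented[field]}`,
      )
    }
  }
}
```

In a runner, the snapshot would be taken first, a second copy passed to the async judge, and the check applied to whatever the judge returns.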

Comment thread src/runner.ts Outdated
Comment on lines +182 to +186
    if (augmented[field] > original[field]) {
      throw new Error(
        `asyncJudge attempted to increase ${field}: ${original[field]} -> ${augmented[field]}`,
      )
    }

Copilot AI Apr 28, 2026

enforceAsyncJudgeDowngradeOnly only checks augmented[field] > original[field]. If an async judge returns NaN (or a non-finite number) for any score field, the comparison will be false and the invalid result will slip through, potentially breaking reporting/aggregation. Consider validating that both original and augmented score fields are finite numbers (and throwing if not) before comparing.

Suggested change

    - if (augmented[field] > original[field]) {
    -   throw new Error(
    -     `asyncJudge attempted to increase ${field}: ${original[field]} -> ${augmented[field]}`,
    -   )
    - }
    + const originalValue = original[field]
    + const augmentedValue = augmented[field]
    + if (!Number.isFinite(originalValue)) {
    +   throw new Error(
    +     `original evaluation has invalid ${field}: ${String(originalValue)}`,
    +   )
    + }
    + if (!Number.isFinite(augmentedValue)) {
    +   throw new Error(
    +     `asyncJudge produced invalid ${field}: ${String(augmentedValue)}`,
    +   )
    + }
    + if (augmentedValue > originalValue) {
    +   throw new Error(
    +     `asyncJudge attempted to increase ${field}: ${originalValue} -> ${augmentedValue}`,
    +   )
    + }

Comment thread src/llm/usage.ts Outdated
Comment on lines +114 to +125
    export function assertLlmBudgetRemaining() {
      const cap = process.env.FIDELITYBENCH_MAX_COST_USD
      if (!cap) return
      const max = Number.parseFloat(cap)
      if (!Number.isFinite(max) || max <= 0) return
      const spent = getLlmUsageSummary().estimatedCostUsd
      if (spent >= max) {
        throw new Error(
          `FIDELITYBENCH_MAX_COST_USD exceeded before LLM call: $${spent.toFixed(4)} >= $${max.toFixed(2)}`,
        )
      }
    }

Copilot AI Apr 28, 2026

assertLlmBudgetRemaining() uses getLlmUsageSummary().estimatedCostUsd, but getLlmUsageSummary rounds estimatedCostUsd to 4 decimals. This can undercount spent and allow the process to exceed FIDELITYBENCH_MAX_COST_USD by a non-trivial amount over many calls. For budget enforcement, compute spent from the raw records sum (unrounded), and only round for display/reporting.
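
A sketch of the separation this comment asks for: enforcement reads the unrounded sum, and rounding happens only for display. `CostRecord` and the three function names are hypothetical; the cap is passed in as a parameter rather than read from the environment to keep the example self-contained.

```typescript
// Hypothetical record shape; only the cost field matters here.
type CostRecord = { estimatedCostUsd: number }

// Unrounded total, used for enforcement.
function rawSpentUsd(records: readonly CostRecord[]): number {
  return records.reduce((sum, record) => sum + record.estimatedCostUsd, 0)
}

// Four-decimal rounding, used only when printing or emitting reports.
function roundUsdForDisplay(usd: number): number {
  return Math.round(usd * 10000) / 10000
}

// Budget check against the raw sum, mirroring the `spent >= max` comparison in the diff.
function isBudgetExhausted(records: readonly CostRecord[], capUsd: number): boolean {
  return rawSpentUsd(records) >= capUsd
}
```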

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 188b60dbc8

Comment thread src/llm/usage.ts Outdated
Comment on lines +119 to +123
    const spent = getLlmUsageSummary().estimatedCostUsd
    if (spent >= max) {
      throw new Error(
        `FIDELITYBENCH_MAX_COST_USD exceeded before LLM call: $${spent.toFixed(4)} >= $${max.toFixed(2)}`,
      )

P2: Enforce budget after recording each LLM call

assertLlmBudgetRemaining only checks spend before a call, so a run can overshoot FIDELITYBENCH_MAX_COST_USD on its final request and still be treated as valid because there is no post-call guard. This breaks the expected hard-cap behavior (and your prereg invalid-run rule) in cases where spent < cap before the call but spent + call_cost > cap afterward.
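
A sketch of a guard that checks both before and after each call. `createBudgetGuard` and its methods are hypothetical names. Throwing on the post-call check is one possible policy; marking the run invalid instead would fit the prereg rule equally well.

```typescript
// Hypothetical guard holding the running spend in a closure.
function createBudgetGuard(capUsd: number) {
  let spentUsd = 0
  return {
    // Pre-call check: refuse to start a call once the cap is reached.
    assertBeforeCall(): void {
      if (spentUsd >= capUsd) {
        throw new Error(`budget exhausted before call: ${spentUsd} >= ${capUsd}`)
      }
    },
    // Post-call check: record the cost, then fail if this call pushed spend over the cap.
    recordCall(costUsd: number): void {
      spentUsd += costUsd
      if (spentUsd > capUsd) {
        throw new Error(`budget exceeded after call: ${spentUsd} > ${capUsd}`)
      }
    },
    spent(): number {
      return spentUsd
    },
  }
}
```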

Comment on lines +139 to +141
    for (let i = 0; i < Math.min(count, indices.length); i += 1) {
      const turn = available[indices[i]!]
      if (turn) selected.push(turn)

P2: Fail fast when overflow sampling cannot supply N turns

The sampler silently truncates each pool with Math.min(count, indices.length), so for larger FIDELITYBENCH_OVERFLOW_N values it can return fewer than the requested N noise turns. buildTimeline then quietly emits a shorter timeline than configured, which changes scenario difficulty while still reporting it as “after N turns of noise,” making cross-run comparisons invalid for those settings.
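
A fail-fast variant of the loop might look like the sketch below. `takeOrThrow` is a hypothetical helper; the parameter names follow the diff excerpt, but the surrounding sampler is not shown, so this is an illustration rather than a drop-in fix.

```typescript
// Select exactly `count` items in the order given by `indices`, or throw.
function takeOrThrow<T>(available: readonly T[], indices: readonly number[], count: number): T[] {
  if (count > indices.length) {
    throw new Error(`requested ${count} turns but only ${indices.length} are available`)
  }
  const selected: T[] = []
  for (let i = 0; i < count; i += 1) {
    const index = indices[i]
    const turn = index === undefined ? undefined : available[index]
    // An out-of-range index is also an error, instead of being skipped silently.
    if (turn === undefined) {
      throw new Error(`index ${String(index)} does not resolve to a turn`)
    }
    selected.push(turn)
  }
  return selected
}
```

With this shape, a too-large FIDELITYBENCH_OVERFLOW_N surfaces as an error at timeline construction instead of producing a shorter timeline than reported.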

@avastu avastu merged commit 5a082c2 into main Apr 28, 2026
2 checks passed

2 participants