Skip to content

fix(loop): never feed a no-signal judge result to the generator as a critique#58

Merged
agjs merged 3 commits into
mainfrom
fix/judge-no-signal
Jun 28, 2026
Merged

fix(loop): never feed a no-signal judge result to the generator as a critique#58
agjs merged 3 commits into
mainfrom
fix/judge-no-signal

Conversation

@agjs

@agjs agjs commented Jun 28, 2026

Copy link
Copy Markdown
Owner

Found by running real evals (you asked me to — and it paid off). A 3-seed sweep against the local DeepSeek-V4-Flash cluster passed the gate on all three, but fix-regression came back Q0/5. The cause wasn't the solution — it was a harness bug.

The bug

The quality loop judged the (green, correct) solution; the judge returned an unparseable response — a judge infrastructure failure, not a real score. The loop then handed the generator:

"The code is green but a senior reviewer scored it 0/5: "unparseable judge response". Improve the code to address that critique."

The model can't act on a non-critique. In the run.log it spiraled — pages of "the reviewer says it's broken but the code looks correct… maybe it's a system glitch… I'll make no changes" — and burned both quality attempts for zero gain. A pure friction sink, squarely against the lift-local-models north star.

The fix

  • IJudgeScore gains a scored flag — false when the judge produced no usable score (unparseable / errored).
  • qualityRepair returns early when the initial score is unscored: keep the green baseline, run no improvement attempt, never fabricate a critique. The pre-existing "no files in scope to judge" path is now scored:false too (it had the same latent bug).

Tests

  • judge: a parseable response → scored:true; an unparseable one → scored:false, overall:0, notes "unparseable".
  • quality: an unparseable judge → agent.implement is never called, attempts:0 (the regression this fixes).
  • bun run validate green (1825 tests).

Note

This is independent of the edit-on-autoformat work (#57) — different files (eval/judge.ts, loop/quality.ts). Separately, the judge being unparseable for that one seed at temp 0 is a robustness gap worth a follow-up (lenient judge-output parsing — the BAML idea from the roadmap), but the critical fix is to stop poisoning the loop.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a scored flag to track whether a judge successfully produced a usable score, allowing the quality repair loop to be skipped when no valid signal is available. The feedback suggests ensuring that scored is also set to false when a parseable JSON response lacks a valid overall score, and adding a corresponding test case to verify this behavior.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/core/src/eval/judge.ts Outdated
Comment thread packages/core/tests/judge.test.ts
agjs added 2 commits June 28, 2026 22:31
…critique

Found by running real evals: the quality loop judged a green fix-regression
solution, the judge returned an unparseable response (judge infra failure, not a
real 0/5), and the loop then told the model 'a senior reviewer scored it 0/5:
"unparseable judge response". Improve the code to address that critique.' The
model can't act on a non-critique — in the run.log it spiraled, reasoning in
circles about whether its (correct) code was broken, and burned both quality
attempts for zero gain.

Fix: IJudgeScore gains a `scored` flag (false on unparseable/errored). qualityRepair
returns early when the initial score is unscored (and the existing 'no files in
scope' case is now scored:false too) — keep the green baseline, run no improvement
attempt, never fabricate a critique. Tests cover both the judge flag and the
skip-the-loop behavior.
…l (Gemini review)

scored is now overall>0, not just 'JSON parsed' — a response missing/with an
out-of-range overall (clamped to 0) is no usable signal either, so the quality
loop skips it instead of acting on a fake 0/5. Tests for both cases.
@agjs agjs force-pushed the fix/judge-no-signal branch from f8f06b7 to bdfe9fc Compare June 28, 2026 20:34

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a harness quality-loop failure mode where an unparseable (or otherwise no-signal) judge result was treated as an actionable “0/5 critique” and fed back into the generator, wasting quality attempts and causing spirals during eval runs.

Changes:

  • Add a scored: boolean flag to IJudgeScore to distinguish real scores from judge infra/no-signal outcomes.
  • Update qualityRepair to short-circuit (0 attempts) when the initial judge result is unscored, avoiding fabricated critiques.
  • Add regression tests covering unparseable judge output for both judge() and qualityRepair().

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
packages/core/src/eval/eval.types.ts Extends IJudgeScore with scored to represent “usable vs no-signal” judge results.
packages/core/src/eval/judge.ts Marks unparseable judge output as scored:false and parseable output as scored:true.
packages/core/src/loop/quality.ts Skips the improvement loop entirely when the initial score is unscored (no usable judge signal).
packages/core/tests/judge.test.ts Adds coverage asserting scored:true for parseable JSON and scored:false for unparseable responses.
packages/core/tests/quality.test.ts Adds regression coverage ensuring unparseable judge output does not call agent.implement and yields attempts:0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/core/src/eval/judge.ts
Comment thread packages/core/src/eval/judge.ts
…ot review)

provider.complete() can throw (non-2xx, timeout, connection); that bubbled out of
the best-effort quality pass. Catch it and return a no-signal result (scored:false,
notes 'judge call failed') so the caller skips the loop. Test added.
@agjs agjs merged commit afde233 into main Jun 28, 2026
8 checks passed
@agjs agjs deleted the fix/judge-no-signal branch June 28, 2026 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants