fix(loop): never feed a no-signal judge result to the generator as a critique#58
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a scored flag to track whether a judge successfully produced a usable score, allowing the quality repair loop to be skipped when no valid signal is available. The feedback suggests ensuring that scored is also set to false when a parseable JSON response lacks a valid overall score, and adding a corresponding test case to verify this behavior.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…critique Found by running real evals: the quality loop judged a green fix-regression solution, the judge returned an unparseable response (judge infra failure, not a real 0/5), and the loop then told the model 'a senior reviewer scored it 0/5: "unparseable judge response". Improve the code to address that critique.' The model can't act on a non-critique — in the run.log it spiraled, reasoning in circles about whether its (correct) code was broken, and burned both quality attempts for zero gain. Fix: IJudgeScore gains a `scored` flag (false on unparseable/errored). qualityRepair returns early when the initial score is unscored (and the existing 'no files in scope' case is now scored:false too) — keep the green baseline, run no improvement attempt, never fabricate a critique. Tests cover both the judge flag and the skip-the-loop behavior.
…l (Gemini review) scored is now overall>0, not just 'JSON parsed' — a response missing/with an out-of-range overall (clamped to 0) is no usable signal either, so the quality loop skips it instead of acting on a fake 0/5. Tests for both cases.
f8f06b7 to
bdfe9fc
Compare
There was a problem hiding this comment.
Pull request overview
Fixes a harness quality-loop failure mode where an unparseable (or otherwise no-signal) judge result was treated as an actionable “0/5 critique” and fed back into the generator, wasting quality attempts and causing spirals during eval runs.
Changes:
- Add a
scored: booleanflag toIJudgeScoreto distinguish real scores from judge infra/no-signal outcomes. - Update
qualityRepairto short-circuit (0 attempts) when the initial judge result is unscored, avoiding fabricated critiques. - Add regression tests covering unparseable judge output for both
judge()andqualityRepair().
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/eval/eval.types.ts | Extends IJudgeScore with scored to represent “usable vs no-signal” judge results. |
| packages/core/src/eval/judge.ts | Marks unparseable judge output as scored:false and parseable output as scored:true. |
| packages/core/src/loop/quality.ts | Skips the improvement loop entirely when the initial score is unscored (no usable judge signal). |
| packages/core/tests/judge.test.ts | Adds coverage asserting scored:true for parseable JSON and scored:false for unparseable responses. |
| packages/core/tests/quality.test.ts | Adds regression coverage ensuring unparseable judge output does not call agent.implement and yields attempts:0. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ot review) provider.complete() can throw (non-2xx, timeout, connection); that bubbled out of the best-effort quality pass. Catch it and return a no-signal result (scored:false, notes 'judge call failed') so the caller skips the loop. Test added.
Found by running real evals (you asked me to — and it paid off). A 3-seed sweep against the local DeepSeek-V4-Flash cluster passed the gate on all three, but
fix-regressioncame back Q0/5. The cause wasn't the solution — it was a harness bug.The bug
The quality loop judged the (green, correct) solution; the judge returned an unparseable response — a judge infrastructure failure, not a real score. The loop then handed the generator:
The model can't act on a non-critique. In the run.log it spiraled — pages of "the reviewer says it's broken but the code looks correct… maybe it's a system glitch… I'll make no changes" — and burned both quality attempts for zero gain. A pure friction sink, squarely against the lift-local-models north star.
The fix
IJudgeScoregains ascoredflag —falsewhen the judge produced no usable score (unparseable / errored).qualityRepairreturns early when the initial score is unscored: keep the green baseline, run no improvement attempt, never fabricate a critique. The pre-existing "no files in scope to judge" path is nowscored:falsetoo (it had the same latent bug).Tests
judge: a parseable response →scored:true; an unparseable one →scored:false,overall:0, notes "unparseable".quality: an unparseable judge →agent.implementis never called,attempts:0(the regression this fixes).bun run validategreen (1825 tests).Note
This is independent of the edit-on-autoformat work (#57) — different files (
eval/judge.ts,loop/quality.ts). Separately, the judge being unparseable for that one seed at temp 0 is a robustness gap worth a follow-up (lenient judge-output parsing — the BAML idea from the roadmap), but the critical fix is to stop poisoning the loop.