fix(loop): never feed a no-signal judge result to the generator as a critique by agjs · Pull Request #58 · agjs/tsforge

agjs · 2026-06-28T20:29:09Z

Found by running real evals (you asked me to — and it paid off). A 3-seed sweep against the local DeepSeek-V4-Flash cluster passed the gate on all three, but fix-regression came back Q0/5. The cause wasn't the solution — it was a harness bug.

The bug

The quality loop judged the (green, correct) solution; the judge returned an unparseable response — a judge infrastructure failure, not a real score. The loop then handed the generator:

"The code is green but a senior reviewer scored it 0/5: "unparseable judge response". Improve the code to address that critique."

The model can't act on a non-critique. In the run.log it spiraled — pages of "the reviewer says it's broken but the code looks correct… maybe it's a system glitch… I'll make no changes" — and burned both quality attempts for zero gain. A pure friction sink, squarely against the lift-local-models north star.

The fix

IJudgeScore gains a scored flag — false when the judge produced no usable score (unparseable / errored).
qualityRepair returns early when the initial score is unscored: keep the green baseline, run no improvement attempt, never fabricate a critique. The pre-existing "no files in scope to judge" path is now scored:false too (it had the same latent bug).

Tests

judge: a parseable response → scored:true; an unparseable one → scored:false, overall:0, notes "unparseable".
quality: an unparseable judge → agent.implement is never called, attempts:0 (the regression this fixes).
bun run validate green (1825 tests).

Note

This is independent of the edit-on-autoformat work (#57) — different files (eval/judge.ts, loop/quality.ts). Separately, the judge being unparseable for that one seed at temp 0 is a robustness gap worth a follow-up (lenient judge-output parsing — the BAML idea from the roadmap), but the critical fix is to stop poisoning the loop.

gemini-code-assist

Code Review

This pull request introduces a scored flag to track whether a judge successfully produced a usable score, allowing the quality repair loop to be skipped when no valid signal is available. The feedback suggests ensuring that scored is also set to false when a parseable JSON response lacks a valid overall score, and adding a corresponding test case to verify this behavior.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

…critique Found by running real evals: the quality loop judged a green fix-regression solution, the judge returned an unparseable response (judge infra failure, not a real 0/5), and the loop then told the model 'a senior reviewer scored it 0/5: "unparseable judge response". Improve the code to address that critique.' The model can't act on a non-critique — in the run.log it spiraled, reasoning in circles about whether its (correct) code was broken, and burned both quality attempts for zero gain. Fix: IJudgeScore gains a `scored` flag (false on unparseable/errored). qualityRepair returns early when the initial score is unscored (and the existing 'no files in scope' case is now scored:false too) — keep the green baseline, run no improvement attempt, never fabricate a critique. Tests cover both the judge flag and the skip-the-loop behavior.

…l (Gemini review) scored is now overall>0, not just 'JSON parsed' — a response missing/with an out-of-range overall (clamped to 0) is no usable signal either, so the quality loop skips it instead of acting on a fake 0/5. Tests for both cases.

Copilot

Pull request overview

Fixes a harness quality-loop failure mode where an unparseable (or otherwise no-signal) judge result was treated as an actionable “0/5 critique” and fed back into the generator, wasting quality attempts and causing spirals during eval runs.

Changes:

Add a scored: boolean flag to IJudgeScore to distinguish real scores from judge infra/no-signal outcomes.
Update qualityRepair to short-circuit (0 attempts) when the initial judge result is unscored, avoiding fabricated critiques.
Add regression tests covering unparseable judge output for both judge() and qualityRepair().

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
packages/core/src/eval/eval.types.ts	Extends `IJudgeScore` with `scored` to represent “usable vs no-signal” judge results.
packages/core/src/eval/judge.ts	Marks unparseable judge output as `scored:false` and parseable output as `scored:true`.
packages/core/src/loop/quality.ts	Skips the improvement loop entirely when the initial score is unscored (no usable judge signal).
packages/core/tests/judge.test.ts	Adds coverage asserting `scored:true` for parseable JSON and `scored:false` for unparseable responses.
packages/core/tests/quality.test.ts	Adds regression coverage ensuring unparseable judge output does not call `agent.implement` and yields `attempts:0`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ot review) provider.complete() can throw (non-2xx, timeout, connection); that bubbled out of the best-effort quality pass. Catch it and return a no-signal result (scored:false, notes 'judge call failed') so the caller skips the loop. Test added.

gemini-code-assist Bot reviewed Jun 28, 2026

View reviewed changes

Comment thread packages/core/src/eval/judge.ts Outdated

Comment thread packages/core/tests/judge.test.ts

agjs requested a review from Copilot June 28, 2026 20:31

Copilot started reviewing on behalf of agjs June 28, 2026 20:31 View session

agjs added 2 commits June 28, 2026 22:31

agjs force-pushed the fix/judge-no-signal branch from f8f06b7 to bdfe9fc Compare June 28, 2026 20:34

Copilot AI reviewed Jun 28, 2026

View reviewed changes

Comment thread packages/core/src/eval/judge.ts

Comment thread packages/core/src/eval/judge.ts

agjs merged commit afde233 into main Jun 28, 2026
8 checks passed

agjs deleted the fix/judge-no-signal branch June 28, 2026 20:41

agjs mentioned this pull request Jun 29, 2026

Overnight harness-improvement run (eval-driven, WIP — do not merge until reviewed) #61

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(loop): never feed a no-signal judge result to the generator as a critique#58

fix(loop): never feed a no-signal judge result to the generator as a critique#58
agjs merged 3 commits into
mainfrom
fix/judge-no-signal

agjs commented Jun 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

agjs commented Jun 28, 2026

The bug

The fix

Tests

Note

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants