feat: loop evolution — accept-rate metrics, post-green review-repair, greenfield multi-role harness#44

Merged
agjs merged 8 commits into main from feat/loop-evolution on Jun 22, 2026
@agjs agjs commented Jun 22, 2026

Why

Studied Anthropic's long-running-agent workshop + the Kopadze "loops" model and reviewed a Cursor-drafted roadmap. A codebase audit showed tsforge already implements most of that plan (deterministic gate + exit-code oracle, JSONL ledger + /trace, adversarial review find→verify, quality snapshot/revert, TTSR memory, browser oracle, two-phase buildStaged). So this PR builds only the genuinely missing, high-leverage pieces — and encodes the workshop's design rules as ratchets, not comments.

Five design rules adopted: verifier-is-the-heart, evaluator-blind-to-generator-traces (now test-enforced), filesystem-state > context, planner-stays-high-level, harness-co-evolves (every new layer flag/knob-gated).

What

Stage 1 — accept-rate metrics + post-green review-repair (6eca962)

  • New reverted loop event → edit_reverted ledger type; quality.ts emits it on both rollback paths. analyzeEvents derives editsReverted / acceptRate / costPerAcceptedChange, surfaced in /trace. (The data was already there — reverts just weren't reported.)
  • withReview recipe knob / --with-review: after a one-shot run goes green, run the adversarial review and feed verified findings into one repair cycle, reverting it if it breaks the gate. New loop/review-repair.ts.
  • Extracted shared loop/file-snapshot.ts (snapshot/restore) used by both quality- and review-repair.
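
A minimal sketch of how the Stage 1 accept-rate figures could be derived from ledger events. Only `edit_reverted`, `editsReverted`, `acceptRate`, and `costPerAcceptedChange` come from the description above; the `ILedgerEvent` shape, the `edit_applied` and `model_call` event names, the `costUsd` field, and the exact formulas are assumptions for illustration, not tsforge's real `analyzeEvents`.

```typescript
// Hypothetical, simplified ledger event; the real tsforge ledger shape may differ.
interface ILedgerEvent {
  type: "edit_applied" | "edit_reverted" | "model_call";
  costUsd?: number; // assumed cost field, present on model calls only
}

interface IAcceptMetrics {
  editsApplied: number;
  editsReverted: number;
  acceptRate: number | null; // null when no edits were applied
  costPerAcceptedChange: number | null; // null when nothing was accepted
}

// Assumed definitions: accepted = applied - reverted,
// acceptRate = accepted / applied, cost per accepted = total cost / accepted.
function deriveAcceptMetrics(events: readonly ILedgerEvent[]): IAcceptMetrics {
  let editsApplied = 0;
  let editsReverted = 0;
  let totalCost = 0;
  for (const event of events) {
    if (event.type === "edit_applied") editsApplied += 1;
    if (event.type === "edit_reverted") editsReverted += 1;
    totalCost += event.costUsd ?? 0;
  }
  const accepted = Math.max(0, editsApplied - editsReverted);
  return {
    editsApplied,
    editsReverted,
    acceptRate: editsApplied > 0 ? accepted / editsApplied : null,
    costPerAcceptedChange: accepted > 0 ? totalCost / accepted : null,
  };
}
```

Returning `null` instead of dividing by zero keeps a run with no edits distinguishable from a run where every edit was reverted.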

Stage 2 — greenfield filesystem-state outer loop (ba08dca)

  • loop/greenfield/: runGreenfield drives a .tsforge/greenfield/features.json checklist to all-green one feature at a time, persisting after every step (resumable; per-feature stuck guard so one feature can't wedge the build).
  • evaluateFeature = layered gate → browser(renderCheck) → judge, short-circuiting. Gate stays authority; browser is skip-tolerant; judge sees only the built artifact.
  • Recipe mode: "greenfield" knob — orchestration lives in a new loop entry, not in recipes (they stay declarative data).
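
The layered evaluator from Stage 2 can be sketched roughly as follows. This assumes injected layers and a three-state result; `LayerResult`, `IEvaluatorLayers`, `IVerdict`, and `evaluateFeatureSketch` are hypothetical names, and the layers are synchronous here for brevity (the real ones presumably are not).

```typescript
// Hypothetical layer result; not tsforge's real types.
type LayerResult =
  | { status: "pass" }
  | { status: "fail"; reason: string }
  | { status: "skip"; reason: string };

interface IEvaluatorLayers {
  gate: () => LayerResult;
  browser: () => LayerResult;
  // The judge receives only the built artifact, never the generator's trace.
  judge: (builtArtifact: string) => LayerResult;
}

interface IVerdict {
  passed: boolean;
  stoppedAt?: "gate" | "browser" | "judge";
  reason?: string;
}

function evaluateFeatureSketch(layers: IEvaluatorLayers, builtArtifact: string): IVerdict {
  // Gate stays authority: anything other than a pass stops evaluation.
  const gate = layers.gate();
  if (gate.status !== "pass") {
    return { passed: false, stoppedAt: "gate", reason: gate.reason };
  }
  // Browser is skip-tolerant: only an explicit failure stops evaluation.
  const browser = layers.browser();
  if (browser.status === "fail") {
    return { passed: false, stoppedAt: "browser", reason: browser.reason };
  }
  // Judge runs last and is treated as fail-closed in this sketch.
  const judge = layers.judge(builtArtifact);
  if (judge.status !== "pass") {
    return { passed: false, stoppedAt: "judge", reason: judge.reason };
  }
  return { passed: true };
}
```

Because each layer is only invoked after the previous one lets the feature through, a red gate never spends a browser launch or a judge call.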

Stage 3 — planner + reject-by-default judge + model routing + trace-blindness ratchet (c2e0383)

  • Planner turns a one-line goal into spec + checklist; reject-by-default feature judge (mirrors review's VERIFY_SYSTEM, fail-closed).
  • resolveModelByName + recipe plannerModel/workModel/evaluatorModel → per-role models, falling back to the active model (single-endpoint setups still work).
  • JUDGE_INPUT_SHAPE: Record<keyof IJudgeInput, true> + test: adding any field to the judge input is a compile error until it's listed, and the test rejects trace-ish names. Design rule #2 can't silently regress.
  • prepareState (resume-first, else plan) + greenfieldMode CLI entry + --greenfield flag / recipe dispatch.
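
The trace-blindness ratchet for design rule #2 (evaluator blind to generator traces) could look roughly like this. `JUDGE_INPUT_SHAPE` and `IJudgeInput` are named in the PR, but the fields shown, the `TRACE_ISH` pattern, and the `traceIshFields` helper are assumptions made up for illustration.

```typescript
// Hypothetical fields; the real IJudgeInput is not shown in this PR description.
interface IJudgeInput {
  featureId: string;
  featureDescription: string;
  builtArtifact: string;
}

// Record<keyof IJudgeInput, true> needs exactly one key per field, so adding a
// field to IJudgeInput without listing it here is a compile error.
const JUDGE_INPUT_SHAPE: Record<keyof IJudgeInput, true> = {
  featureId: true,
  featureDescription: true,
  builtArtifact: true,
};

// Assumed deny-list of "trace-ish" names; the real test may use different rules.
const TRACE_ISH = /trace|transcript|ledger|history|reasoning|thinking/i;

// Returns the listed field names that look like they carry generator traces.
function traceIshFields(shape: Record<string, true>): string[] {
  return Object.keys(shape).filter((name) => TRACE_ISH.test(name));
}
```

The type forces every new field through the listing, and a test asserting `traceIshFields(JUDGE_INPUT_SHAPE)` is empty then reviews the name at the moment it is listed.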

Stage 4 — experimental contract negotiation + scheduling (5287648)

  • loop/greenfield/contract.ts: TSFORGE_CONTRACT-gated (off by default) propose↔object negotiation; the evaluator sees only the proposal + feature (rule #2). Persisted to .tsforge/greenfield/contracts/<feature>.md. The workshop itself flagged this as unproven, so it's opt-in.
  • --notify <cmd>: best-effort shell ping on greenfield completion, outcome in $TSFORGE_STATUS (cron/unattended runs).
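
A rough sketch of the propose↔object negotiation loop, assuming injected generator and evaluator callbacks. `negotiateContractSketch`, `INegotiationDeps`, and the result shape are hypothetical; the round limit and the "evaluator sees only proposal + feature" rule come from the description, and the real calls are model requests rather than the synchronous functions used here.

```typescript
interface IReview {
  agreed: boolean;
  objections: string[];
}

interface INegotiationDeps {
  // The generator receives its own previous proposal so it can revise it.
  propose: (feature: string, previous: string | null, objections: string[]) => string;
  // The evaluator sees only the feature and the proposal, nothing else.
  review: (feature: string, proposal: string) => IReview;
}

interface INegotiationResult {
  contract: string;
  agreed: boolean;
  rounds: number;
}

function negotiateContractSketch(
  feature: string,
  deps: INegotiationDeps,
  maxRounds: number,
): INegotiationResult {
  let previous: string | null = null;
  let objections: string[] = [];
  for (let round = 1; round <= maxRounds; round += 1) {
    const proposal = deps.propose(feature, previous, objections);
    const verdict = deps.review(feature, proposal);
    if (verdict.agreed) {
      return { contract: proposal, agreed: true, rounds: round };
    }
    previous = proposal;
    objections = verdict.objections;
  }
  // Did not converge: return the last proposal, explicitly not agreed.
  return { contract: previous ?? "", agreed: false, rounds: maxRounds };
}
```

Keeping `agreed` separate from `contract` lets the caller label an unconverged result honestly instead of presenting the last proposal as settled.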

Docs (5287648, 1d89cb1) — loop/greenfield.mdx + updates to reference/commands, cli/recipes, observability/trace, reference/flags.

Notable scope decisions

  • Did not rebuild what exists (no "breadcrumbs" file — the ledger already records the failure gradient; evaluator reuses gate+oracle+judge; contract seeds off buildStaged).
  • The greenfield CLI's implement step uses runTask per feature against the global gate; a Session.send-based implement step would be more precise but needs live-model testing. Noted as a follow-up.

Verification

  • bun run validate green — 1420 tests, 0 fail.
  • Every new regression test was flip-and-confirm-failed (incl. the trace-blindness ratchet, the revert logic, and the per-feature stuck guard).
  • astro build clean — 44 pages.
  • The full greenfield end-to-end path needs a live model + Playwright; its orchestration core is fully unit-tested with mocks.

agjs added 5 commits June 22, 2026 12:25
…ecipe

Stage 1 of loop-evolution:
- 1A: emit a reverted accounting event from quality-repair rollbacks, map it
  to an edit_reverted ledger type, derive editsReverted / acceptRate /
  costPerAcceptedChange in analyzeEvents + /trace.
- 1B: withReview recipe knob + --with-review flag: after a one-shot run goes
  green, run reviewChange and feed verified findings into ONE repair cycle
  (reverts if it breaks the gate).
- Extract shared snapshot/restore to loop/file-snapshot.ts.
…ator

Stage 2 of loop-evolution:
- loop/greenfield/: runGreenfield drives a features.json checklist to
  all-green one feature at a time, persisting state after every step
  (resumable, per-feature stuck guard so one feature can't wedge the run).
- evaluateFeature: layered gate -> browser(renderCheck) -> judge, short-
  circuiting; gate stays authority, browser skip-tolerant, judge sees only
  the built artifact (design-rule #2). All layers injected (testable).
- state.ts: features.json/spec.md/progress.md read/write + pure renderProgress.
- recipe knob mode:'greenfield' selects the outer loop (orchestration stays
  out of recipes - they remain declarative data).
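
For illustration, the outer loop with its per-feature stuck guard might be sketched as below. `runGreenfieldSketch`, `IFeature`, `IGreenfieldDeps`, the status values, and the attempt limit are all hypothetical; only the behaviour (one feature at a time, persist after every step, a stuck feature cannot wedge the run) is taken from the commit message, and skipping non-pending features on resume is an assumption.

```typescript
interface IFeature {
  id: string;
  status: "pending" | "green" | "stuck";
  attempts: number;
}

interface IGreenfieldDeps {
  implement: (feature: IFeature) => void;
  evaluate: (feature: IFeature) => boolean;
  // Stand-in for writing features.json after every step.
  persist: (features: readonly IFeature[]) => void;
}

function runGreenfieldSketch(
  features: IFeature[],
  deps: IGreenfieldDeps,
  maxAttemptsPerFeature: number,
): IFeature[] {
  for (const feature of features) {
    // Resume-friendly: anything already decided in a previous run is skipped.
    if (feature.status !== "pending") continue;
    let passed = false;
    while (!passed && feature.attempts < maxAttemptsPerFeature) {
      feature.attempts += 1;
      deps.implement(feature);
      passed = deps.evaluate(feature);
      if (passed) feature.status = "green";
      deps.persist(features);
    }
    // Stuck guard: give up on this feature and move on to the next one.
    if (!passed) {
      feature.status = "stuck";
      deps.persist(features);
    }
  }
  return features;
}
```

Because state lives in the persisted checklist rather than in the loop's memory, an interrupted run can pick up from the last written step.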
… trace-blindness ratchet

Stage 3 of loop-evolution:
- greenfield/plan.ts: planner role turns a one-line goal into a high-level
  spec.md + feature checklist (parsePlan pure/tested).
- greenfield/judge.ts: harsh reject-by-default feature judge (mirrors
  review-change VERIFY_SYSTEM; fail-closed on unparseable).
- resolveModelByName + recipe plannerModel/workModel/evaluatorModel: route
  each greenfield role to its own model, falling back to active (single
  endpoint still works). Evaluator stays trace-blind regardless.
- JUDGE_INPUT_SHAPE: Record<keyof IJudgeInput,true> ratchet + test enforcing
  design-rule #2 (evaluator never sees the generator's trace).
- prepareState (resume-first, else plan) + greenfieldMode CLI entry +
  --greenfield flag / recipe mode dispatch.
…ify + docs

Stage 4 of loop-evolution:
- greenfield/contract.ts: TSFORGE_CONTRACT-gated (off by default) propose<->object
  negotiation; generator proposes a build contract, evaluator pushes back to
  'agreed' or maxRounds. Evaluator sees only the proposal+feature (rule #2).
  Persisted to .tsforge/greenfield/contracts/<feature>.md. Wired into the
  greenfield implement step behind the flag.
- --notify <cmd>: best-effort shell ping on greenfield completion with the
  outcome in $TSFORGE_STATUS (for cron/unattended runs).
- docs: loop/greenfield.mdx (greenfield mode, model routing, contract flag,
  scheduling/cron + --notify) + sidebar entry.

Full `bun run validate` green (1420 tests, 0 fail).
- commands.mdx: greenfield build section, --with-review one-shot note,
  accept-rate/cost in the trace summary line.
- cli/recipes.mdx: withReview, mode, planner/work/evaluatorModel fields.
- observability/trace.mdx: accept rate + cost/accepted in the example + table.
- reference/flags.mdx: TSFORGE_CONTRACT.
Docs site builds clean (44 pages).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces 'Greenfield builds' to tsforge, enabling feature-by-feature application development driven by a filesystem-tracked checklist, role-based model routing, and a layered evaluator (gate, browser, and judge). It also adds post-green review-and-repair capabilities, CLI flags like --greenfield and --with-review, and metrics for tracking edit accept rates. The review feedback highlights several critical improvements: expanding the default **/* file scope in snapshotFiles and scopeCode to prevent literal path errors, passing the previous contract during negotiation rounds so the generator can make informed revisions, adding defensive checks on browser.errors to avoid runtime TypeErrors, and wrapping the review-repair loop in a try...catch block to guarantee workspace rollback on failure.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/core/src/loop/file-snapshot.ts
Comment thread packages/core/src/cli.ts
Comment thread packages/core/src/loop/greenfield/contract.ts
Comment thread packages/core/src/loop/greenfield/contract.ts
Comment thread packages/core/src/loop/greenfield/evaluate.ts
Comment thread packages/core/src/loop/review-repair.ts
agjs added 3 commits June 22, 2026 14:07
…contract revision context

Address PR #44 review:
- snapshotFiles: expand globs via resolveScopeFiles before snapshotting. The
  whole-repo default scope ['**/*'] was snapshotted literally (never exists),
  so review-repair's revert was a silent no-op by default. (file-snapshot.test.ts)
- reviewRepair: wrap implement+fix+gate in try/catch — a throw mid-repair now
  rolls the workspace back and rethrows, instead of leaving a half-applied edit.
- negotiateContract: pass the generator its OWN previous proposal on a revision
  round (the stateless call couldn't revise without it).
Not changed (false positives): scopeCode '**/*' (readFiles already glob-expands);
browser.errors?. (errors is a required string[] on IRenderResult).
[P1] greenfield implement no-op on green gate: runTask is RED-first and returns
  before the model when the gate already passes. Add IRunOptions.requireRed
  (default true); greenfield passes false. Extracted redPrecheck() helper.
[P1] rollback misses created files + 400-file cap: file-snapshot records the
  pre-existing path SET (uncapped) + contents; restore tombstones files that
  appeared after the snapshot. resolveScopeFiles/expandGlob take a limit.
[P2] planner/evaluator fall back to args.model (not just workModel).
[P2] greenfield per-feature runTask honours maxTurns + thinkingBudget.
[P3] contract prefix no longer claims 'Agreed' when negotiation didn't converge.
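
The rollback fixes above (revert on a thrown repair, and tombstoning files created after the snapshot) can be illustrated with an in-memory stand-in for the workspace. The `Workspace` map, `takeSnapshot`, `restoreSnapshot`, and `repairWithRollback` are hypothetical names for this sketch; the real loop/file-snapshot.ts works against the filesystem and expanded globs.

```typescript
// In-memory stand-in for the workspace: path -> file contents.
type Workspace = Map<string, string>;

interface ISnapshot {
  existingPaths: Set<string>; // the full pre-existing path set
  contents: Map<string, string>;
}

function takeSnapshot(ws: Workspace): ISnapshot {
  const existingPaths = new Set<string>();
  const contents = new Map<string, string>();
  ws.forEach((text, path) => {
    existingPaths.add(path);
    contents.set(path, text);
  });
  return { existingPaths, contents };
}

function restoreSnapshot(ws: Workspace, snap: ISnapshot): void {
  // Tombstone: delete anything that did not exist at snapshot time.
  Array.from(ws.keys()).forEach((path) => {
    if (!snap.existingPaths.has(path)) ws.delete(path);
  });
  // Then put the original contents back.
  snap.contents.forEach((text, path) => ws.set(path, text));
}

// Runs a repair; keeps it only if the gate passes. A failing gate or a throw
// mid-repair both restore the snapshot, and the throw is rethrown.
function repairWithRollback(
  ws: Workspace,
  repair: () => void,
  gatePasses: () => boolean,
): boolean {
  const snap = takeSnapshot(ws);
  try {
    repair();
    if (gatePasses()) return true;
    restoreSnapshot(ws, snap);
    return false;
  } catch (err) {
    restoreSnapshot(ws, snap);
    throw err;
  }
}
```

Recording the path set separately from the contents is what lets the restore remove newly created files rather than only overwriting the ones that were edited.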
… setup`

runWizard used emitKeypressEvents + a keypress listener but never enabled raw
mode — it assumed an already-raw caller (the REPL's readline, for /setup).
Standalone `tsforge setup` runs with cooked stdin, so arrow keys echoed as raw
^[[A/^[[B and never moved the selection. runWizard now enables raw mode (and
resumes stdin) when there were no prior keypress listeners (i.e. nobody else
owns stdin), restoring it on exit; the REPL /setup path (raw already on, with
saved listeners) is left untouched.
@agjs agjs merged commit c377869 into main Jun 22, 2026
8 checks passed
@agjs agjs deleted the feat/loop-evolution branch June 22, 2026 13:01