
test(e2e): replicate the real user experience across subsystems #53

Merged
agjs merged 7 commits into main from feat/e2e-harness on Jun 27, 2026

agjs (Owner) commented Jun 27, 2026

Generalizes the editor's VirtualScreen win so the real user experience is replayable in tests — asserting observable behavior (rendered screen, tool runs, file changes, gate verdicts), not just logical state. Goal: catch breakage in CI before a human hits it by hand.

What's covered

Reusable harnesses

  • ScriptedModel (tests/helpers/scripted-model.ts) — a deterministic IProvider driving the REAL agent loop from scripted turns; a turn can be a function of the conversation so far, so it reacts to gate feedback / tool results.
  • runScriptedSession (tests/helpers/session-harness.ts) — wires it to a real Session (real tools, real gate, temp cwd), captures the ILoopEvent stream + file changes.
  • VirtualScreen (shipped in v0.24.0) — renders emitted bytes to a screen grid; reused here for every TTY-render assertion.
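As a rough illustration of the ScriptedModel idea described above, here is a minimal sketch. The `Message` shape, the `next` method, and the class name are assumptions made for this example; the project's actual IProvider interface is not shown in this PR, and a real provider would most likely be asynchronous. The sketch also throws once the script is exhausted, so a loop that fails to terminate surfaces as a test failure rather than a hang.

```typescript
// Hypothetical message shape; the project's IProvider is not shown here.
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// A scripted turn is a fixed reply or a function of the conversation so far.
type Turn = string | ((conversation: readonly Message[]) => string);

class ScriptedModelSketch {
  private readonly turns: readonly Turn[];
  private calls = 0;

  constructor(turns: readonly Turn[]) {
    this.turns = turns;
  }

  // Synchronous for brevity; a real provider would most likely be async.
  next(conversation: readonly Message[]): Message {
    const turn = this.turns[this.calls];
    if (turn === undefined) {
      // Fail fast when the loop asks for more turns than were scripted.
      throw new Error(
        `script exhausted: call ${this.calls + 1} but only ${this.turns.length} turns`,
      );
    }
    this.calls += 1;
    const content = typeof turn === "function" ? turn(conversation) : turn;
    return { role: "assistant", content };
  }
}

// Usage: the second turn reacts to the most recent tool result.
const model = new ScriptedModelSketch([
  "read notes.txt",
  (conversation) => {
    const lastTool = [...conversation].reverse().find((m) => m.role === "tool");
    return `create copy.txt with: ${lastTool ? lastTool.content : "(nothing)"}`;
  },
]);
```

Because a turn can inspect the conversation, a single script can cover flows such as "gate fails, model repairs" without a live LLM.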

Tier 1 — agent session loop (12 tests)

Conversational turn, create→disk, passing gate→done, failing gate→repair→green→done, out-of-scope rejection, run-tool shell exec, edit-tool snippet replace, read round-trip (tool result reaches the model), multi-file build, plan-mode write rejection (the read-only guarantee), plan-mode read allowed, auto-fix runs before re-gate.
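The plan-mode items in this list rest on a read-only guarantee: writes are refused while reads still go through. A minimal sketch of such a check follows. Every name here is hypothetical (only "plan mode" itself comes from the PR), and the assumption that create and edit are the writing tools is mine, not something the PR states.

```typescript
// Hypothetical names throughout; only "plan mode" itself comes from the PR text.
type PolicyMode = "default" | "plan";
type ToolName = "read" | "create" | "edit";

type Verdict = { allowed: true } | { allowed: false; reason: string };

// Assumption: create and edit are the tools that write to disk.
const WRITE_TOOLS: ReadonlySet<ToolName> = new Set<ToolName>(["create", "edit"]);

function checkToolCall(mode: PolicyMode, tool: ToolName): Verdict {
  if (mode === "plan" && WRITE_TOOLS.has(tool)) {
    return { allowed: false, reason: `${tool} is rejected in plan mode (read-only)` };
  }
  return { allowed: true };
}
```

An e2e test for this guarantee would then assert both halves: the rejected create leaves no file on disk, and a read in the same mode still returns its result.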

Tier 3 — interactive TTY overlays (11 tests)

Wizard rendered at each step (title, cursor gutter, checkbox toggle, overview), command palette (filtered list + selection), @-file picker dropdown, and overlay-shrink → no ghost rows (the same bug class as the editor, now guarded for the picker). These close the render blind spot: the existing wizard tests assert reducer STATE only.
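To make the ghost-row bug class concrete, here is a small sketch of a screen model that interprets emitted bytes, plus an overlay renderer that can skip or perform the stale-row cleanup. `TinyScreen`, `drawOverlay`, and `shrinkOverlay` are hypothetical names for this example; the project's VirtualScreen and StatusBar.setOverlay APIs are not shown in this PR and may work differently. Only three CSI sequences are handled, which is enough for the demonstration.

```typescript
// Hypothetical, minimal screen model; the project's VirtualScreen API is not shown here.
class TinyScreen {
  private readonly rows: number;
  private readonly cols: number;
  private readonly grid: string[][];
  private row = 0;
  private col = 0;

  constructor(rows: number, cols: number) {
    this.rows = rows;
    this.cols = cols;
    this.grid = Array.from({ length: rows }, () => new Array<string>(cols).fill(" "));
  }

  // Handles printable characters, "\r", "\n", and three CSI sequences:
  // cursor position (H), erase display (2J), erase line (K).
  write(data: string): void {
    let i = 0;
    while (i < data.length) {
      const csi = /^\x1b\[([0-9;]*)([HJK])/.exec(data.slice(i));
      if (csi) {
        this.applyCsi(csi[1] ?? "", csi[2] ?? "");
        i += csi[0].length;
        continue;
      }
      const ch = data.charAt(i);
      if (ch === "\n") this.row = Math.min(this.row + 1, this.rows - 1);
      else if (ch === "\r") this.col = 0;
      else if (this.col < this.cols) this.put(this.row, this.col++, ch);
      i += 1;
    }
  }

  // Returns each row as text with trailing spaces removed.
  lines(): string[] {
    return this.grid.map((line) => line.join("").replace(/\s+$/, ""));
  }

  private put(r: number, c: number, ch: string): void {
    const line = this.grid[r];
    if (line && c >= 0 && c < this.cols) line[c] = ch;
  }

  private eraseLine(r: number, from: number): void {
    for (let c = from; c < this.cols; c++) this.put(r, c, " ");
  }

  private applyCsi(params: string, final: string): void {
    if (final === "H") {
      const [r, c] = params.split(";");
      this.row = Math.min(Math.max(Number(r || "1") - 1, 0), this.rows - 1);
      this.col = Math.min(Math.max(Number(c || "1") - 1, 0), this.cols - 1);
    } else if (final === "J") {
      if (params === "2") for (let r = 0; r < this.rows; r++) this.eraseLine(r, 0);
    } else {
      // K: "2" erases the whole line; any other parameter is treated as
      // "erase from the cursor to the end of the line" (a simplification).
      this.eraseLine(this.row, params === "2" ? 0 : this.col);
    }
  }
}

// Draws one item per row; optionally erases rows the previous frame used.
function drawOverlay(
  screen: TinyScreen,
  items: readonly string[],
  previousCount: number,
  clearStale: boolean,
): void {
  items.forEach((item, i) => screen.write(`\x1b[${i + 1};1H\x1b[2K${item}`));
  if (!clearStale) return;
  for (let i = items.length; i < previousCount; i++) screen.write(`\x1b[${i + 1};1H\x1b[2K`);
}

// Shrinks a 5-item overlay to 2 items and returns what the screen shows.
function shrinkOverlay(clearStale: boolean): string[] {
  const screen = new TinyScreen(6, 20);
  drawOverlay(screen, ["one", "two", "three", "four", "five"], 0, clearStale);
  drawOverlay(screen, ["uno", "dos"], 5, clearStale);
  return screen.lines();
}
```

With `clearStale` set to false, rows 3 to 5 still show "three", "four", and "five" after the shrink. A state-only assertion would not notice this, because the overlay's item list is already correct; only a check against the rendered grid does.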

Findings (no duplication added)

  • Scaffolder (Tier 2) — already comprehensively covered by scaffold-run.test.ts (full runScaffold: clone+configure+boot+gate, skipBoot, invalid-config refusal, manifest-source-of-truth, astro). No gap.
  • Review (Tier 4) — already covered: review-change.test.ts drives reviewChange with stub providers (find/verify passes, non-JSON). No gap.

The two genuine gaps were the session-loop e2e and the TTY-render e2e — both filled here.

Stacked off main after v0.24.0.

agjs added 2 commits June 27, 2026 12:25
Foundation for replicating the real user experience in tests, reusable by every
later flow. ScriptedModel is a deterministic IProvider that drives the REAL agent
loop from a sequence of turns; runScriptedSession wires it to a real Session with
real tools + real gate over a temp cwd and captures the observable event stream
(ILoopEvent) + file changes.

6 Tier-1 tests prove the loop end-to-end: conversational turn (no changes),
create-tool → file on disk, passing gate → done, failing gate → repair → green →
done, out-of-scope create rejected, run tool executes a shell command. No live
LLM, fully deterministic.

3 more session e2e tests: the edit tool replaces a snippet in a seeded file; a
read tool result flows back into the conversation (the reactive turn copies the
read value into a new file, proving the tool-result→model loop); a multi-file
build creates every file in one session. Reuses the scripted-model harness.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a deterministic end-to-end testing harness for the agent loop, featuring a scripted model provider (scripted-model.ts), a session runner helper (session-harness.ts), and an integration test suite (session-e2e.test.ts). The feedback recommends guarding the scripted model against infinite loops by throwing an error when calls exceed the defined turns, and replacing Unix-specific shell commands like `true` and `test -f` with cross-platform Node.js commands to ensure compatibility with Windows environments.

Comment thread packages/core/tests/helpers/scripted-model.ts
Comment thread packages/core/tests/session-e2e.test.ts Outdated
Comment thread packages/core/tests/session-e2e.test.ts Outdated
agjs added 3 commits June 27, 2026 12:33
Drives the wizard reducer through real key-actions and renders each frame to a
VirtualScreen (exactly as the driver does: CLEAR_HOME + renderFrame), asserting
what the USER SEES — step title, cursor gutter on the right option, checkbox
glyphs flipping on toggle, the overview. Same for the command palette's
renderMenu (filtered list + selection). The existing wizard tests assert reducer
STATE only; these close the render blind spot that hid the editor ghost bugs.
7 tests.

Harness gains policyMode. 3 tests guard safety-critical loop behavior: plan mode
rejects a create (no write escapes — the read-only guarantee), plan mode still
allows reads, and the auto-fix command runs before the gate re-validates (proven
via a marker file the gate checks for).

4 tests: query narrows the dropdown + selection gutter renders; empty match shows
the 'no matching file' hint; truncatePath fits the width; and StatusBar.setOverlay
shrinking from 5→2 items leaves NO ghost rows — the same bug class as the editor,
now guarded for the @-picker overlay too.
@agjs agjs marked this pull request as ready for review June 27, 2026 10:40
agjs added 2 commits June 27, 2026 12:42
Exports actionFor so the keypress→action mapping is unit-testable (the raw-mode
TTY plumbing around it stays PTY-only — a documented Bun limit). 5 tests guard
the decode: arrows→nav, space→toggle, enter/return→confirm, escape/ctrl-c→cancel,
b→back, q→cancel, unknown→ignored, and that a decoded action drives the reducer.
Guards the key-mapping regression class (cf. the past wizard arrow-key bug).
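
The decode table in this commit message can be sketched as a pure function, which is what makes it unit-testable without a PTY. The `KeyPress` and `WizardAction` shapes, the key names, and the toy reducer below are assumptions for illustration; the exported actionFor's real signature is not shown in this PR.

```typescript
// Hypothetical key and action shapes; the real actionFor signature is not shown here.
interface KeyPress {
  name?: string;
  ctrl?: boolean;
}

type WizardAction = "up" | "down" | "toggle" | "confirm" | "cancel" | "back";

function actionForSketch(key: KeyPress): WizardAction | undefined {
  // Ctrl-C cancels; other ctrl combinations are ignored in this sketch.
  if (key.ctrl) return key.name === "c" ? "cancel" : undefined;
  switch (key.name) {
    case "up":
      return "up";
    case "down":
      return "down";
    case "space":
      return "toggle";
    case "enter":
    case "return":
      return "confirm";
    case "escape":
    case "q":
      return "cancel";
    case "b":
      return "back";
    default:
      return undefined; // unknown keys are ignored
  }
}

// A toy reducer so a decoded action can be shown driving state.
interface WizardState {
  cursor: number;
  checked: boolean[];
}

function reduceSketch(state: WizardState, action: WizardAction): WizardState {
  const last = state.checked.length - 1;
  switch (action) {
    case "up":
      return { ...state, cursor: Math.max(0, state.cursor - 1) };
    case "down":
      return { ...state, cursor: Math.min(last, state.cursor + 1) };
    case "toggle":
      return {
        ...state,
        checked: state.checked.map((v, i) => (i === state.cursor ? !v : v)),
      };
    default:
      return state; // confirm, cancel, and back are out of scope for this toy reducer
  }
}
```

Keeping the mapping free of I/O means a regression such as a broken arrow key shows up as a failing unit test on a plain object, with no terminal involved.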
…mmands + adversarial hunt

- scripted-model: throw when called past the script (loop didn't terminate) so a
  misconfigured test fails fast instead of hanging (Gemini review).
- session-e2e: replace Unix-only gate/fix commands (true, test -f, touch, exit) with
  portable `node -e` equivalents so the suite runs on Windows too (Gemini review).
- add session-e2e-hunt.test.ts: 7 adversarial probes (non-matching/ambiguous edit,
  create-over-existing, failing gate command, maxTurns exhaustion, malformed tool
  args, non-zero exit) — all pass, so the loop is robust on these edges (now guarded).
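
For the portability change in the second bullet, the replacements might look roughly like the following. These are illustrative command strings, not the ones actually committed, and the `portable` object and its property names are hypothetical. The behavior is close to the Unix originals but not identical, as the comments note.

```typescript
// Illustrative command strings; the exact commands used in session-e2e are not shown here.
// Paths are interpolated unescaped, so this sketch assumes simple relative names
// without quotes or backslashes.
const portable = {
  // Stands in for `true`: always exits 0.
  pass: 'node -e "process.exit(0)"',
  // Stands in for `exit 1`: always exits non-zero.
  fail: 'node -e "process.exit(1)"',
  // Stands in for `test -f`. Note that existsSync is also true for directories.
  exists: (path: string): string =>
    `node -e "process.exit(require('fs').existsSync('${path}') ? 0 : 1)"`,
  // Stands in for `touch`. Unlike touch, this truncates an existing file.
  touch: (path: string): string =>
    `node -e "require('fs').writeFileSync('${path}', '')"`,
};
```

The outer double quotes with inner single quotes are chosen so the same string is accepted by both POSIX shells and cmd.exe. An auto-fix test could then use `portable.touch("marker")` as the fix command and `portable.exists("marker")` as the gate.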
@agjs agjs merged commit e9c1b5f into main Jun 27, 2026
8 checks passed
@agjs agjs deleted the feat/e2e-harness branch June 27, 2026 10:56