test(e2e): replicate the real user experience across subsystems #53
Foundation for replicating the real user experience in tests, reusable by every later flow. ScriptedModel is a deterministic IProvider that drives the REAL agent loop from a sequence of turns; runScriptedSession wires it to a real Session with real tools + real gate over a temp cwd and captures the observable event stream (ILoopEvent) + file changes. 6 Tier-1 tests prove the loop end-to-end: conversational turn (no changes), create-tool → file on disk, passing gate → done, failing gate → repair → green → done, out-of-scope create rejected, run tool executes a shell command. No live LLM, fully deterministic.
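A scripted provider of this kind can be sketched roughly as follows. This is a minimal illustration, not the code in scripted-model.ts: the names `ScriptedModelSketch`, `SketchMessage`, `SketchTurn` and the synchronous `complete` method are assumptions made for the example, and the real IProvider interface is not shown in this PR text.

```typescript
// Hypothetical shapes for illustration only; the real IProvider and message
// types live in the repository and may look different.
type SketchMessage = { role: "user" | "assistant" | "tool"; content: string };

// A turn is either a fixed reply or a function of the conversation so far.
type SketchTurn = string | ((history: readonly SketchMessage[]) => string);

class ScriptedModelSketch {
  private readonly turns: readonly SketchTurn[];
  private calls = 0;

  constructor(turns: readonly SketchTurn[]) {
    this.turns = turns;
  }

  // Returns the next scripted reply. A function turn can inspect the history
  // (for example gate feedback or a tool result) before answering.
  complete(history: readonly SketchMessage[]): string {
    const turn = this.turns[this.calls];
    if (turn === undefined) {
      // Fail fast instead of hanging when the loop asks for more turns than
      // the script defines.
      throw new Error(
        `ScriptedModelSketch: call ${this.calls + 1} exceeds the ${this.turns.length} scripted turns`,
      );
    }
    this.calls += 1;
    return typeof turn === "string" ? turn : turn(history);
  }
}
```

Because a turn may be a function, a test can script a reply that depends on what the loop fed back, which is what makes a repair-after-failing-gate flow expressible without a live LLM.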
3 more session e2e tests: the edit tool replaces a snippet in a seeded file; a read tool result flows back into the conversation (the reactive turn copies the read value into a new file, proving the tool-result→model loop); a multi-file build creates every file in one session. Reuses the scripted-model harness.
Code Review
This pull request introduces a deterministic end-to-end testing harness for the agent loop, featuring a scripted model provider (scripted-model.ts), a session runner helper (session-harness.ts), and an integration test suite (session-e2e.test.ts). The feedback recommends guarding the scripted model against infinite loops by throwing an error when calls exceed the defined turns, and replacing Unix-specific shell commands like true and test -f with cross-platform Node.js commands to ensure compatibility with Windows environments.
Drives the wizard reducer through real key-actions and renders each frame to a VirtualScreen (exactly as the driver does: CLEAR_HOME + renderFrame), asserting what the USER SEES — step title, cursor gutter on the right option, checkbox glyphs flipping on toggle, the overview. Same for the command palette's renderMenu (filtered list + selection). The existing wizard tests assert reducer STATE only; these close the render blind spot that hid the editor ghost bugs. 7 tests.
Harness gains policyMode. 3 tests guard safety-critical loop behavior: plan mode rejects a create (no write escapes — the read-only guarantee), plan mode still allows reads, and the auto-fix command runs before the gate re-validates (proven via a marker file the gate checks for).
4 tests: query narrows the dropdown + selection gutter renders; empty match shows the 'no matching file' hint; truncatePath fits the width; and StatusBar.setOverlay shrinking from 5→2 items leaves NO ghost rows — the same bug class as the editor, now guarded for the @-picker overlay too.
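The ghost-row bug class can be illustrated with a small sketch. It assumes a hypothetical region modeled as one string per row and a hypothetical `drawOverlaySketch` helper; it is not the StatusBar.setOverlay implementation, only the idea that a shrinking overlay must blank the rows it no longer uses.

```typescript
// Hypothetical stand-in for a terminal region: one string per row.
type RowsSketch = string[];

// Draws `items` into `rows` starting at `top`, then blanks any rows the
// previous (taller) overlay occupied so none of them linger as ghosts.
// Returns the item count so the caller can pass it as the next previousCount.
function drawOverlaySketch(
  rows: RowsSketch,
  top: number,
  items: readonly string[],
  previousCount: number,
): number {
  items.forEach((item, i) => {
    rows[top + i] = item;
  });
  for (let i = items.length; i < previousCount; i++) {
    rows[top + i] = "";
  }
  return items.length;
}
```

A test in this style draws 5 items, then 2, and asserts that the three rows below the second item are empty. Without the clearing loop those rows would still hold the old items.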
Exports actionFor so the keypress→action mapping is unit-testable (the raw-mode TTY plumbing around it stays PTY-only — a documented Bun limit). 5 tests guard the decode: arrows→nav, space→toggle, enter/return→confirm, escape/ctrl-c→cancel, b→back, q→cancel, unknown→ignored, and that a decoded action drives the reducer. Guards the key-mapping regression class (cf. the past wizard arrow-key bug).
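The decode described above might look something like the sketch below. The `KeySketch` and `ActionSketch` shapes and the function name `actionForSketch` are assumptions for illustration; the real actionFor signature and action names are not shown in this PR text, and only up/down arrows are modeled here.

```typescript
// Hypothetical key and action shapes; the real types may differ.
type KeySketch = { name: string; ctrl?: boolean };
type ActionSketch =
  | { type: "nav"; delta: -1 | 1 }
  | { type: "toggle" }
  | { type: "confirm" }
  | { type: "cancel" }
  | { type: "back" }
  | null; // null means the key is ignored

// Pure mapping from a decoded keypress to a reducer action, kept free of any
// TTY plumbing so it can be unit-tested directly.
function actionForSketch(key: KeySketch): ActionSketch {
  if (key.ctrl === true && key.name === "c") return { type: "cancel" };
  switch (key.name) {
    case "up":
      return { type: "nav", delta: -1 };
    case "down":
      return { type: "nav", delta: 1 };
    case "space":
      return { type: "toggle" };
    case "enter":
    case "return":
      return { type: "confirm" };
    case "escape":
    case "q":
      return { type: "cancel" };
    case "b":
      return { type: "back" };
    default:
      return null;
  }
}
```

Keeping the mapping pure is the design choice that matters: the raw-mode input handling stays untested at this level, but every key-to-action regression is caught by a plain function call.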
…mmands + adversarial hunt
- scripted-model: throw when called past the script (loop didn't terminate) so a misconfigured test fails fast instead of hanging (Gemini review).
- session-e2e: replace Unix-only gate/fix commands (true, test -f, touch, exit) with portable `node -e` equivalents so the suite runs on Windows too (Gemini review).
- add session-e2e-hunt.test.ts: 7 adversarial probes (non-matching/ambiguous edit, create-over-existing, failing gate command, maxTurns exhaustion, malformed tool args, non-zero exit) — all pass, so the loop is robust on these edges (now guarded).
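As a rough illustration of the portable replacements, the sketch below builds `node -e` one-liners that stand in for `true`, `exit N`, `test -f` and `touch`. The helper names (`portableCommandsSketch`, `checkedNameSketch`) are hypothetical and the exact strings used in the suite are not shown in this PR text; file names are restricted to simple characters so that no shell-specific quoting is needed.

```typescript
// Hypothetical helper, not taken from the test suite. Only plain file names
// are accepted so the generated command needs no extra quoting.
const SAFE_NAME_SKETCH = /^[\w.-]+$/;

function checkedNameSketch(file: string): string {
  if (!SAFE_NAME_SKETCH.test(file)) {
    throw new Error(`unsupported file name: ${file}`);
  }
  return file;
}

const portableCommandsSketch = {
  // Stand-in for `true`: always exits 0.
  succeed: (): string => `node -e "process.exit(0)"`,
  // Stand-in for `exit N`: exits with the given code.
  fail: (code = 1): string => `node -e "process.exit(${code})"`,
  // Stand-in for `test -f FILE`: exits 0 if the path exists, 1 otherwise.
  // Note: existsSync is also true for directories, unlike `test -f`.
  fileExists: (file: string): string =>
    `node -e "process.exit(require('fs').existsSync('${checkedNameSketch(file)}') ? 0 : 1)"`,
  // Stand-in for `touch FILE`: creates the file if missing, keeps its content.
  touch: (file: string): string =>
    `node -e "const fs=require('fs');fs.closeSync(fs.openSync('${checkedNameSketch(file)}','a'))"`,
};
```

The outer double quotes with single-quoted strings inside are chosen because that combination is accepted both by POSIX shells and by cmd.exe.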
Generalizes the editor's VirtualScreen win so the real user experience is replayable in tests — asserting observable behavior (rendered screen, tool runs, file changes, gate verdicts), not just logical state. Goal: catch breakage in CI before a human hits it by hand.
What's covered
Reusable harnesses
- ScriptedModel (tests/helpers/scripted-model.ts) — a deterministic IProvider driving the REAL agent loop from scripted turns; a turn can be a function of the conversation so far, so it reacts to gate feedback / tool results.
- runScriptedSession (tests/helpers/session-harness.ts) — wires it to a real Session (real tools, real gate, temp cwd), captures the ILoopEvent stream + file changes.
- VirtualScreen (shipped in v0.24.0) — renders emitted bytes to a screen grid; reused here for every TTY-render assertion.
Tier 1 — agent session loop (12 tests)
Conversational turn, create→disk, passing gate→done, failing gate→repair→green→done, out-of-scope rejection, run-tool shell exec, edit-tool snippet replace, read round-trip (tool result reaches the model), multi-file build, plan-mode write rejection (the read-only guarantee), plan-mode read allowed, auto-fix runs before re-gate.
Tier 3 — interactive TTY overlays (11 tests)
Wizard rendered at each step (title, cursor gutter, checkbox toggle, overview), command palette (filtered list + selection), @-file picker dropdown, and overlay-shrink → no ghost rows (the same bug class as the editor, now guarded for the picker). These close the render blind spot: the existing wizard tests assert reducer STATE only.
Findings (no duplication added)
- scaffold-run.test.ts (full runScaffold: clone+configure+boot+gate, skipBoot, invalid-config refusal, manifest-source-of-truth, astro). No gap.
- review-change.test.ts drives reviewChange with stub providers (find/verify passes, non-JSON). No gap.
The two genuine gaps were the session-loop e2e and the TTY-render e2e — both filled here.
Stacked off main after v0.24.0.