This issue tracks improvements specific to OpenQA's AI agent layer — things not already covered by Playwright's native reporting, trace viewer, HTML reports, retry logic, or hook system.
Note: Several originally proposed ideas (structured audit trail, run IDs, real-time observer, step-level hooks, resumable runs) are already handled natively by Playwright via HTML reports, Trace Viewer, --ui mode, --last-failed, and the reporters API.
The four items below are genuinely new — they address the LLM-specific concerns Playwright has no concept of.
1. Context Compression for Long Scenarios
Problem: As a scenario grows (20+ steps), the agent's MCP tool call history accumulates in the session. The LLM context window fills up, degrading agent quality on later steps. SessionManager persists sessions but doesn't compress them.
Solution: Add a summarization pass over older tool call results: when the agent is on step N, compress steps N-5 and earlier into a compact summary and keep the more recent steps verbatim. Inspired by babysitter's 4-layer compression (50–67% context reduction in practice).
Impact: Extends the practical length of test scenarios before context overflow causes agent confusion or hallucination.
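One way the summarization pass could look is sketched below. This is a minimal illustration, not the existing SessionManager code: the history entry shape ({ step, tool, result }), the function names, and the truncation-based summarizer are all assumptions, and a real pass would more likely ask the LLM for the summary.

```javascript
// Assumption: each history entry looks like { step, tool, result }.
// The real session format in SessionManager may differ.
const KEEP_RECENT_STEPS = 5; // mirrors the "N-5 and earlier" cut-off

// Placeholder summarizer: keeps one truncated line per old tool call.
// A real implementation would probably call the LLM instead.
function summarizeEntries(entries, maxChars = 80) {
  return entries
    .map((e) => `step ${e.step} ${e.tool}: ${e.result.slice(0, maxChars)}`)
    .join("\n");
}

// Collapses everything at or before step (currentStep - keepRecent) into a
// single summary entry and leaves the more recent entries untouched.
function compressHistory(history, currentStep, keepRecent = KEEP_RECENT_STEPS) {
  const cutoff = currentStep - keepRecent;
  const older = history.filter((e) => e.step <= cutoff);
  const recent = history.filter((e) => e.step > cutoff);
  if (older.length === 0) return history; // nothing old enough to compress
  const summary = { step: cutoff, tool: "summary", result: summarizeEntries(older) };
  return [summary, ...recent];
}

// Usage with illustrative data: eight steps, one tool call each.
const sampleHistory = Array.from({ length: 8 }, (_, i) => ({
  step: i + 1,
  tool: "browser_click",
  result: `clicked element ${i + 1}`,
}));
const compressed = compressHistory(sampleHistory, 8);
console.log(compressed.length); // 6: one summary entry plus steps 4 to 8
```

Keeping the cut-off as a parameter makes it easy to tune how many steps stay verbatim once real context measurements are available.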
2. Token/Cost Tracking Per Step
Problem: claudeCode.js already returns usage when returnUsage: true is set, but it's never accumulated or surfaced. Teams have no visibility into which test steps or scenarios are expensive.
Solution: Accumulate per-step token usage into a cost summary at the end of each scenario. Optionally write it to the Playwright test info as an attachment so it shows up in the HTML report.
Impact: Lets teams identify expensive steps, optimise prompts, and budget AI API costs as the test suite grows.
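A rough sketch of the accumulator follows. The usage shape ({ input_tokens, output_tokens }) and the UsageTracker name are assumptions, since the issue does not show what claudeCode.js actually returns. testInfo.attach() is Playwright's standard attachment call. Converting tokens to a currency figure would need per-model pricing, which is left out here.

```javascript
// Assumption: usage looks like { input_tokens, output_tokens }.
// The object returned by claudeCode.js may use different field names.
class UsageTracker {
  constructor() {
    this.steps = [];
  }

  // Call once per step with the usage object returned for that step.
  record(stepText, usage) {
    this.steps.push({
      step: stepText,
      inputTokens: usage?.input_tokens ?? 0,
      outputTokens: usage?.output_tokens ?? 0,
    });
  }

  // Per-step breakdown plus scenario totals.
  summary() {
    const totals = this.steps.reduce(
      (acc, s) => ({
        inputTokens: acc.inputTokens + s.inputTokens,
        outputTokens: acc.outputTokens + s.outputTokens,
      }),
      { inputTokens: 0, outputTokens: 0 }
    );
    return { steps: this.steps, totals };
  }

  // testInfo is Playwright's TestInfo object; the attachment then appears
  // alongside the test in the HTML report.
  async attachTo(testInfo) {
    await testInfo.attach("token-usage.json", {
      body: JSON.stringify(this.summary(), null, 2),
      contentType: "application/json",
    });
  }
}

// Usage with made-up numbers:
const tracker = new UsageTracker();
tracker.record("Navigate to checkout", { input_tokens: 100, output_tokens: 20 });
tracker.record("Enter payment details", { input_tokens: 50, output_tokens: 5 });
console.log(tracker.summary().totals); // { inputTokens: 150, outputTokens: 25 }
```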
3. Stricter Convergence Enforcement
Problem: Both providers retry zero-tool-call steps up to 2 times, then the step exits. There's no structured failure message and no guidance on why the agent failed to engage with the browser.
Solution: After max retries, throw a structured error that includes what was attempted, how many retries were made, and suggested next steps (e.g. "step may be too ambiguous — try breaking it into two steps"). Modelled on babysitter's BabysitterRuntimeError with suggestions and nextSteps fields.
Impact: Clearer failure modes make debugging faster and prevent silent test passes where the agent narrated an answer without touching the browser.
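The structured error might look something like the sketch below. AgentConvergenceError, runStepWithEnforcement, and the { toolCalls, text } result shape are hypothetical names chosen for illustration. Only the suggestions and nextSteps fields come from the babysitter model mentioned above, and the retry limit of 2 matches the current behaviour described in the problem statement.

```javascript
// Hypothetical error type; only suggestions/nextSteps are taken from the proposal.
class AgentConvergenceError extends Error {
  constructor({ step, retries, attempted, suggestions = [], nextSteps = [] }) {
    super(`Agent made no tool calls for step "${step}" after ${retries} retries`);
    this.name = "AgentConvergenceError";
    this.step = step;
    this.retries = retries;
    this.attempted = attempted; // what the agent said on each attempt
    this.suggestions = suggestions;
    this.nextSteps = nextSteps;
  }
}

// Assumption: runOnce(step, attempt) resolves to { toolCalls: number, text: string }.
async function runStepWithEnforcement(step, runOnce, maxRetries = 2) {
  const attempted = [];
  // One initial attempt plus up to maxRetries retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = await runOnce(step, attempt);
    if (result.toolCalls > 0) return result; // agent engaged with the browser
    attempted.push(result.text ?? "");
  }
  throw new AgentConvergenceError({
    step,
    retries: maxRetries,
    attempted,
    suggestions: ["Step may be too ambiguous; try breaking it into two steps"],
    nextSteps: ["Review the narrated responses in `attempted`", "Rephrase the step and rerun"],
  });
}
```

Throwing instead of returning means a step where the agent only narrated an answer fails the test rather than passing silently.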
4. Human Breakpoints in Feature Files
Problem: There is no way to pause a running test for human review before an irreversible action. Playwright's page.pause() exists, but it is a dev-only debugging tool and not suitable as a production QA gate.
Solution: Support a special step syntax that pauses the scenario and waits for explicit human approval before continuing:
* Navigate to checkout
* Enter payment details
* [BREAKPOINT] Confirm order total before proceeding
* Click confirm order
Impact: Enables QA sign-off gates for critical paths (payments, deletions, releases) without needing a separate approval workflow.
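A minimal sketch of how the runner could recognise and honour the [BREAKPOINT] marker is shown below. parseStep, runScenario, and the requestApproval callback are hypothetical, and how approval is actually collected (CLI prompt, chat message, web UI) is left open.

```javascript
const BREAKPOINT_PREFIX = "[BREAKPOINT]";

// Turns a feature-file line such as "* [BREAKPOINT] Confirm total" into a step object.
function parseStep(line) {
  const text = line.replace(/^\*\s*/, "").trim();
  if (text.startsWith(BREAKPOINT_PREFIX)) {
    return { type: "breakpoint", text: text.slice(BREAKPOINT_PREFIX.length).trim() };
  }
  return { type: "action", text };
}

// Assumptions: runStep(text) executes one agent step, and
// requestApproval(text) resolves to true (approve) or false (reject).
async function runScenario(lines, { runStep, requestApproval }) {
  for (const line of lines) {
    const step = parseStep(line);
    if (step.type === "breakpoint") {
      const approved = await requestApproval(step.text);
      if (!approved) throw new Error(`Breakpoint rejected: ${step.text}`);
      continue; // approved: move on to the next step
    }
    await runStep(step.text);
  }
}
```

Rejecting the breakpoint throws, so the steps after it (here, "Click confirm order") never run and the scenario is reported as failed rather than skipped.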