Roadmap: AI agent layer improvements (context, cost, convergence, breakpoints) #20

@gurvinder-dhillon

This issue tracks improvements specific to OpenQA's AI agent layer — things not already covered by Playwright's native reporting, trace viewer, HTML reports, retry logic, or hook system.

Note: Several originally-proposed ideas (structured audit trail, run IDs, real-time observer, step-level hooks, resumable runs) are already handled by Playwright natively — via HTML reports, Trace Viewer, --ui mode, --last-failed, and the reporters API.

The four items below are genuinely new — they address the LLM-specific concerns Playwright has no concept of.


1. Context Compression for Long Scenarios

Problem: As a scenario grows (20+ steps), the agent's MCP tool call history accumulates in the session. The LLM context window fills up, degrading agent quality on later steps. SessionManager persists sessions but doesn't compress them.

Solution: Add a summarization pass over older tool call results: at step N, compress steps N-5 and earlier into a compact summary and keep the more recent steps verbatim. Inspired by babysitter's 4-layer compression (50–67% context reduction in practice).

Impact: Extends the practical length of test scenarios before context overflow causes agent confusion or hallucination.
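
A minimal sketch of the compression pass described above, assuming the session history is an array with one entry per step. The entry shape (`{ step, tool, result }`) and the names `compressHistory` and `summarizeEntries` are hypothetical and not part of SessionManager. The default summarizer only truncates each result; a real pass would more likely ask the LLM for the summary.

```javascript
// Sketch only: the entry shape ({ step, tool, result }) and both function
// names are hypothetical, not part of OpenQA's SessionManager.

// Stand-in summarizer: one truncated line per older entry.
function summarizeEntries(entries, maxChars = 80) {
  return entries
    .map((e) => {
      const text = String(e.result);
      const short = text.length > maxChars ? text.slice(0, maxChars) + "..." : text;
      return `step ${e.step} (${e.tool}): ${short}`;
    })
    .join("\n");
}

// Keep the last `keepRecent` entries verbatim and fold everything older
// into a single summary entry placed at the front of the history.
function compressHistory(history, { keepRecent = 5, summarize = summarizeEntries } = {}) {
  if (history.length <= keepRecent) return history.slice();
  const cutoff = history.length - keepRecent;
  const older = history.slice(0, cutoff);
  const recent = history.slice(cutoff);
  const summaryEntry = {
    step: `${older[0].step}-${older[older.length - 1].step}`,
    tool: "summary",
    result: summarize(older),
  };
  return [summaryEntry, ...recent];
}
```

Passing a different `summarize` callback would let the same windowing logic be reused with an LLM-backed summarizer.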


2. Token/Cost Tracking Per Step

Problem: claudeCode.js already returns usage when returnUsage: true is set, but it's never accumulated or surfaced. Teams have no visibility into which test steps or scenarios are expensive.

Solution: Accumulate per-step token usage into a cost summary at the end of each scenario. Optionally write it to the Playwright test info as an attachment so it shows up in the HTML report.

Impact: Lets teams identify expensive steps, optimise prompts, and budget AI API costs as the test suite grows.
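
One possible shape for the accumulator, as a rough sketch. `UsageTracker` is a hypothetical helper, and the usage fields (`input_tokens`, `output_tokens`) are an assumption about what claudeCode.js returns. Rates are supplied by the caller, so no prices are built in. `testInfo.attach()` is Playwright's attachment API; the attachment name `ai-usage` is arbitrary.

```javascript
// Sketch only: UsageTracker is hypothetical. The usage shape
// ({ input_tokens, output_tokens }) is assumed; adjust it to whatever
// claudeCode.js actually returns when returnUsage is true.
class UsageTracker {
  constructor() {
    this.steps = [];
  }

  // Record the usage reported for one step; missing fields count as zero.
  record(stepText, usage = {}) {
    this.steps.push({
      step: stepText,
      inputTokens: usage.input_tokens ?? 0,
      outputTokens: usage.output_tokens ?? 0,
    });
  }

  // Per-step rows plus scenario totals. Rates are USD per million tokens
  // and default to zero, so cost stays 0 unless the caller provides them.
  summary({ inputPerMTok = 0, outputPerMTok = 0 } = {}) {
    const steps = this.steps.map((s) => ({
      ...s,
      costUsd: (s.inputTokens * inputPerMTok + s.outputTokens * outputPerMTok) / 1e6,
    }));
    const totals = steps.reduce(
      (acc, s) => ({
        inputTokens: acc.inputTokens + s.inputTokens,
        outputTokens: acc.outputTokens + s.outputTokens,
        costUsd: acc.costUsd + s.costUsd,
      }),
      { inputTokens: 0, outputTokens: 0, costUsd: 0 }
    );
    return { steps, totals };
  }

  // Attach the summary as JSON so it shows up in Playwright's HTML report.
  async attachTo(testInfo, rates) {
    await testInfo.attach("ai-usage", {
      body: JSON.stringify(this.summary(rates), null, 2),
      contentType: "application/json",
    });
  }
}
```

In a test, the tracker would be created per scenario, `record()` called after each step, and `attachTo(testInfo)` called once at the end.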


3. Stricter Convergence Enforcement

Problem: Both providers retry a step in which the agent makes no tool calls up to 2 times, after which the step simply exits. There's no structured failure message and no guidance on why the agent failed to engage with the browser.

Solution: After max retries, throw a structured error that includes what was attempted, how many retries were made, and suggested next steps (e.g. "step may be too ambiguous — try breaking it into two steps"). Modelled on babysitter's BabysitterRuntimeError with suggestions and nextSteps fields.

Impact: Clearer failure modes make debugging faster and prevent silent test passes where the agent narrated an answer without touching the browser.
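
A sketch of what the structured error and retry loop could look like. `ConvergenceError`, `runStepWithConvergence`, and the result shape (`{ toolCalls, text }`) are hypothetical names chosen for illustration; only the `suggestions` and `nextSteps` fields come from the description above. The loop assumes one initial attempt plus up to 2 retries.

```javascript
// Sketch only: class, function, and result shape are hypothetical.
class ConvergenceError extends Error {
  constructor({ step, attempts, attempted = [], suggestions = [], nextSteps = [] }) {
    super(`Agent made no tool calls for step "${step}" after ${attempts} attempts`);
    this.name = "ConvergenceError";
    this.step = step;
    this.attempts = attempts;
    this.attempted = attempted; // what the agent said on each failed attempt
    this.suggestions = suggestions;
    this.nextSteps = nextSteps;
  }
}

// Run a step, retrying while the agent responds without calling any tool.
// `runStep` is assumed to resolve to { toolCalls: [...], text: "..." }.
async function runStepWithConvergence(step, runStep, { maxRetries = 2 } = {}) {
  const attempted = [];
  const maxAttempts = maxRetries + 1; // initial attempt plus retries
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await runStep(step, attempt);
    if (result.toolCalls.length > 0) return result;
    attempted.push(result.text ?? "");
  }
  throw new ConvergenceError({
    step,
    attempts: maxAttempts,
    attempted,
    suggestions: ["step may be too ambiguous: try breaking it into two steps"],
    nextSteps: ["rewrite the step to name a concrete browser action"],
  });
}
```

Throwing instead of returning means a step that never touches the browser fails the test rather than passing silently.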


4. Human Breakpoints in Feature Files

Problem: There is no way to pause a running test for human review before an irreversible action. Playwright's page.pause() exists, but it is a dev-only debugging tool and not suitable for a production QA gate.

Solution: Support a special step syntax that pauses the scenario and waits for explicit human approval before continuing:

* Navigate to checkout
* Enter payment details
* [BREAKPOINT] Confirm order total before proceeding
* Click confirm order

Impact: Enables QA sign-off gates for critical paths (payments, deletions, releases) without needing a separate approval workflow.
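
A sketch of how the `[BREAKPOINT]` syntax could be parsed and enforced. `parseStep`, `runScenario`, `executeStep`, and `requestApproval` are hypothetical names. The sketch assumes a breakpoint line is a pause message only and is not sent to the agent, and that a rejected approval aborts the scenario; how approval is actually collected (CLI prompt, chat message, web UI) is left to the `requestApproval` callback.

```javascript
// Sketch only: all names here are hypothetical.
const BREAKPOINT = /^\[BREAKPOINT\]\s*/;

// Strip the leading "* " bullet and detect the [BREAKPOINT] marker.
function parseStep(line) {
  const text = line.replace(/^\s*\*\s*/, "").trim();
  const isBreakpoint = BREAKPOINT.test(text);
  return { isBreakpoint, text: isBreakpoint ? text.replace(BREAKPOINT, "") : text };
}

// Execute steps in order; at a breakpoint, wait for approval and
// abort the scenario if the reviewer rejects it.
async function runScenario(lines, { executeStep, requestApproval }) {
  for (const line of lines) {
    const step = parseStep(line);
    if (step.isBreakpoint) {
      const approved = await requestApproval(step.text);
      if (!approved) throw new Error(`Breakpoint rejected: ${step.text}`);
      continue; // assumption: the breakpoint line itself is not executed
    }
    await executeStep(step.text);
  }
}
```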
