Roadmap: AI agent layer improvements (context, cost, convergence, breakpoints) #20

This issue tracks improvements specific to OpenQA's **AI agent layer** — things not already covered by Playwright's native reporting, trace viewer, HTML reports, retry logic, or hook system.

> **Note:** Several originally-proposed ideas (structured audit trail, run IDs, real-time observer, step-level hooks, resumable runs) are already handled by Playwright natively — via HTML reports, Trace Viewer, `--ui` mode, `--last-failed`, and the reporters API.

The four items below are genuinely new — they address the LLM-specific concerns Playwright has no concept of.

---

### 1. Context Compression for Long Scenarios

**Problem:** As a scenario grows (20+ steps), the agent's MCP tool call history accumulates in the session. The LLM context window fills up, degrading agent quality on later steps. `SessionManager` persists sessions but doesn't compress them.

**Solution:** Add a summarization pass over older tool call results: compress steps N-5 and earlier (where N is the current step) into a compact summary, and keep the more recent steps verbatim. Inspired by babysitter's 4-layer compression (50–67% context reduction in practice).

**Impact:** Extends the practical length of test scenarios before context overflow causes agent confusion or hallucination.
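To make the proposal concrete, here is a minimal sketch of such a pass. Every name in it (`compressHistory`, `summarizeEntry`, the `{ step, tool, result }` entry shape) is hypothetical and not taken from `SessionManager`. The summarizer simply truncates text, where a real pass would probably ask the LLM for a summary.

```javascript
// Hypothetical sketch: none of these names come from OpenQA's codebase.
// Assumes each history entry looks like { step, tool, result }.

const KEEP_VERBATIM = 5; // number of most recent steps kept verbatim

// Placeholder summarizer: truncates the result text. A real pass would
// likely ask the LLM to produce the summary instead.
function summarizeEntry(entry, maxChars = 80) {
  const text = String(entry.result);
  const short = text.length > maxChars ? text.slice(0, maxChars) + '...' : text;
  return `step ${entry.step} (${entry.tool}): ${short}`;
}

// Collapses every entry from step N-5 and earlier into one summary entry,
// where N is currentStep. Entries from later steps pass through untouched.
function compressHistory(history, currentStep, summarize = summarizeEntry) {
  const cutoff = currentStep - KEEP_VERBATIM;
  const older = history.filter((e) => e.step <= cutoff);
  const recent = history.filter((e) => e.step > cutoff);
  if (older.length === 0) return recent;

  const summary = {
    step: cutoff,
    tool: 'summary',
    result: older.map((e) => summarize(e)).join('\n'),
  };
  return [summary, ...recent];
}
```

Keeping the summarizer as an injectable parameter means the cheap truncation used here could later be swapped for an LLM-backed one without touching the windowing logic.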

---

### 2. Token/Cost Tracking Per Step

**Problem:** `claudeCode.js` already returns `usage` when `returnUsage: true` is set, but it's never accumulated or surfaced. Teams have no visibility into which test steps or scenarios are expensive.

**Solution:** Accumulate per-step token usage into a cost summary at the end of each scenario. Optionally write it to the Playwright test info as an attachment so it shows up in the HTML report.

**Impact:** Lets teams identify expensive steps, optimise prompts, and budget AI API costs as the test suite grows.
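One possible shape for the accumulator is sketched below. It assumes the `usage` object exposes `input_tokens` and `output_tokens` fields, which this issue does not confirm. `UsageTracker` is a hypothetical name, and the prices are placeholders rather than real rates.

```javascript
// Hypothetical sketch. Assumption: each step's usage object carries
// input_tokens and output_tokens; the real shape may differ.
// Prices are made-up placeholders in USD per million tokens.
const PRICE_PER_MTOK = { input: 3, output: 15 };

class UsageTracker {
  constructor() {
    this.steps = [];
  }

  // Records one step's usage; missing usage is counted as zero tokens.
  record(stepText, usage) {
    this.steps.push({
      step: stepText,
      inputTokens: usage?.input_tokens ?? 0,
      outputTokens: usage?.output_tokens ?? 0,
    });
  }

  // Returns the per-step rows plus scenario totals and an estimated cost.
  summary() {
    const inputTokens = this.steps.reduce((n, s) => n + s.inputTokens, 0);
    const outputTokens = this.steps.reduce((n, s) => n + s.outputTokens, 0);
    const estimatedCostUsd =
      (inputTokens * PRICE_PER_MTOK.input + outputTokens * PRICE_PER_MTOK.output) / 1e6;
    return {
      steps: this.steps,
      totals: { inputTokens, outputTokens, estimatedCostUsd },
    };
  }
}
```

Inside a Playwright test, the summary could then be attached with `testInfo.attach('token-usage', { body: JSON.stringify(tracker.summary(), null, 2), contentType: 'application/json' })`, which should make it appear alongside the other attachments in the HTML report.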

---

### 3. Stricter Convergence Enforcement

**Problem:** Both providers retry zero-tool-call steps up to 2 times, then the step exits. There's no structured failure message and no guidance on why the agent failed to engage with the browser.

**Solution:** After max retries, throw a structured error that includes what was attempted, how many retries were made, and suggested next steps (e.g. "step may be too ambiguous — try breaking it into two steps"). Modelled on babysitter's `BabysitterRuntimeError` with `suggestions` and `nextSteps` fields.

**Impact:** Clearer failure modes make debugging faster and prevent silent test passes where the agent narrated an answer without touching the browser.
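A rough sketch of what the structured error and retry loop might look like follows. `ConvergenceError`, `runWithConvergence`, and the injected `runStep` callback (assumed to resolve to `{ toolCalls: [...] }`) are all hypothetical names, and the field layout only loosely mirrors the `suggestions` / `nextSteps` idea described above.

```javascript
// Hypothetical sketch: ConvergenceError and runWithConvergence are
// illustrative names, not existing OpenQA or babysitter APIs.
class ConvergenceError extends Error {
  constructor({ step, retries, attempted, suggestions = [], nextSteps = [] }) {
    super(`Agent made no tool calls for step "${step}" after ${retries} retries`);
    this.name = 'ConvergenceError';
    this.step = step;
    this.retries = retries;
    this.attempted = attempted;
    this.suggestions = suggestions;
    this.nextSteps = nextSteps;
  }
}

// Runs a step once, then retries up to maxRetries times while the agent
// makes zero tool calls. runStep is injected and is assumed to resolve to
// an object with a toolCalls array.
async function runWithConvergence(step, runStep, maxRetries = 2) {
  const attempted = [];
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = await runStep(step, attempt);
    if (result.toolCalls.length > 0) return result;
    attempted.push(`attempt ${attempt + 1}: no tool calls`);
  }
  throw new ConvergenceError({
    step,
    retries: maxRetries,
    attempted,
    suggestions: ['step may be too ambiguous; try breaking it into two steps'],
    nextSteps: ['rewrite the step to name a concrete browser action'],
  });
}
```

Because the loop throws instead of returning, a step that never touches the browser fails the test rather than passing silently.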

---

### 4. Human Breakpoints in Feature Files

**Problem:** There is no way to pause a running test for human review before an irreversible action. Playwright's `page.pause()` exists, but it is a dev-only debugging tool and not suitable as a production QA gate.

**Solution:** Support a special step syntax that pauses the scenario and waits for explicit human approval before continuing:

```gherkin
* Navigate to checkout
* Enter payment details
* [BREAKPOINT] Confirm order total before proceeding
* Click confirm order
```

**Impact:** Enables QA sign-off gates for critical paths (payments, deletions, releases) without needing a separate approval workflow.
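As an illustration of how the `[BREAKPOINT]` marker could be handled, the sketch below parses step lines and defers to an injected approval callback. `parseStep`, `runScenario`, `execute`, and `approve` are hypothetical names. How approval is actually collected (CLI prompt, web UI, chat message) is left open.

```javascript
// Hypothetical sketch: parseStep and runScenario are illustrative names.
const BREAKPOINT_RE = /^\*\s*\[BREAKPOINT\]\s*(.*)$/;

// Classifies one feature-file line as a breakpoint or an ordinary action.
function parseStep(line) {
  const trimmed = line.trim();
  const match = BREAKPOINT_RE.exec(trimmed);
  if (match) return { type: 'breakpoint', text: match[1] };
  return { type: 'action', text: trimmed.replace(/^\*\s*/, '') };
}

// Runs steps in order. At a breakpoint it waits for approve() to resolve;
// a falsy answer aborts the scenario before the next action runs.
// execute and approve are injected async callbacks.
async function runScenario(lines, { execute, approve }) {
  for (const line of lines) {
    const step = parseStep(line);
    if (step.type === 'breakpoint') {
      const ok = await approve(step.text);
      if (!ok) throw new Error(`Breakpoint rejected: ${step.text}`);
      continue;
    }
    await execute(step.text);
  }
}
```

Injecting `approve` keeps the gate independent of any particular approval channel, so the same runner could be driven by a terminal prompt locally and by a CI approval step in a pipeline.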
