Route native competitors without OpenChrome fallback#1343
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ffbe627fc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| competitorPin: 'playwright-mcp (native MCP)', | ||
| nativeLoopWired: false, | ||
| note: 'playwright-mcp MCP server driven by the same Claude tool-calling loop. Loop wiring lands next session.', | ||
| nativeLoopWired: true, |
There was a problem hiding this comment.
Keep competitor nativeLoopWired false until dispatch exists
Marking playwright-mcp as nativeLoopWired: true removes the live-run guard in main() (tests/benchmark/webvoyager/runner.ts) even though execution is still not routed by library: runTask() branches only on adapter and never consumes opts.library/nativeExecution. In this state, --library playwright-mcp (and similarly browser-use) can proceed through the OpenChrome/LLM adapter path, producing mislabeled competitor rows instead of true external-native execution and violating the no-fallback contract.
Useful? React with 👍 / 👎.
Mark playwright-mcp and browser-use as explicit external native execution targets with setup requirements, dry-run cost projections, version-pinned smoke rows, and no-fallback routing metadata.\n\nConstraint: External competitors must use their own MCP/Python bridge surfaces and must not silently reuse OpenChrome behavior.\nRejected: Keep native competitors marked unwired | That blocks dry-run planning and hides dependency-only setup errors from the benchmark matrix.\nConfidence: high\nScope-risk: narrow\nDirective: Passive rows remain secondary; future live rows must keep nativeExecution and forbidsOpenChromeFallback metadata intact.\nTested: npm run bench:competitor-smoke; npm run bench:webvoyager:real -- --library playwright-mcp --mode native --dry-run; npm run bench:webvoyager:real -- --library browser-use --mode native --dry-run; npm test -- --runTestsByPath tests/benchmark/adapters/playwright-mcp-adapter.test.ts tests/benchmark/adapters/browser-use-adapter.test.ts tests/benchmark/webvoyager/llm/library-routing.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand; npm run bench:readiness; npm run build; git diff --check\nNot-tested: Operator-run live native rows with provider credentials.\n\nCo-authored-by: OmX <omx@oh-my-codex.dev>
542ab43 to
d79103e
Compare
5ffbe62 to
cf9a0ee
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cf9a0ee960
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| competitorPin: 'playwright-mcp (native MCP)', | ||
| nativeLoopWired: false, | ||
| note: 'playwright-mcp MCP server driven by the same Claude tool-calling loop. Loop wiring lands next session.', | ||
| nativeLoopWired: true, |
There was a problem hiding this comment.
Keep nativeLoopWired false until library dispatch exists
Setting nativeLoopWired to true for playwright-mcp lets non-dry runs pass the live library gate, but execution is still chosen only by adapter in runTask and never by opts.library/nativeExecution (tests/benchmark/webvoyager/runner.ts), so --library playwright-mcp can produce mislabeled rows without actually running the external competitor path. This violates the no-fallback guarantee the commit claims to enforce and can corrupt benchmark comparisons.
Useful? React with 👍 / 👎.
| status: 'partial', | ||
| measurementReadiness: 'diagnostic_or_smoke_only', |
There was a problem hiding this comment.
Keep issue #1302 not_ready until native execution is real
This change marks issue #1302 as partial/diagnostic_or_smoke_only, but the runner still does not route by library (it dispatches only by adapter in tests/benchmark/webvoyager/runner.ts), so the underlying capability is not actually implemented yet. Updating readiness here inflates the summary counts and can mislead release/readiness decisions that rely on this report.
Useful? React with 👍 / 👎.
Summary
PR scope validation
In scope for PR6:
Out of scope:
Verification
npm run bench:competitor-smokenpm run bench:webvoyager:real -- --library playwright-mcp --mode native --dry-runnpm run bench:webvoyager:real -- --library browser-use --mode native --dry-runnpm test -- --runTestsByPath tests/benchmark/adapters/playwright-mcp-adapter.test.ts tests/benchmark/adapters/browser-use-adapter.test.ts tests/benchmark/webvoyager/llm/library-routing.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBandnpm run bench:readinessnpm run buildgit diff --checkNotes
Stacked on #1342 /
benchmark/pr5-real-llm-repetitions.Co-authored-by: OmX omx@oh-my-codex.dev