Skip to content

Route native competitors without OpenChrome fallback#1343

Merged
shaun0927 merged 1 commit into
mainfrom
benchmark/pr6-native-competitor-execution
May 18, 2026
Merged

Route native competitors without OpenChrome fallback#1343
shaun0927 merged 1 commit into
mainfrom
benchmark/pr6-native-competitor-execution

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

Summary

  • marks playwright-mcp and browser-use as explicit external native execution targets with setup requirements
  • updates dry-run cost projection to include selected native competitors without making API calls
  • records no-fallback routing metadata so external competitors cannot silently reuse OpenChrome behavior
  • regenerates competitor smoke and readiness artifacts with current version-pinned rows

PR scope validation

In scope for PR6:

Out of scope:

  • reimplementing external competitor behavior
  • OpenChrome product/core changes
  • fault injection
  • full orchestration or paid live provider calls

Verification

  • npm run bench:competitor-smoke
  • npm run bench:webvoyager:real -- --library playwright-mcp --mode native --dry-run
  • npm run bench:webvoyager:real -- --library browser-use --mode native --dry-run
  • npm test -- --runTestsByPath tests/benchmark/adapters/playwright-mcp-adapter.test.ts tests/benchmark/adapters/browser-use-adapter.test.ts tests/benchmark/webvoyager/llm/library-routing.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand
  • npm run bench:readiness
  • npm run build
  • git diff --check

Notes

Stacked on #1342 / benchmark/pr5-real-llm-repetitions.

Co-authored-by: OmX omx@oh-my-codex.dev

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ffbe627fc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

competitorPin: 'playwright-mcp (native MCP)',
nativeLoopWired: false,
note: 'playwright-mcp MCP server driven by the same Claude tool-calling loop. Loop wiring lands next session.',
nativeLoopWired: true,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep competitor nativeLoopWired false until dispatch exists

Marking playwright-mcp as nativeLoopWired: true removes the live-run guard in main() (tests/benchmark/webvoyager/runner.ts) even though execution is still not routed by library: runTask() branches only on adapter and never consumes opts.library/nativeExecution. In this state, --library playwright-mcp (and similarly browser-use) can proceed through the OpenChrome/LLM adapter path, producing mislabeled competitor rows instead of true external-native execution and violating the no-fallback contract.

Useful? React with 👍 / 👎.

Mark playwright-mcp and browser-use as explicit external native execution targets with setup requirements, dry-run cost projections, version-pinned smoke rows, and no-fallback routing metadata.\n\nConstraint: External competitors must use their own MCP/Python bridge surfaces and must not silently reuse OpenChrome behavior.\nRejected: Keep native competitors marked unwired | That blocks dry-run planning and hides dependency-only setup errors from the benchmark matrix.\nConfidence: high\nScope-risk: narrow\nDirective: Passive rows remain secondary; future live rows must keep nativeExecution and forbidsOpenChromeFallback metadata intact.\nTested: npm run bench:competitor-smoke; npm run bench:webvoyager:real -- --library playwright-mcp --mode native --dry-run; npm run bench:webvoyager:real -- --library browser-use --mode native --dry-run; npm test -- --runTestsByPath tests/benchmark/adapters/playwright-mcp-adapter.test.ts tests/benchmark/adapters/browser-use-adapter.test.ts tests/benchmark/webvoyager/llm/library-routing.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand; npm run bench:readiness; npm run build; git diff --check\nNot-tested: Operator-run live native rows with provider credentials.\n\nCo-authored-by: OmX <omx@oh-my-codex.dev>
@shaun0927 shaun0927 force-pushed the benchmark/pr5-real-llm-repetitions branch from 542ab43 to d79103e Compare May 18, 2026 03:20
@shaun0927 shaun0927 force-pushed the benchmark/pr6-native-competitor-execution branch from 5ffbe62 to cf9a0ee Compare May 18, 2026 03:20
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf9a0ee960

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

competitorPin: 'playwright-mcp (native MCP)',
nativeLoopWired: false,
note: 'playwright-mcp MCP server driven by the same Claude tool-calling loop. Loop wiring lands next session.',
nativeLoopWired: true,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep nativeLoopWired false until library dispatch exists

Setting nativeLoopWired to true for playwright-mcp lets non-dry runs pass the live library gate, but execution is still chosen only by adapter in runTask and never by opts.library/nativeExecution (tests/benchmark/webvoyager/runner.ts), so --library playwright-mcp can produce mislabeled rows without actually running the external competitor path. This violates the no-fallback guarantee the commit claims to enforce and can corrupt benchmark comparisons.

Useful? React with 👍 / 👎.

Comment on lines +194 to +195
status: 'partial',
measurementReadiness: 'diagnostic_or_smoke_only',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep issue #1302 not_ready until native execution is real

This change marks issue #1302 as partial/diagnostic_or_smoke_only, but the runner still does not route by library (it dispatches only by adapter in tests/benchmark/webvoyager/runner.ts), so the underlying capability is not actually implemented yet. Updating readiness here inflates the summary counts and can mislead release/readiness decisions that rely on this report.

Useful? React with 👍 / 👎.

@shaun0927 shaun0927 changed the base branch from benchmark/pr5-real-llm-repetitions to main May 18, 2026 03:28
@shaun0927 shaun0927 merged commit a9b733c into main May 18, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant