Skip to content

Pin competitor smoke provenance before live runs#1339

Merged
shaun0927 merged 1 commit into
mainfrom
benchmark/pr2-competitor-smoke-pins
May 18, 2026
Merged

Pin competitor smoke provenance before live runs#1339
shaun0927 merged 1 commit into
mainfrom
benchmark/pr2-competitor-smoke-pins

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

Summary

  • strengthens bench:competitor-smoke to capture concrete competitor versions and version sources
  • records dependency availability plus explicit skip categories (not_requested, dependency_missing, etc.) per row
  • updates benchmark/COMPETITORS.md from placeholder pins to the current smoke/runtime registry pins
  • refreshes competitor-smoke.json and readiness artifacts; stale artifact count drops because the smoke artifact now pins OpenChrome 1.12.4

Stack

Scope boundaries

In scope:

  • competitor smoke/version provenance
  • dependency detection and visible diagnostic skips
  • smoke result artifact refresh

Out of scope:

  • full --include-live Chrome/CDP execution
  • native browser-use/playwright-mcp LLM loops
  • headline comparisons
  • OpenChrome product/core changes

Verification

  • npm run bench:competitor-smoke
  • npm test -- --runTestsByPath tests/benchmark/run-competitor-smoke.test.ts tests/benchmark/adapters/browser-use-adapter.test.ts tests/benchmark/adapters/playwright-mcp-adapter.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand
  • npm run bench:readiness
  • npm run build
  • git diff --check

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1ad1b79c2a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

expect(playwright?.skipCategory).toBe('not_requested');
expect(playwright?.version).toMatch(/^[0-9]+\.[0-9]+\.[0-9]+/);
expect(playwright?.versionPinned).toBe(true);
expect(rows.find((row) => row.library === 'OpenChrome')?.version).toBe('1.12.4');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Read OpenChrome version dynamically in smoke test

This assertion hard-codes 1.12.4, so the test will fail on every package version bump even when the smoke matrix behavior is correct. That creates avoidable CI breakage and recurring maintenance churn unrelated to regressions; the expected value should come from package.json (or readRepoVersion()) instead of a fixed literal.

Useful? React with 👍 / 👎.


const REGISTRY_PINNED_VERSIONS: Readonly<Record<string, string>> = {
OpenChrome: readRepoVersion(),
Playwright: '1.49.0',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Update Playwright fallback pin to current registry version

The fallback registry map still pins Playwright to 1.49.0, but this change set updates the benchmark registry/docs to 1.60.0. When Playwright is not resolvable in the runtime, the smoke output will emit the stale version while versionPinned remains true, producing incorrect provenance in diagnostic artifacts.

Useful? React with 👍 / 👎.

Make the competitor smoke matrix report concrete runtime versions, dependency availability, and explicit skip categories so later benchmark axes can distinguish unavailable competitors from measured failures.

Constraint: Competitor support must remain benchmark-harness only and must invoke real external packages or surface skips.

Rejected: Keeping operator-pinned-runtime placeholders in smoke results | it hides whether the local benchmark runtime can actually resolve each competitor.

Confidence: high

Scope-risk: narrow

Directive: Treat not-requested and dependency-missing competitor rows as visible diagnostic skips, never as zero-score measured rows.

Tested: npm run bench:competitor-smoke

Tested: npm test -- --runTestsByPath tests/benchmark/run-competitor-smoke.test.ts tests/benchmark/adapters/browser-use-adapter.test.ts tests/benchmark/adapters/playwright-mcp-adapter.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand

Tested: npm run bench:readiness

Tested: npm run build

Tested: git diff --check

Not-tested: --include-live browser/CDP execution; browser-use installed Python environment; paid LLM paths

Co-authored-by: OmX <omx@oh-my-codex.dev>
@shaun0927 shaun0927 force-pushed the benchmark/pr2-competitor-smoke-pins branch from 1ad1b79 to 343ca15 Compare May 18, 2026 03:20
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 343ca15d39

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


const REGISTRY_PINNED_VERSIONS: Readonly<Record<string, string>> = {
OpenChrome: readRepoVersion(),
Playwright: '1.49.0',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Align Playwright fallback pin with benchmark registry

When playwright cannot be resolved, the smoke runner falls back to this hard-coded version, but this commit updates the benchmark registry/docs to 1.60.0. In that dependency-missing path, generated artifacts will report an outdated Playwright version while still marking versionPinned: true, which makes provenance data inaccurate and can mislead readiness/debugging decisions.

Useful? React with 👍 / 👎.

expect(playwright?.skipCategory).toBe('not_requested');
expect(playwright?.version).toMatch(/^[0-9]+\.[0-9]+\.[0-9]+/);
expect(playwright?.versionPinned).toBe(true);
expect(rows.find((row) => row.library === 'OpenChrome')?.version).toBe('1.12.4');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Derive OpenChrome smoke assertion from package version

This assertion hard-codes 1.12.4, so a normal package version bump will fail the smoke test even when the matrix behavior is correct. That creates avoidable CI churn unrelated to functional regressions; the expected value should come from the repo version source used by the runner.

Useful? React with 👍 / 👎.

@shaun0927 shaun0927 changed the base branch from benchmark/pr1-contract-hardening to main May 18, 2026 03:28
@shaun0927 shaun0927 merged commit 6963ed4 into main May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant