Pin competitor smoke provenance before live runs#1339
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1ad1b79c2a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| expect(playwright?.skipCategory).toBe('not_requested'); | ||
| expect(playwright?.version).toMatch(/^[0-9]+\.[0-9]+\.[0-9]+/); | ||
| expect(playwright?.versionPinned).toBe(true); | ||
| expect(rows.find((row) => row.library === 'OpenChrome')?.version).toBe('1.12.4'); |
There was a problem hiding this comment.
Read OpenChrome version dynamically in smoke test
This assertion hard-codes 1.12.4, so the test will fail on every package version bump even when the smoke matrix behavior is correct. That creates avoidable CI breakage and recurring maintenance churn unrelated to regressions; the expected value should come from package.json (or readRepoVersion()) instead of a fixed literal.
Useful? React with 👍 / 👎.
|
|
||
| const REGISTRY_PINNED_VERSIONS: Readonly<Record<string, string>> = { | ||
| OpenChrome: readRepoVersion(), | ||
| Playwright: '1.49.0', |
There was a problem hiding this comment.
Update Playwright fallback pin to current registry version
The fallback registry map still pins Playwright to 1.49.0, but this change set updates the benchmark registry/docs to 1.60.0. When Playwright is not resolvable in the runtime, the smoke output will emit the stale version while versionPinned remains true, producing incorrect provenance in diagnostic artifacts.
Useful? React with 👍 / 👎.
Make the competitor smoke matrix report concrete runtime versions, dependency availability, and explicit skip categories so later benchmark axes can distinguish unavailable competitors from measured failures. Constraint: Competitor support must remain benchmark-harness only and must invoke real external packages or surface skips. Rejected: Keeping operator-pinned-runtime placeholders in smoke results | it hides whether the local benchmark runtime can actually resolve each competitor. Confidence: high Scope-risk: narrow Directive: Treat not-requested and dependency-missing competitor rows as visible diagnostic skips, never as zero-score measured rows. Tested: npm run bench:competitor-smoke Tested: npm test -- --runTestsByPath tests/benchmark/run-competitor-smoke.test.ts tests/benchmark/adapters/browser-use-adapter.test.ts tests/benchmark/adapters/playwright-mcp-adapter.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand Tested: npm run bench:readiness Tested: npm run build Tested: git diff --check Not-tested: --include-live browser/CDP execution; browser-use installed Python environment; paid LLM paths Co-authored-by: OmX <omx@oh-my-codex.dev>
1ad1b79 to
343ca15
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 343ca15d39
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| const REGISTRY_PINNED_VERSIONS: Readonly<Record<string, string>> = { | ||
| OpenChrome: readRepoVersion(), | ||
| Playwright: '1.49.0', |
There was a problem hiding this comment.
Align Playwright fallback pin with benchmark registry
When playwright cannot be resolved, the smoke runner falls back to this hard-coded version, but this commit updates the benchmark registry/docs to 1.60.0. In that dependency-missing path, generated artifacts will report an outdated Playwright version while still marking versionPinned: true, which makes provenance data inaccurate and can mislead readiness/debugging decisions.
Useful? React with 👍 / 👎.
| expect(playwright?.skipCategory).toBe('not_requested'); | ||
| expect(playwright?.version).toMatch(/^[0-9]+\.[0-9]+\.[0-9]+/); | ||
| expect(playwright?.versionPinned).toBe(true); | ||
| expect(rows.find((row) => row.library === 'OpenChrome')?.version).toBe('1.12.4'); |
There was a problem hiding this comment.
Derive OpenChrome smoke assertion from package version
This assertion hard-codes 1.12.4, so a normal package version bump will fail the smoke test even when the matrix behavior is correct. That creates avoidable CI churn unrelated to functional regressions; the expected value should come from the repo version source used by the runner.
Useful? React with 👍 / 👎.
Summary
bench:competitor-smoketo capture concrete competitor versions and version sourcesnot_requested,dependency_missing, etc.) per rowbenchmark/COMPETITORS.mdfrom placeholder pins to the current smoke/runtime registry pinscompetitor-smoke.jsonand readiness artifacts; stale artifact count drops because the smoke artifact now pins OpenChrome1.12.4Stack
benchmark/pr1-contract-hardening)Scope boundaries
In scope:
Out of scope:
--include-liveChrome/CDP executionVerification
npm run bench:competitor-smokenpm test -- --runTestsByPath tests/benchmark/run-competitor-smoke.test.ts tests/benchmark/adapters/browser-use-adapter.test.ts tests/benchmark/adapters/playwright-mcp-adapter.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBandnpm run bench:readinessnpm run buildgit diff --check