Protect benchmark claims before expanding measurements by shaun0927 · Pull Request #1338 · shaun0927/openchrome

shaun0927 · 2026-05-17T15:58:58Z

Summary

validates the benchmark work as an 8-PR dependency ladder and records the plan in docs/benchmarks/benchmark-pr-plan.md
adds stale OpenChrome result-artifact detection to the readiness audit so old benchmark/results/*.json files remain visibly diagnostic after releases
adds shared benchmark result-status vocabulary and blocks diagnostic statuses (skipped, dependency_missing, not_wired, dry_run, mock, scaffold, diagnostic) from headline report rows
refreshes readiness artifacts with the new PR scopes and stale artifact table
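
The shared status vocabulary described above could look roughly like the following. This is a minimal sketch, not the code from this PR: only the seven status strings come from the summary, while `ReportRow`, `isDiagnosticStatus`, `headlineRows`, and the `ok` status used in the usage note are hypothetical names chosen for illustration.

```typescript
// Only these seven strings are taken from the PR summary; everything else is illustrative.
const DIAGNOSTIC_STATUS_VOCABULARY = [
  'skipped',
  'dependency_missing',
  'not_wired',
  'dry_run',
  'mock',
  'scaffold',
  'diagnostic',
] as const;

type DiagnosticStatus = (typeof DIAGNOSTIC_STATUS_VOCABULARY)[number];

// Hypothetical row shape for a benchmark report.
interface ReportRow {
  id: string;
  status: string;
}

// Type guard: true when the status belongs to the diagnostic vocabulary.
function isDiagnosticStatus(status: string): status is DiagnosticStatus {
  return (DIAGNOSTIC_STATUS_VOCABULARY as readonly string[]).indexOf(status) !== -1;
}

// Keep only rows whose status is not diagnostic; diagnostic rows never reach the headline table.
function headlineRows(rows: ReportRow[]): ReportRow[] {
  return rows.filter((row) => !isDiagnosticStatus(row.status));
}
```

With rows `[{ id: 'a', status: 'ok' }, { id: 'b', status: 'mock' }]`, `headlineRows` would keep only row `a`. Keeping the vocabulary in one exported constant is what lets the audit and the report gate agree on which statuses are diagnostic.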

Scope boundaries

In scope for this PR:

benchmark harness/reporting contracts
claim/headline safety gates
stale committed result visibility
PR scope validation and ordering

Out of scope:

real LLM execution
competitor native loop implementation
new live benchmark measurements
OpenChrome product/core behavior changes

Verification

npm test -- --runTestsByPath tests/benchmark/benchmark-readiness.test.ts tests/benchmark/utils/artifact-freshness.test.ts tests/benchmark/utils/result-envelope.test.ts tests/benchmark/episode-harness/claim-eligibility.test.ts --runInBand
node benchmark/claim-eligibility.test.mjs
node benchmark/headline-gate.test.mjs
npm run bench:readiness
npm run build
npm run bench:api-key-readiness exits 1 as expected while non-key blockers remain
git diff --check

Notes

This is PR1 of the validated benchmark plan. It intentionally does not make any open benchmark issue headline-ready; it makes the remaining work harder to misreport by surfacing stale artifacts and fail-closing diagnostic rows before later measurement PRs add live/recorded-real evidence.

Define the first benchmark PR slice around shared row status semantics, stale artifact visibility, and fail-closed headline gates so later measurement work cannot promote diagnostic evidence by accident.

Constraint: Benchmark work must stay in benchmark harness/docs and avoid OpenChrome product-core behavior changes.
Rejected: Implementing live competitor or LLM loops in this PR | those depend on shared claim semantics and belong to later scoped PRs.
Confidence: high
Scope-risk: narrow
Directive: Keep mock, scaffold, dry-run, dependency-missing, and stale-result evidence diagnostic until explicit live/recorded-real eligibility gates pass.
Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-readiness.test.ts tests/benchmark/utils/artifact-freshness.test.ts tests/benchmark/utils/result-envelope.test.ts tests/benchmark/episode-harness/claim-eligibility.test.ts --runInBand
Tested: node benchmark/claim-eligibility.test.mjs; node benchmark/headline-gate.test.mjs
Tested: npm run bench:readiness
Tested: npm run build
Tested: npm run bench:api-key-readiness exits 1 as expected
Tested: git diff --check
Not-tested: Full live benchmark execution; competitor native loops; paid LLM API paths
Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 43e6bb9d62

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T16:02:01Z

+      continue;
+    }
+    const foundVersions = findOpenChromeVersionPins(parsed).filter(
+      (version) => version !== currentOpenChromeVersion && !/^(unknown|operator-pinned-runtime|idiomatic-script-only|TBD)/i.test(version),


Compare versions semantically before marking artifacts stale

The freshness audit currently treats any OpenChrome version mismatch as stale via version !== currentOpenChromeVersion, but the report text says only artifacts with older pins should be flagged. In practice, this will misclassify artifacts generated by a newer OpenChrome version (for example, when auditing an older release branch) as stale, inflating staleResultArtifactCount and keeping valid artifacts in diagnostic status. Please compare versions by semver ordering (older-than current) instead of raw inequality.
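
A minimal sketch of the older-than comparison the review asks for, assuming plain `major.minor.patch` pins with an optional `v` prefix. The helper names `parseCore`, `isOlderThan`, and `stalePins` are hypothetical and are not part of the PR. Prerelease and build suffixes are ignored here, and unparseable pins such as `unknown` or `TBD` are simply not flagged; a real implementation would more likely delegate to the `semver` package's `lt` function.

```typescript
// Parse "1.2.3" or "v1.2.3" into numeric parts; returns null for anything else.
// Any suffix after the patch number (for example "-beta.1") is ignored in this sketch.
function parseCore(version: string): [number, number, number] | null {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version.trim());
  if (!match) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// True only when `pinned` is strictly older than `current`.
function isOlderThan(pinned: string, current: string): boolean {
  const a = parseCore(pinned);
  const b = parseCore(current);
  if (!a || !b) return false; // assumption: unparseable pins are not treated as stale
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] < b[i];
  }
  return false; // equal versions are not stale
}

// Filter a list of version pins down to the ones older than the current version.
function stalePins(pins: string[], current: string): string[] {
  return pins.filter((pin) => isOlderThan(pin, current));
}
```

Under these assumptions, `stalePins(['1.9.0', '1.10.1', 'unknown'], '1.10.0')` returns `['1.9.0']`: the newer pin and the placeholder are left alone, and `1.9.0` is compared numerically rather than as a string.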


Keep the large progress-log stress test as a regression guard without failing on small scheduler jitter over the old 30s wall-clock cutoff.

Constraint: Merge readiness requires P0/P1 blockers to be removed across the benchmark PR stack.
Rejected: Removing the stress assertion | it still guards pathological large-log regressions.
Confidence: high
Scope-risk: narrow
Directive: Wall-clock stress assertions must include CI/local load headroom and should not encode a near-observed runtime as the limit.
Tested: npm test -- --runTestsByPath tests/orchestration/stress/large-data.test.ts --runInBand; npm run build
Not-tested: Full Jest suite after this isolated flaky-threshold fix.
Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56e10fd1bb


chatgpt-codex-connector · 2026-05-18T03:24:34Z

 */

 const HEADLINE_MODES = new Set(['live', 'live-llm', 'recorded-real']);
+const DIAGNOSTIC_STATUSES = new Set(['skipped', 'dependency_missing', 'not_wired', 'dry_run', 'mock', 'scaffold', 'diagnostic']);


Treat failed/partial result statuses as diagnostic-only

The new headline gate blocks several diagnostic statuses but omits failed and partial, so partitionHeadlineResults will currently accept a row as headline-eligible when it has status: "failed" (or "partial") as long as claimEligibility.eligible is true and postcondition evidence exists. This undermines the fail-closed status gate added in this commit and is inconsistent with the shared benchmark status vocabulary (DIAGNOSTIC_RESULT_STATUSES) that classifies those states as diagnostic.
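
One way the gate could treat `failed` and `partial` as diagnostic-only is sketched below. `HEADLINE_MODES`, the seven original statuses, and `claimEligibility.eligible` come from the diff and the comment above; the `ResultRow` shape, the `mode` and `postconditionEvidence` field names, and the function names are assumptions for illustration and do not reproduce the real `partitionHeadlineResults`.

```typescript
const HEADLINE_MODES = ['live', 'live-llm', 'recorded-real'];
const DIAGNOSTIC_STATUSES = [
  'skipped', 'dependency_missing', 'not_wired', 'dry_run', 'mock', 'scaffold', 'diagnostic',
  'failed', 'partial', // the two statuses the review asks to add
];

// Hypothetical row shape; field names other than claimEligibility.eligible are assumed.
interface ResultRow {
  id: string;
  mode: string;
  status: string;
  claimEligibility?: { eligible: boolean };
  postconditionEvidence?: string[];
}

// A row is headline-eligible only if every check passes (fail-closed).
function isHeadlineEligible(row: ResultRow): boolean {
  if (HEADLINE_MODES.indexOf(row.mode) === -1) return false;
  if (DIAGNOSTIC_STATUSES.indexOf(row.status) !== -1) return false;
  if (!row.claimEligibility || row.claimEligibility.eligible !== true) return false;
  return row.postconditionEvidence !== undefined && row.postconditionEvidence.length > 0;
}

// Partition rows into headline and diagnostic buckets without dropping any.
function partitionHeadlineResultsSketch(rows: ResultRow[]): { headline: ResultRow[]; diagnostic: ResultRow[] } {
  const headline: ResultRow[] = [];
  const diagnostic: ResultRow[] = [];
  for (const row of rows) {
    if (isHeadlineEligible(row)) {
      headline.push(row);
    } else {
      diagnostic.push(row);
    }
  }
  return { headline, diagnostic };
}
```

In this sketch a row with `status: 'failed'` lands in `diagnostic` even when `claimEligibility.eligible` is true and evidence exists, which is the behavior the review comment argues for.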


chatgpt-codex-connector Bot reviewed May 17, 2026


shaun0927 mentioned this pull request May 17, 2026 in "Pin competitor smoke provenance before live runs" #1339 (Merged, 5 tasks)

chatgpt-codex-connector Bot reviewed May 18, 2026


shaun0927 merged commit 9677fe6 into main May 18, 2026
18 checks passed

shaun0927 merged 2 commits into main from benchmark/pr1-contract-hardening
