Protect benchmark claims before expanding measurements#1338
Conversation
Define the first benchmark PR slice around shared row status semantics, stale artifact visibility, and fail-closed headline gates so later measurement work cannot promote diagnostic evidence by accident. Constraint: Benchmark work must stay in benchmark harness/docs and avoid OpenChrome product-core behavior changes. Rejected: Implementing live competitor or LLM loops in this PR | those depend on shared claim semantics and belong to later scoped PRs. Confidence: high Scope-risk: narrow Directive: Keep mock, scaffold, dry-run, dependency-missing, and stale-result evidence diagnostic until explicit live/recorded-real eligibility gates pass. Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-readiness.test.ts tests/benchmark/utils/artifact-freshness.test.ts tests/benchmark/utils/result-envelope.test.ts tests/benchmark/episode-harness/claim-eligibility.test.ts --runInBand Tested: node benchmark/claim-eligibility.test.mjs; node benchmark/headline-gate.test.mjs Tested: npm run bench:readiness Tested: npm run build Tested: npm run bench:api-key-readiness exits 1 as expected Tested: git diff --check Not-tested: Full live benchmark execution; competitor native loops; paid LLM API paths Co-authored-by: OmX <omx@oh-my-codex.dev>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 43e6bb9d62
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| continue; | ||
| } | ||
| const foundVersions = findOpenChromeVersionPins(parsed).filter( | ||
| (version) => version !== currentOpenChromeVersion && !/^(unknown|operator-pinned-runtime|idiomatic-script-only|TBD)/i.test(version), |
There was a problem hiding this comment.
Compare versions semantically before marking artifacts stale
The freshness audit currently treats any OpenChrome version mismatch as stale via version !== currentOpenChromeVersion, but the report text says only artifacts with older pins should be flagged. In practice, this will misclassify artifacts generated by a newer OpenChrome version (for example, when auditing an older release branch) as stale, inflating staleResultArtifactCount and keeping valid artifacts in diagnostic status. Please compare versions by semver ordering (older-than current) instead of raw inequality.
Useful? React with 👍 / 👎.
Keep the large progress-log stress test as a regression guard without failing on small scheduler jitter over the old 30s wall-clock cutoff. Constraint: Merge readiness requires P0/P1 blockers to be removed across the benchmark PR stack. Rejected: Removing the stress assertion | it still guards pathological large-log regressions. Confidence: high Scope-risk: narrow Directive: Wall-clock stress assertions must include CI/local load headroom and should not encode a near-observed runtime as the limit. Tested: npm test -- --runTestsByPath tests/orchestration/stress/large-data.test.ts --runInBand; npm run build Not-tested: Full Jest suite after this isolated flaky-threshold fix. Co-authored-by: OmX <omx@oh-my-codex.dev>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56e10fd1bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| */ | ||
|
|
||
| const HEADLINE_MODES = new Set(['live', 'live-llm', 'recorded-real']); | ||
| const DIAGNOSTIC_STATUSES = new Set(['skipped', 'dependency_missing', 'not_wired', 'dry_run', 'mock', 'scaffold', 'diagnostic']); |
There was a problem hiding this comment.
Treat failed/partial result statuses as diagnostic-only
The new headline gate blocks several diagnostic statuses but omits failed and partial, so partitionHeadlineResults will currently accept a row as headline-eligible when it has status: "failed" (or "partial") as long as claimEligibility.eligible is true and postcondition evidence exists. This undermines the fail-closed status gate added in this commit and is inconsistent with the shared benchmark status vocabulary (DIAGNOSTIC_RESULT_STATUSES) that classifies those states as diagnostic.
Useful? React with 👍 / 👎.
Summary
docs/benchmarks/benchmark-pr-plan.mdbenchmark/results/*.jsonfiles remain visibly diagnostic after releasesskipped,dependency_missing,not_wired,dry_run,mock,scaffold,diagnostic) from headline report rowsScope boundaries
In scope for this PR:
Out of scope:
Verification
npm test -- --runTestsByPath tests/benchmark/benchmark-readiness.test.ts tests/benchmark/utils/artifact-freshness.test.ts tests/benchmark/utils/result-envelope.test.ts tests/benchmark/episode-harness/claim-eligibility.test.ts --runInBandnode benchmark/claim-eligibility.test.mjsnode benchmark/headline-gate.test.mjsnpm run bench:readinessnpm run buildnpm run bench:api-key-readinessexits 1 as expected while non-key blockers remaingit diff --checkNotes
This is PR1 of the validated benchmark plan. It intentionally does not make any open benchmark issue headline-ready; it makes the remaining work harder to misreport by surfacing stale artifacts and fail-closing diagnostic rows before later measurement PRs add live/recorded-real evidence.