Skip to content

Protect benchmark claims before expanding measurements#1338

Merged
shaun0927 merged 2 commits into
mainfrom
benchmark/pr1-contract-hardening
May 18, 2026
Merged

Protect benchmark claims before expanding measurements#1338
shaun0927 merged 2 commits into
mainfrom
benchmark/pr1-contract-hardening

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

Summary

  • validates the benchmark work into an 8-PR dependency ladder and records the plan in docs/benchmarks/benchmark-pr-plan.md
  • adds stale OpenChrome result-artifact detection to the readiness audit so old benchmark/results/*.json files remain visibly diagnostic after releases
  • adds shared benchmark result-status vocabulary and blocks diagnostic statuses (skipped, dependency_missing, not_wired, dry_run, mock, scaffold, diagnostic) from headline report rows
  • refreshes readiness artifacts with the new PR scopes and stale artifact table

Scope boundaries

In scope for this PR:

  • benchmark harness/reporting contracts
  • claim/headline safety gates
  • stale committed result visibility
  • PR scope validation and ordering

Out of scope:

  • real LLM execution
  • competitor native loop implementation
  • new live benchmark measurements
  • OpenChrome product/core behavior changes

Verification

  • npm test -- --runTestsByPath tests/benchmark/benchmark-readiness.test.ts tests/benchmark/utils/artifact-freshness.test.ts tests/benchmark/utils/result-envelope.test.ts tests/benchmark/episode-harness/claim-eligibility.test.ts --runInBand
  • node benchmark/claim-eligibility.test.mjs
  • node benchmark/headline-gate.test.mjs
  • npm run bench:readiness
  • npm run build
  • npm run bench:api-key-readiness exits 1 as expected while non-key blockers remain
  • git diff --check

Notes

This is PR1 of the validated benchmark plan. It intentionally does not make any open benchmark issue headline-ready; it makes the remaining work harder to misreport by surfacing stale artifacts and fail-closing diagnostic rows before later measurement PRs add live/recorded-real evidence.

Define the first benchmark PR slice around shared row status semantics, stale artifact visibility, and fail-closed headline gates so later measurement work cannot promote diagnostic evidence by accident.

Constraint: Benchmark work must stay in benchmark harness/docs and avoid OpenChrome product-core behavior changes.

Rejected: Implementing live competitor or LLM loops in this PR | those depend on shared claim semantics and belong to later scoped PRs.

Confidence: high

Scope-risk: narrow

Directive: Keep mock, scaffold, dry-run, dependency-missing, and stale-result evidence diagnostic until explicit live/recorded-real eligibility gates pass.

Tested: npm test -- --runTestsByPath tests/benchmark/benchmark-readiness.test.ts tests/benchmark/utils/artifact-freshness.test.ts tests/benchmark/utils/result-envelope.test.ts tests/benchmark/episode-harness/claim-eligibility.test.ts --runInBand

Tested: node benchmark/claim-eligibility.test.mjs; node benchmark/headline-gate.test.mjs

Tested: npm run bench:readiness

Tested: npm run build

Tested: npm run bench:api-key-readiness exits 1 as expected

Tested: git diff --check

Not-tested: Full live benchmark execution; competitor native loops; paid LLM API paths

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 43e6bb9d62

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

continue;
}
const foundVersions = findOpenChromeVersionPins(parsed).filter(
(version) => version !== currentOpenChromeVersion && !/^(unknown|operator-pinned-runtime|idiomatic-script-only|TBD)/i.test(version),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Compare versions semantically before marking artifacts stale

The freshness audit currently treats any OpenChrome version mismatch as stale via version !== currentOpenChromeVersion, but the report text says only artifacts with older pins should be flagged. In practice, this will misclassify artifacts generated by a newer OpenChrome version (for example, when auditing an older release branch) as stale, inflating staleResultArtifactCount and keeping valid artifacts in diagnostic status. Please compare versions by semver ordering (older-than current) instead of raw inequality.

Useful? React with 👍 / 👎.

Keep the large progress-log stress test as a regression guard without failing on small scheduler jitter over the old 30s wall-clock cutoff.

Constraint: Merge readiness requires P0/P1 blockers to be removed across the benchmark PR stack.

Rejected: Removing the stress assertion | it still guards pathological large-log regressions.

Confidence: high

Scope-risk: narrow

Directive: Wall-clock stress assertions must include CI/local load headroom and should not encode a near-observed runtime as the limit.

Tested: npm test -- --runTestsByPath tests/orchestration/stress/large-data.test.ts --runInBand; npm run build

Not-tested: Full Jest suite after this isolated flaky-threshold fix.

Co-authored-by: OmX <omx@oh-my-codex.dev>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56e10fd1bb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

*/

const HEADLINE_MODES = new Set(['live', 'live-llm', 'recorded-real']);
const DIAGNOSTIC_STATUSES = new Set(['skipped', 'dependency_missing', 'not_wired', 'dry_run', 'mock', 'scaffold', 'diagnostic']);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat failed/partial result statuses as diagnostic-only

The new headline gate blocks several diagnostic statuses but omits failed and partial, so partitionHeadlineResults will currently accept a row as headline-eligible when it has status: "failed" (or "partial") as long as claimEligibility.eligible is true and postcondition evidence exists. This undermines the fail-closed status gate added in this commit and is inconsistent with the shared benchmark status vocabulary (DIAGNOSTIC_RESULT_STATUSES) that classifies those states as diagnostic.

Useful? React with 👍 / 👎.

@shaun0927 shaun0927 merged commit 9677fe6 into main May 18, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant