Stress real-world tasks through final postconditions by shaun0927 · Pull Request #1344 · shaun0927/openchrome

shaun0927 · 2026-05-17T16:30:19Z

Summary

adds stress mode for the controlled real-world task corpus with one deterministic fault checkpoint per task
records fault evidence, recovery timing/steps, Chrome RSS, and zombie-process fields on stress rows
defines recovery as final postcondition success after injected fault, not merely surviving an exception
expands the real-world report with fault-stress rows and refreshes reliability/readiness artifacts
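
The recovery rule in the summary can be sketched in a few lines. This is a minimal illustration, not the PR's actual schema: `faultCheckpoint` and `finalPostconditionEvidence` are named in the commit message, while every other field name and both helper functions are assumptions made for the example.

```typescript
// Hypothetical stress-row shape. Only faultCheckpoint and
// finalPostconditionEvidence are names taken from the PR text.
interface StressRow {
  taskId: string;
  faultCheckpoint: string;            // where in the task the fault was injected
  faultInjected: boolean;
  exceptionSurvived: boolean;         // the harness kept running after the fault
  finalPostconditionPassed: boolean;  // the task's end state was verified
  finalPostconditionEvidence: string;
  recoveryMs: number;
  recoverySteps: number;
  chromeRssBytes: number | null;      // null when no process sample was taken
  zombieProcessCount: number | null;
}

// A row counts as recovered only if a fault was injected AND the final
// postcondition passed. Surviving the exception alone is not enough.
function isRecovered(row: StressRow): boolean {
  return row.faultInjected && row.finalPostconditionPassed;
}

// Share of fault-injected rows that ended in a verified postcondition.
function recoveryRate(rows: StressRow[]): number {
  const faulted = rows.filter((r) => r.faultInjected);
  if (faulted.length === 0) return 0;
  return faulted.filter(isRecovered).length / faulted.length;
}
```

Under this sketch, a row that swallowed the injected exception but left the task unfinished lowers the recovery rate, which is the distinction the summary draws.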

PR scope validation

In scope for PR7:

Benchmark #D: Reliability & Fault-Recovery — recovery rate, flaky rate, leak/zombie #1259 reliability stress diagnostics and process-sampling fields
Benchmark #D follow-up: inject reliability faults inside real-world tasks #1303 fault injection inside real-world tasks
Benchmark #D follow-up: real-world task completion as primary reliability signal #1304 recovery judged by task-completion postconditions

Out of scope:

new task taxonomy beyond PR4
real LLM provider implementation
competitor native wiring
headline promotion without live/recorded-real gates

Verification

npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts --runInBand
npm run bench:realworld:stress
npm run bench:reliability
npm run bench:readiness
npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand
npm run build
git diff --check

Notes

Stacked on #1343 / benchmark/pr6-native-competitor-execution.

Co-authored-by: OmX <omx@oh-my-codex.dev>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dae5581f38


chatgpt-codex-connector · 2026-05-17T16:33:40Z

+    const plan = STRESS_FAULTS_BY_TASK[run.taskId];
+    const success = true;
+    return {


Guard missing stress fault plans before dereferencing

This assumes every task ID has an entry in STRESS_FAULTS_BY_TASK, but there is no guard before reading plan.fault. If a task is added/renamed in realWorldTaskSpecs without updating the map, bench:realworld:stress will crash at runtime with a Cannot read properties of undefined error instead of producing actionable diagnostics.
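
One way to address this is a small lookup helper that fails with a descriptive message. The sketch below is an assumption about how such a guard could look: only `STRESS_FAULTS_BY_TASK` and the `fault` property come from the review, while the `StressFaultPlan` shape, the `checkpoint` field, the sample entry, and `requireStressFaultPlan` are hypothetical.

```typescript
// Hypothetical plan shape; the review only confirms a `fault` property.
interface StressFaultPlan {
  fault: string;
  checkpoint: string;
}

// Placeholder entry for illustration, not a real task ID from the corpus.
const STRESS_FAULTS_BY_TASK: Record<string, StressFaultPlan | undefined> = {
  'example-task': { fault: 'example-fault', checkpoint: 'example-checkpoint' },
};

// Look up the plan and fail with an actionable message instead of letting
// a later `plan.fault` read throw "Cannot read properties of undefined".
function requireStressFaultPlan(taskId: string): StressFaultPlan {
  const plan = STRESS_FAULTS_BY_TASK[taskId];
  if (plan === undefined) {
    const known = Object.keys(STRESS_FAULTS_BY_TASK).join(', ');
    throw new Error(
      `No stress fault plan for task "${taskId}". ` +
        `Add an entry to STRESS_FAULTS_BY_TASK (known task IDs: ${known}).`,
    );
  }
  return plan;
}
```

Typing the map values as `StressFaultPlan | undefined` also makes the compiler flag any unguarded `plan.fault` access, so a renamed task is caught before the benchmark runs.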


chatgpt-codex-connector · 2026-05-17T16:33:40Z

+  const stress = argv.includes('--stress');
  const recordedSamples = recordingDir ? loadRecordedSamples(recordingDir) : [];
-  const runs = recordingDir ? recordedSamplesToRuns(recordedSamples) : deterministicOpenChromeFixtureRuns();
+  const runs = recordingDir ? recordedSamplesToRuns(recordedSamples) : stress ? deterministicOpenChromeStressRuns() : deterministicOpenChromeFixtureRuns();


Honor --stress when a recording directory is provided

The run-selection logic ignores --stress whenever --recording-dir is present, but later fields still use the stress flag (issue and stressMode). In that combination, the output can be labeled as stress mode without any injected-fault runs, which misstates what was actually measured and can corrupt readiness/reporting artifacts.
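
A possible fix is to decide the run source and the stress label in one place so they cannot disagree. The sketch below rejects the `--stress` plus `--recording-dir` combination outright; that policy is an assumption, and applying fault injection to recorded runs would be an equally valid reading of the review. `RunSource`, `RunSelection`, and `selectRunSource` are hypothetical names.

```typescript
// Hypothetical labels for the three run sources in the diff above.
type RunSource = 'recorded' | 'stress-fixture' | 'fixture';

interface RunSelection {
  source: RunSource;
  stressMode: boolean; // derived from the chosen source, not from the raw flag
}

// Pick the run source and the stress label together so the report can never
// claim stress mode for runs that had no injected faults.
function selectRunSource(argv: string[], recordingDir: string | undefined): RunSelection {
  const stress = argv.includes('--stress');
  if (stress && recordingDir) {
    // Assumed policy for this sketch: refuse the ambiguous combination.
    throw new Error(
      '--stress cannot be combined with --recording-dir in this sketch: ' +
        'stress runs are only generated from deterministic fixtures.',
    );
  }
  if (recordingDir) return { source: 'recorded', stressMode: false };
  if (stress) return { source: 'stress-fixture', stressMode: true };
  return { source: 'fixture', stressMode: false };
}
```

Downstream fields such as `issue` and `stressMode` would then read from the returned selection instead of re-checking the flag.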


Inject deterministic reliability faults inside the controlled real-world corpus, record recovery timing/RSS/zombie fields, and judge recovery only by final postcondition evidence.

Constraint: Stress rows are local deterministic diagnostics, not live publishable reliability claims.
Rejected: Count injected-fault handling as recovered before final task verification | Reliability is user-visible only when the final task postcondition passes.
Confidence: high
Scope-risk: moderate
Directive: Future live stress runs must preserve faultCheckpoint, finalPostconditionEvidence, and diagnostic-vs-headline eligibility fields.
Tested: npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts --runInBand; npm run bench:realworld:stress; npm run bench:reliability; npm run bench:readiness; npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand; npm run build; git diff --check
Not-tested: Live Chrome/CDP fault injection and operator-run process sampling.

Co-authored-by: OmX <omx@oh-my-codex.dev>

shaun0927 mentioned this pull request May 17, 2026

Gate full benchmark release through preflight #1345

Merged

chatgpt-codex-connector Bot reviewed May 17, 2026


shaun0927 force-pushed the benchmark/pr6-native-competitor-execution branch from 5ffbe62 to cf9a0ee Compare May 18, 2026 03:20

shaun0927 force-pushed the benchmark/pr7-realworld-fault-stress branch from dae5581 to ec00369 Compare May 18, 2026 03:20

shaun0927 changed the base branch from benchmark/pr6-native-competitor-execution to main May 18, 2026 03:28

shaun0927 merged commit 148d2ac into main May 18, 2026

shaun0927 merged 1 commit into main from benchmark/pr7-realworld-fault-stress
