Stress real-world tasks through final postconditions #1344
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dae5581f38
const plan = STRESS_FAULTS_BY_TASK[run.taskId];
const success = true;
return {
Guard missing stress fault plans before dereferencing
This assumes every task ID has an entry in STRESS_FAULTS_BY_TASK, but there is no guard before reading plan.fault. If a task is added or renamed in realWorldTaskSpecs without updating the map, bench:realworld:stress will crash at runtime with a "Cannot read properties of undefined" error instead of producing actionable diagnostics.
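One way to add the guard is sketched below. This is a minimal illustration, not the PR's code: the `StressFaultPlan` shape, the `requireStressPlan` helper, and the placeholder map entry are all hypothetical; only `STRESS_FAULTS_BY_TASK` and `plan.fault` are taken from the comment above.

```typescript
// Hypothetical shape: the real plan type is defined in the PR and is not shown here.
interface StressFaultPlan {
  fault: string;
}

// Typing the values as possibly undefined makes the compiler require a check.
const STRESS_FAULTS_BY_TASK: Record<string, StressFaultPlan | undefined> = {
  "example-task": { fault: "example-fault" }, // placeholder entry, not from the PR
};

// Hypothetical helper: returns the plan or throws an error that names the
// missing task, instead of failing later with a TypeError on plan.fault.
function requireStressPlan(taskId: string): StressFaultPlan {
  const plan = STRESS_FAULTS_BY_TASK[taskId];
  if (plan === undefined) {
    throw new Error(
      `No stress fault plan for task "${taskId}"; add an entry to STRESS_FAULTS_BY_TASK.`,
    );
  }
  return plan;
}
```

A caller would then write `const plan = requireStressPlan(run.taskId);` so that a task added without a map entry fails with a message pointing at the map.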
  const stress = argv.includes('--stress');
  const recordedSamples = recordingDir ? loadRecordedSamples(recordingDir) : [];
- const runs = recordingDir ? recordedSamplesToRuns(recordedSamples) : deterministicOpenChromeFixtureRuns();
+ const runs = recordingDir ? recordedSamplesToRuns(recordedSamples) : stress ? deterministicOpenChromeStressRuns() : deterministicOpenChromeFixtureRuns();
Honor --stress when a recording directory is provided
The run-selection logic ignores --stress whenever --recording-dir is present, but later fields still use the stress flag (issue and stressMode). In that combination, the output can be labeled as stress mode without any injected-fault runs, which misstates what was actually measured and can corrupt readiness/reporting artifacts.
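One possible fix is to derive the stress label from the runs that were actually selected and to reject the ambiguous flag combination. The sketch below is an assumption about how that could look: `RunSource`, `RunSelection`, and `selectRunSource` are hypothetical names, and rejecting the combination is only one option (the other would be injecting faults into recorded runs, which this sketch does not attempt).

```typescript
// Hypothetical types: not part of the PR.
type RunSource = "recorded" | "stress" | "fixture";

interface RunSelection {
  source: RunSource;
  // True only when injected-fault runs were actually selected.
  stressMode: boolean;
}

// Hypothetical helper: decides the run source once, so later fields such as
// stressMode cannot disagree with what was measured.
function selectRunSource(
  argv: readonly string[],
  recordingDir: string | undefined,
): RunSelection {
  const stress = argv.includes("--stress");
  if (stress && recordingDir) {
    throw new Error(
      "--stress cannot be combined with --recording-dir: recorded samples carry no injected faults.",
    );
  }
  const source: RunSource = recordingDir ? "recorded" : stress ? "stress" : "fixture";
  return { source, stressMode: source === "stress" };
}
```

With this shape, the report would read `stressMode` from the selection result rather than from the raw flag.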
Inject deterministic reliability faults inside the controlled real-world corpus, record recovery timing/RSS/zombie fields, and judge recovery only by final postcondition evidence.

Constraint: Stress rows are local deterministic diagnostics, not live publishable reliability claims.
Rejected: Count injected-fault handling as recovered before final task verification | Reliability is user-visible only when the final task postcondition passes.
Confidence: high
Scope-risk: moderate
Directive: Future live stress runs must preserve faultCheckpoint, finalPostconditionEvidence, and diagnostic-vs-headline eligibility fields.
Tested: npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts --runInBand; npm run bench:realworld:stress; npm run bench:reliability; npm run bench:readiness; npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand; npm run build; git diff --check
Not-tested: Live Chrome/CDP fault injection and operator-run process sampling.

Co-authored-by: OmX <omx@oh-my-codex.dev>
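The rule "judge recovery only by final postcondition evidence" can be illustrated with a small sketch. The field names faultCheckpoint and finalPostconditionEvidence come from the commit message; their shapes, the `faultHandled` field, and the `isRecovered` function are assumptions made for illustration only.

```typescript
// Hypothetical record shape; only the two field names below are from the commit message.
interface StressRunOutcome {
  faultCheckpoint: string; // where the fault was injected (assumed to be a label)
  faultHandled: boolean; // hypothetical: the fault handler ran without error
  finalPostconditionEvidence: { passed: boolean }; // assumed shape
}

// A run counts as recovered only if the final task postcondition passed.
// Handling the injected fault is deliberately not sufficient on its own.
function isRecovered(outcome: StressRunOutcome): boolean {
  return outcome.finalPostconditionEvidence.passed;
}

const handledButFailed: StressRunOutcome = {
  faultCheckpoint: "example-checkpoint",
  faultHandled: true,
  finalPostconditionEvidence: { passed: false },
};
```

Here `isRecovered(handledButFailed)` is `false`: the fault was handled, but the task did not reach its final postcondition, so the run is not counted as recovered.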
5ffbe62 to cf9a0ee
dae5581 to ec00369
Summary
PR scope validation
In scope for PR7:
Out of scope:
Verification
- npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts --runInBand
- npm run bench:realworld:stress
- npm run bench:reliability
- npm run bench:readiness
- npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand
- npm run build
- git diff --check
Notes
Stacked on #1343 / benchmark/pr6-native-competitor-execution.
Co-authored-by: OmX <omx@oh-my-codex.dev>