Stress real-world tasks through final postconditions #1344

Merged
shaun0927 merged 1 commit into main from benchmark/pr7-realworld-fault-stress on May 18, 2026

@shaun0927
Owner

Summary

  • adds stress mode for the controlled real-world task corpus with one deterministic fault checkpoint per task
  • records fault evidence, recovery timing/steps, Chrome RSS, and zombie-process fields on stress rows
  • defines recovery as final postcondition success after injected fault, not merely surviving an exception
  • expands the real-world report with fault-stress rows and refreshes reliability/readiness artifacts
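
The recovery rule in the third bullet can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `StressRowInput` shape, its field names, and `judgeRecovery` are hypothetical stand-ins for whatever the stress rows actually record.

```typescript
// Hypothetical row shape for illustration; the PR's real stress-row type is not shown here.
interface StressRowInput {
  taskId: string;
  faultInjected: boolean;            // a fault was injected at the task's checkpoint
  exceptionHandled: boolean;         // the runner survived the injected fault
  finalPostconditionPassed: boolean; // the task's final postcondition was verified
}

type RecoveryVerdict = 'recovered' | 'not-recovered' | 'no-fault';

// Recovery is decided only by the final postcondition.
// exceptionHandled is deliberately ignored: surviving the fault is not recovery.
function judgeRecovery(row: StressRowInput): RecoveryVerdict {
  if (!row.faultInjected) {
    return 'no-fault';
  }
  return row.finalPostconditionPassed ? 'recovered' : 'not-recovered';
}
```

Under this sketch, a row where the fault was handled but the final postcondition failed is still reported as not recovered.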

PR scope validation

In scope for PR7:

Out of scope:

  • new task taxonomy beyond PR4
  • real LLM provider implementation
  • competitor native wiring
  • headline promotion without live/recorded-real gates

Verification

  • npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts --runInBand
  • npm run bench:realworld:stress
  • npm run bench:reliability
  • npm run bench:readiness
  • npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand
  • npm run build
  • git diff --check

Notes

Stacked on #1343 / benchmark/pr6-native-competitor-execution.

Co-authored-by: OmX <omx@oh-my-codex.dev>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dae5581f38

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +164 to +166
const plan = STRESS_FAULTS_BY_TASK[run.taskId];
const success = true;
return {

P2: Guard missing stress fault plans before dereferencing

This assumes every task ID has an entry in STRESS_FAULTS_BY_TASK, but there is no guard before reading plan.fault. If a task is added/renamed in realWorldTaskSpecs without updating the map, bench:realworld:stress will crash at runtime with a Cannot read properties of undefined error instead of producing actionable diagnostics.
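
One possible shape for such a guard is sketched below. It is an assumption about how the lookup could be hardened, not the fix that was merged: `requireStressFaultPlan`, the `StressFaultPlan` type, and its `checkpoint` field are hypothetical; only `plan.fault` and the map lookup by `taskId` come from the code under review.

```typescript
// Hypothetical plan shape; only the `fault` property is implied by the reviewed code.
type StressFaultPlan = { fault: string; checkpoint: string };

// Look up a plan and fail with an actionable message instead of a bare
// "Cannot read properties of undefined" when the task ID has no entry.
function requireStressFaultPlan(
  plans: Readonly<Record<string, StressFaultPlan | undefined>>,
  taskId: string,
): StressFaultPlan {
  // hasOwnProperty avoids matching inherited keys such as "constructor".
  const plan = Object.prototype.hasOwnProperty.call(plans, taskId) ? plans[taskId] : undefined;
  if (!plan) {
    const known = Object.keys(plans).sort().join(', ') || '(none)';
    throw new Error(`No stress fault plan for task "${taskId}". Known task IDs: ${known}`);
  }
  return plan;
}
```

A call site would then read `const plan = requireStressFaultPlan(STRESS_FAULTS_BY_TASK, run.taskId);`, so a renamed or newly added task surfaces as a named diagnostic.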

const stress = argv.includes('--stress');
const recordedSamples = recordingDir ? loadRecordedSamples(recordingDir) : [];
- const runs = recordingDir ? recordedSamplesToRuns(recordedSamples) : deterministicOpenChromeFixtureRuns();
+ const runs = recordingDir ? recordedSamplesToRuns(recordedSamples) : stress ? deterministicOpenChromeStressRuns() : deterministicOpenChromeFixtureRuns();

P2: Honor --stress when a recording directory is provided

The run-selection logic ignores --stress whenever --recording-dir is present, but later fields still use the stress flag (issue and stressMode). In that combination, the output can be labeled as stress mode without any injected-fault runs, which misstates what was actually measured and can corrupt readiness/reporting artifacts.
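
One way to keep the label and the measured runs consistent is to reject the ambiguous flag combination up front. The sketch below assumes that recorded samples carry no injected faults, as the comment above suggests; `RunOptions`, `RunSource`, and `selectRunSource` are hypothetical names, and applying faults to recorded runs would be an equally valid alternative.

```typescript
// Hypothetical, already-parsed CLI options; how the real script parses flags is not shown.
interface RunOptions {
  stress: boolean;
  recordingDir?: string;
}

type RunSource = 'recorded' | 'stress-fixture' | 'fixture';

// Decide which runs to produce, refusing to label recorded runs as stress mode.
function selectRunSource(options: RunOptions): RunSource {
  if (options.stress && options.recordingDir) {
    throw new Error(
      '--stress cannot be combined with --recording-dir: recorded samples contain no injected-fault runs',
    );
  }
  if (options.recordingDir) {
    return 'recorded';
  }
  return options.stress ? 'stress-fixture' : 'fixture';
}
```

Because the conflict is rejected before any rows are written, downstream fields such as stressMode can only be set when stress runs were actually generated.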

Inject deterministic reliability faults inside the controlled real-world corpus, record recovery timing/RSS/zombie fields, and judge recovery only by final postcondition evidence.

Constraint: Stress rows are local deterministic diagnostics, not live publishable reliability claims.
Rejected: Count injected-fault handling as recovered before final task verification | Reliability is user-visible only when the final task postcondition passes.
Confidence: high
Scope-risk: moderate
Directive: Future live stress runs must preserve faultCheckpoint, finalPostconditionEvidence, and diagnostic-vs-headline eligibility fields.
Tested: npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts --runInBand; npm run bench:realworld:stress; npm run bench:reliability; npm run bench:readiness; npm test -- --runTestsByPath tests/benchmark/realworld-task-completion/stress.test.ts tests/benchmark/reliability/episode-fault-hooks.test.ts tests/benchmark/benchmark-readiness.test.ts --runInBand; npm run build; git diff --check
Not-tested: Live Chrome/CDP fault injection and operator-run process sampling.

Co-authored-by: OmX <omx@oh-my-codex.dev>
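
The Directive in the commit message could be enforced with a small field check along these lines. This is only a sketch: `faultCheckpoint` and `finalPostconditionEvidence` are named in the commit message, while `headlineEligible` is a hypothetical stand-in for the diagnostic-vs-headline eligibility field, whose real name is not given.

```typescript
// Fields a stress row is expected to keep. `headlineEligible` is a hypothetical
// name for the diagnostic-vs-headline eligibility field.
const REQUIRED_STRESS_FIELDS = [
  'faultCheckpoint',
  'finalPostconditionEvidence',
  'headlineEligible',
] as const;

// Return the names of required fields that are absent (undefined or null).
// A value of `false` counts as present, so a diagnostic-only row still passes.
function missingStressFields(row: Record<string, unknown>): string[] {
  return REQUIRED_STRESS_FIELDS.filter(
    (field) => row[field] === undefined || row[field] === null,
  );
}
```

A test over generated rows could assert that `missingStressFields(row)` is empty for every stress row before artifacts are refreshed.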
shaun0927 force-pushed the benchmark/pr6-native-competitor-execution branch from 5ffbe62 to cf9a0ee on May 18, 2026 03:20
shaun0927 force-pushed the benchmark/pr7-realworld-fault-stress branch from dae5581 to ec00369 on May 18, 2026 03:20
shaun0927 changed the base branch from benchmark/pr6-native-competitor-execution to main on May 18, 2026 03:28
shaun0927 merged commit 148d2ac into main on May 18, 2026