Plan C: public proof layer on the workbench engine #4
Layers the Plan C public-proof surface on top of the existing local eval harness (workbench.py). The canonical engine stays the source of truth and gains durable, externally-verifiable public telemetry.
- benchmark_runner.py: runs prompt_v1 vs prompt_v2 over examples/support_qa through workbench.run_evaluation / compare_runs / assess_gate, and publishes a schema-conformed artifact (api/_benchmark_latest.json + bounded history)
- api/benchmark-latest.py: serves the latest run (stdlib, previous_run delta, pending envelope before first run)
- api/stats.py -> mode:"live" with benchmark-derived metrics (eval runs, pass rate, distinct regressions caught over 30d); honest degraded fallback
- .github/workflows/benchmark.yml: weekly + on-demand re-run that commits the refreshed artifact back and validates freshness
- dashboard overview + telemetry now read Tier-A and render the latest run (variant comparison, deltas, gate), preserving the /prototype link
- vercel.json: CORS + cache headers for /api/benchmark-latest
Seeded run: prompt_v2 passes 4/4 vs prompt_v1 0/4 (+100% pass rate, 0 regressions, gate pass). Every value is computed from committed runs; nothing simulated or seeded.
Tests: 25 unittest + 36 vitest; next build clean. The engine (workbench.py) is untouched.
Refs: outputs/plans/PLAN_C_PROOF_FIRST.md (Phase 2)
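For readers unfamiliar with the "pending envelope" and "previous_run delta" behaviour mentioned for api/benchmark-latest.py, here is a minimal sketch of how such a response body might be assembled. It is not the code in this PR: the field names (status, run_id, pass_rate, pass_rate_delta) and the oldest-first ordering of the history file are assumptions made for illustration only.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_json(path: Path) -> Optional[Any]:
    """Return parsed JSON, or None if the file is missing or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def build_latest_envelope(latest: Optional[dict], history: Optional[list]) -> dict:
    """Assemble the response body for the latest benchmark run.

    Hypothetical schema: every run dict carries "run_id" and "pass_rate",
    and `history` is ordered oldest-first.
    """
    if not latest:
        # No run has been committed yet: answer with a pending envelope
        # instead of an error.
        return {"status": "pending", "run": None, "previous_run": None}

    # The previous run is the newest history entry that is not the latest run.
    previous = None
    for entry in reversed(history or []):
        if entry.get("run_id") != latest.get("run_id"):
            previous = entry
            break

    delta = None
    if previous is not None:
        delta = round(latest["pass_rate"] - previous["pass_rate"], 4)

    return {
        "status": "ok",
        "run": latest,
        "previous_run": previous,
        "pass_rate_delta": delta,
    }
```

A handler could call build_latest_envelope(load_json(latest_path), load_json(history_path)) and serialise the result, so a missing artifact yields the pending envelope rather than a 5xx.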
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f6737239be
    git add api/_benchmark_latest.json api/_benchmark_history.json \
      examples/benchmark-v1/results examples/benchmark-v1/pinned-baseline.json
Stage benchmark outputs from the correct directory
The git add command in the benchmark workflow still references examples/benchmark-v1/..., but this commit writes outputs under examples/benchmark/... (archive/ and latest-report.md). On GitHub Actions, git add with nonexistent pathspecs exits non-zero, so the scheduled/manual benchmark job fails before commit/push and the public artifacts never refresh. Update this step to add the actual output paths produced by benchmark_runner.py.
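One way to make the staging step robust against this class of failure is to filter the path list before handing it to git. The sketch below is illustrative only: existing_paths and stage_outputs are hypothetical helpers, not part of benchmark_runner.py or the workflow, and the candidate list must be filled in with the runner's real output paths.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def existing_paths(candidates: Iterable[str], root: Path = Path(".")) -> List[str]:
    """Keep only the candidate paths that exist under `root`."""
    return [p for p in candidates if (root / p).exists()]


def stage_outputs(candidates: Iterable[str], root: Path = Path(".")) -> List[str]:
    """Run `git add` on the candidates that exist and return what was staged.

    Passing only existing paths avoids the "pathspec did not match any files"
    failure that aborts the job.
    """
    paths = existing_paths(candidates, root)
    if paths:
        subprocess.run(["git", "add", "--", *paths], cwd=root, check=True)
    return paths


if __name__ == "__main__":
    # The two api/ paths come from the workflow step quoted above; append the
    # actual output paths written by the runner (assumption: not listed here).
    staged = stage_outputs(
        ["api/_benchmark_latest.json", "api/_benchmark_history.json"]
    )
    print("staged:", staged)
```

Silently skipping missing paths can hide exactly the kind of path drift flagged here, so a caller would probably still want to fail the job when an expected artifact is absent from the returned list.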
Distinguishes EvalOps' synthetic benchmark workload from production user traffic (NexusRAG) on the shared homepage telemetry grid. Fleet-wide field agreed for all Tier-A prototypes.
package-lock.json is gitignored (scaffold default) and Vercel deploys with
npm install, so cache:"npm" + npm ci failed ("lock file is not found").
Match the repo install strategy so CI is green.
Summary
Layers the Plan C public-proof surface on top of the existing workbench.py eval engine (the local-harness MVP already on main). The engine stays the source of truth; EvalOps gains durable, externally-verifiable public telemetry.
This supersedes #3, which was built against a stale local clone (before the workbench MVP landed) and reinvented the engine. This PR keeps your engine and adds only the missing layer.
What it adds
- benchmark_runner.py: runs prompt_v1 vs prompt_v2 over examples/support_qa.json through workbench.run_evaluation / compare_runs / assess_gate, and publishes a schema-conformed artifact (api/_benchmark_latest.json + bounded api/_benchmark_history.json). DuckDB engine runs in CI; the deployed endpoints only read committed JSON.
- api/benchmark-latest.py: serves the latest run (stdlib, previous_run delta, pending envelope before the first run).
- api/stats.py → mode: "live": Tier-A metrics computed from committed history (eval_runs_total/24h, last_pass_rate, rolling_pass_rate_7d, regressions_caught_30d as a distinct-id union, experiments_tracked). Honest degraded fallback, never 5xx.
- .github/workflows/benchmark.yml: weekly + workflow_dispatch re-verification that commits the refreshed artifact back and validates freshness.
- Dashboard overview + telemetry: now read Tier-A and render the latest run (variant comparison, deltas, gate); the /prototype link is preserved.
- vercel.json: CORS + cache headers for /api/benchmark-latest.
Seeded run (committed)
prompt_v2 (grounded) passes 4/4 vs prompt_v1 (baseline) 0/4: pass rate +100 pts, 0 regressions, 4 improvements, gate PASS. Every value computed from committed runs; nothing simulated or seeded.
Honesty
mode: "live" reflects a real recurring computed workload (the benchmark), persisted durably and reproducible offline. The engine (workbench.py) is untouched.
Test plan
- uv run python -m unittest discover -s tests: 25 pass (workbench + cli + stats + endpoints + runner)
- vitest run: 36 pass
- next build clean (9 routes)
- /api/stats (mode:"live") + /api/benchmark-latest; lights up /work/evalops-workbench on eleventh.dev
Held for your review (not auto-merged). Refs: outputs/plans/PLAN_C_PROOF_FIRST.md
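To make the Tier-A metric definitions under "What it adds" concrete, the sketch below shows one plausible way to derive them from a committed history list, including the distinct-id union for regressions and a degraded fallback. It is not the api/stats.py in this PR. The history schema (finished_at as an ISO-8601 string with a UTC offset, pass_rate, regression_ids) and the eval_runs_24h key are assumptions; experiments_tracked is omitted because its definition is not given here.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def _finished(run: dict) -> datetime:
    # Assumed field: ISO-8601 timestamp with an explicit offset, e.g. "+00:00".
    return datetime.fromisoformat(run["finished_at"])


def compute_stats(history: List[dict], now: datetime) -> dict:
    """Derive live metrics from committed runs (hypothetical schema)."""
    runs = sorted(history, key=_finished)

    def within(days: int) -> List[dict]:
        cutoff = now - timedelta(days=days)
        return [r for r in runs if _finished(r) >= cutoff]

    last_7d = within(7)
    rolling = (
        sum(r["pass_rate"] for r in last_7d) / len(last_7d) if last_7d else None
    )

    # Distinct-id union: a regression seen in several runs is counted once.
    caught = set()
    for run in within(30):
        caught.update(run.get("regression_ids", []))

    return {
        "mode": "live",
        "eval_runs_total": len(runs),
        "eval_runs_24h": len(within(1)),
        "last_pass_rate": runs[-1]["pass_rate"],
        "rolling_pass_rate_7d": rolling,
        "regressions_caught_30d": len(caught),
    }


def safe_stats(history, now: datetime) -> dict:
    """Never raise: fall back to a degraded payload on empty or malformed data."""
    try:
        if not history:
            return {"mode": "degraded", "reason": "no committed runs"}
        return compute_stats(history, now)
    except (KeyError, TypeError, ValueError, IndexError) as exc:
        return {"mode": "degraded", "reason": type(exc).__name__}


if __name__ == "__main__":
    print(safe_stats([], datetime.now(timezone.utc)))
```

Keeping the computation pure (history and now passed in) makes the numbers reproducible offline from the committed JSON, which matches the intent stated in the Honesty section.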