Plan C: public proof layer on the workbench engine #4

Merged: IgnazioDS merged 3 commits into main from feat/plan-c-proof-layer on May 26, 2026
@IgnazioDS (Owner)

Summary

Layers the Plan C public-proof surface on top of the existing workbench.py eval engine (the local-harness MVP already on main). The engine stays the source of truth; EvalOps gains durable, externally-verifiable public telemetry.

This supersedes #3, which was built against a stale local clone (before the workbench MVP landed) and reinvented the engine. This PR keeps your engine and adds only the missing layer.

What it adds

  • benchmark_runner.py — runs prompt_v1 vs prompt_v2 over examples/support_qa.json through workbench.run_evaluation / compare_runs / assess_gate, and publishes a schema-conformed artifact (api/_benchmark_latest.json + bounded api/_benchmark_history.json). DuckDB engine runs in CI; the deployed endpoints only read committed JSON.
  • api/benchmark-latest.py — serves the latest run (stdlib, previous_run delta, pending envelope before the first run).
  • api/stats.py → mode: "live" — Tier-A metrics computed from committed history (eval_runs_total/24h, last_pass_rate, rolling_pass_rate_7d, regressions_caught_30d as a distinct-id union, experiments_tracked). Honest degraded fallback, never 5xx.
  • .github/workflows/benchmark.yml — weekly + workflow_dispatch re-verification that commits the refreshed artifact back and validates freshness.
  • Dashboard — overview + telemetry read Tier-A and render the latest run (variant comparison, deltas, gate verdict). Your /prototype link is preserved.
  • vercel.json — CORS + cache headers for /api/benchmark-latest.
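
As a rough illustration of how Tier-A metrics like those listed above could be derived from a committed history file, here is a minimal sketch. It is not the code in api/stats.py: the record fields (`finished_at`, `pass_rate`, `regression_ids`), the function name, and the choice to average per-run pass rates for the 7-day figure are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def tier_a_metrics(history, now):
    """Summarize run records (hypothetical schema) into Tier-A style metrics."""

    def finished(run):
        # Assumes ISO 8601 timestamps with an explicit UTC offset.
        return datetime.fromisoformat(run["finished_at"])

    def within(run, days):
        return now - finished(run) <= timedelta(days=days)

    last_7d = [r for r in history if within(r, 7)]
    last_30d = [r for r in history if within(r, 30)]

    # Distinct-id union: a regression seen in several runs is counted once.
    regression_ids = set()
    for run in last_30d:
        regression_ids |= set(run.get("regression_ids", []))

    latest = max(history, key=finished) if history else None
    return {
        "eval_runs_total": len(history),
        "eval_runs_24h": sum(1 for r in history if within(r, 1)),
        "last_pass_rate": latest["pass_rate"] if latest else None,
        # Assumption: rolling rate is the mean of per-run pass rates.
        "rolling_pass_rate_7d": (
            sum(r["pass_rate"] for r in last_7d) / len(last_7d) if last_7d else None
        ),
        "regressions_caught_30d": len(regression_ids),
    }


example_history = [
    {"finished_at": "2026-05-26T09:00:00+00:00", "pass_rate": 1.0, "regression_ids": ["a"]},
    {"finished_at": "2026-05-20T09:00:00+00:00", "pass_rate": 0.5, "regression_ids": ["a", "b"]},
    {"finished_at": "2026-04-01T09:00:00+00:00", "pass_rate": 0.0, "regression_ids": ["c"]},
]
example_now = datetime(2026, 5, 26, 12, 0, tzinfo=timezone.utc)
example_metrics = tier_a_metrics(example_history, example_now)
```

Counting regressions through a set keeps the 30-day figure from inflating when the same regression id shows up in consecutive weekly runs, which is one plausible reading of "distinct-id union".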

Seeded run (committed)

prompt_v2 (grounded) passes 4/4 vs prompt_v1 (baseline) 0/4: pass rate +100 pts, 0 regressions, 4 improvements, gate PASS. Every value is computed from committed runs; nothing simulated or seeded.
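
The arithmetic behind these figures can be sketched as follows. This is only an illustration, not workbench.compare_runs or workbench.assess_gate: the function name, the example ids, and the gate rule (pass when there are no regressions and the pass rate does not drop) are assumptions.

```python
def compare_pass_fail(baseline, candidate):
    """Compare two runs given as {example_id: passed} mappings (hypothetical shape)."""
    ids = sorted(baseline.keys() & candidate.keys())
    if not ids:
        raise ValueError("no shared example ids to compare")

    def pass_rate_pct(run):
        return 100.0 * sum(1 for i in ids if run[i]) / len(ids)

    # A regression passed in the baseline and fails in the candidate;
    # an improvement is the reverse.
    regressions = [i for i in ids if baseline[i] and not candidate[i]]
    improvements = [i for i in ids if candidate[i] and not baseline[i]]
    delta_pts = pass_rate_pct(candidate) - pass_rate_pct(baseline)

    # Assumed gate rule for this sketch only.
    gate = "PASS" if not regressions and delta_pts >= 0 else "FAIL"
    return {
        "pass_rate_delta_pts": delta_pts,
        "regressions": regressions,
        "improvements": improvements,
        "gate": gate,
    }


# Mirrors the seeded run: baseline fails all four examples, candidate passes all four.
prompt_v1 = {"q1": False, "q2": False, "q3": False, "q4": False}
prompt_v2 = {"q1": True, "q2": True, "q3": True, "q4": True}
seeded = compare_pass_fail(prompt_v1, prompt_v2)
```

The delta is expressed in percentage points (100 minus 0), which is why "+100 pts" is the accurate unit here rather than a relative percentage change.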

Honesty

mode: "live" reflects a real recurring computed workload (the benchmark), persisted durably and reproducible offline. The engine (workbench.py) is untouched.

Test plan

  • uv run python -m unittest discover -s tests — 25 pass (workbench + cli + stats + endpoints + runner)
  • vitest run — 36 pass
  • next build clean (9 routes)
  • Runner reproduces the seeded artifact deterministically
  • Post-merge: Vercel serves /api/stats (mode:"live") + /api/benchmark-latest; lights up /work/evalops-workbench on eleventh.dev
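
The determinism item in the test plan could be checked along these lines. This is a hedged sketch rather than the project's actual test: it assumes the artifact is a JSON object and that any run-specific fields (the names in `VOLATILE_KEYS` are hypothetical) should be ignored when comparing two runs.

```python
import hashlib
import json

# Hypothetical names for fields that legitimately differ between runs.
VOLATILE_KEYS = {"generated_at", "run_id"}


def artifact_fingerprint(artifact):
    """Hash the stable part of an artifact using a canonical JSON encoding."""
    stable = {k: v for k, v in artifact.items() if k not in VOLATILE_KEYS}
    # sort_keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first_run = {"generated_at": "2026-05-26T09:00:00+00:00", "pass_rate": 1.0, "gate": "PASS"}
second_run = {"gate": "PASS", "pass_rate": 1.0, "generated_at": "2026-05-26T09:05:00+00:00"}
same = artifact_fingerprint(first_run) == artifact_fingerprint(second_run)
```

Two runs that agree on every stable field produce the same fingerprint even if key order or timestamps differ, so a mismatch points at a genuine change in computed values.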

Held for your review (not auto-merged). Refs: outputs/plans/PLAN_C_PROOF_FIRST.md

Layers the Plan C public-proof surface on top of the existing local eval
harness (workbench.py). The canonical engine stays the source of truth and
gains durable, externally-verifiable public telemetry.

- benchmark_runner.py: runs prompt_v1 vs prompt_v2 over examples/support_qa
  through workbench.run_evaluation / compare_runs / assess_gate, and publishes
  a schema-conformed artifact (api/_benchmark_latest.json + bounded history)
- api/benchmark-latest.py: serves the latest run (stdlib, previous_run delta,
  pending envelope before first run)
- api/stats.py -> mode:"live" with benchmark-derived metrics (eval runs, pass
  rate, distinct regressions caught over 30d); honest degraded fallback
- .github/workflows/benchmark.yml: weekly + on-demand re-run that commits the
  refreshed artifact back and validates freshness
- dashboard overview + telemetry now read Tier-A and render the latest run
  (variant comparison, deltas, gate), preserving the /prototype link
- vercel.json: CORS + cache headers for /api/benchmark-latest

Seeded run: prompt_v2 passes 4/4 vs prompt_v1 0/4 (+100 pts pass rate, 0
regressions, gate pass). Every value is computed from committed runs; nothing
simulated or seeded. Tests: 25 unittest + 36 vitest; next build clean. The
engine (workbench.py) is untouched.

Refs: outputs/plans/PLAN_C_PROOF_FIRST.md (Phase 2)

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| evalops-workbench | Ready | Preview, Comment | May 26, 2026 9:14am |

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6737239be

Comment on lines +58 to +59
git add api/_benchmark_latest.json api/_benchmark_history.json \
examples/benchmark-v1/results examples/benchmark-v1/pinned-baseline.json

P1: Stage benchmark outputs from the correct directory

The git add command in the benchmark workflow still references examples/benchmark-v1/..., but this commit writes outputs under examples/benchmark/... (archive/ and latest-report.md). On GitHub Actions, git add with nonexistent pathspecs exits non-zero, so the scheduled/manual benchmark job fails before commit/push and the public artifacts never refresh. Update this step to add the actual output paths produced by benchmark_runner.py.
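
Separately from the path fix suggested above, a staging step could be made more defensive by checking which output paths exist before handing them to git. The sketch below is an assumption-laden illustration, not part of this PR: the helper names are hypothetical, and it reports missing paths instead of silently ignoring them so that a misconfigured path list stays visible.

```python
import subprocess
from pathlib import Path


def split_pathspecs(candidates, repo_root="."):
    """Return (present, missing) lists of relative paths under repo_root."""
    root = Path(repo_root)
    present = [p for p in candidates if (root / p).exists()]
    missing = [p for p in candidates if not (root / p).exists()]
    return present, missing


def stage_outputs(candidates, repo_root="."):
    """Stage only the paths that exist; return the ones that were missing."""
    present, missing = split_pathspecs(candidates, repo_root)
    if present:
        # "--" separates options from pathspecs so odd filenames are not
        # parsed as flags.
        subprocess.run(["git", "add", "--", *present], cwd=repo_root, check=True)
    return missing
```

A caller could fail the job when the returned list is non-empty, which keeps the loud failure the review describes while producing a clearer message about which paths were wrong.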

Distinguishes EvalOps' synthetic benchmark workload from production user
traffic (NexusRAG) on the shared homepage telemetry grid. Fleet-wide field
agreed for all Tier-A prototypes.

package-lock.json is gitignored (scaffold default) and Vercel deploys with
npm install, so cache:"npm" + npm ci failed ("lock file is not found").
Match the repo install strategy so CI is green.
@IgnazioDS IgnazioDS merged commit 7fea53b into main May 26, 2026
3 checks passed
@IgnazioDS IgnazioDS deleted the feat/plan-c-proof-layer branch May 26, 2026 09:17