Plan C: public proof layer on the workbench engine by IgnazioDS · Pull Request #4 · IgnazioDS/evalops-workbench

IgnazioDS · 2026-05-26T07:27:18Z

Summary

Layers the Plan C public-proof surface on top of the existing workbench.py eval engine (the local-harness MVP already on main). The engine stays the source of truth; EvalOps gains durable, externally-verifiable public telemetry.

This supersedes #3, which was built against a stale local clone (before the workbench MVP landed) and reinvented the engine. This PR keeps your engine and adds only the missing layer.

What it adds

benchmark_runner.py — runs prompt_v1 vs prompt_v2 over examples/support_qa.json through workbench.run_evaluation / compare_runs / assess_gate, and publishes a schema-conformed artifact (api/_benchmark_latest.json + bounded api/_benchmark_history.json). DuckDB engine runs in CI; the deployed endpoints only read committed JSON.
api/benchmark-latest.py — serves the latest run (stdlib, previous_run delta, pending envelope before the first run).
api/stats.py → mode: "live" — Tier-A metrics computed from committed history (eval_runs_total/24h, last_pass_rate, rolling_pass_rate_7d, regressions_caught_30d as a distinct-id union, experiments_tracked). Honest degraded fallback, never 5xx.
.github/workflows/benchmark.yml — weekly + workflow_dispatch re-verification that commits the refreshed artifact back and validates freshness.
Dashboard — overview + telemetry read Tier-A and render the latest run (variant comparison, deltas, gate verdict). Your /prototype link is preserved.
vercel.json — CORS + cache headers for /api/benchmark-latest.
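
The Tier-A metrics listed for api/stats.py could be computed along these lines. This is a minimal sketch, not the code in this PR: the history record shape (`run_at`, `pass_rate`, `regression_ids`), the function names, and the `"degraded"` mode value are assumptions made for illustration. Only the metric names, the distinct-id union, and the "never 5xx" behavior come from the description above. `experiments_tracked` is omitted because its definition is not given.

```python
import json
from datetime import datetime, timedelta, timezone


def tier_a_metrics(history, now):
    """Compute Tier-A style metrics from a list of run records.

    Assumed (hypothetical) record shape:
      {"run_at": ISO-8601 string, "pass_rate": float, "regression_ids": [str, ...]}
    """
    def age(run):
        return now - datetime.fromisoformat(run["run_at"])

    # ISO-8601 strings in one consistent format sort chronologically.
    runs = sorted(history, key=lambda r: r["run_at"])
    last_7d = [r for r in runs if age(r) <= timedelta(days=7)]
    last_30d = [r for r in runs if age(r) <= timedelta(days=30)]

    # Distinct-id union: a regression caught by several runs counts once.
    caught = set()
    for r in last_30d:
        caught.update(r.get("regression_ids", []))

    return {
        "mode": "live",
        "eval_runs_total": len(runs),
        "eval_runs_24h": sum(1 for r in runs if age(r) <= timedelta(hours=24)),
        "last_pass_rate": runs[-1]["pass_rate"] if runs else None,
        "rolling_pass_rate_7d": (
            sum(r["pass_rate"] for r in last_7d) / len(last_7d) if last_7d else None
        ),
        "regressions_caught_30d": len(caught),
    }


def stats_envelope(history_json, now):
    """Return a payload for every input; bad history degrades instead of raising."""
    try:
        return tier_a_metrics(json.loads(history_json), now)
    except (ValueError, KeyError, TypeError):
        # Assumed fallback shape: report the degraded state rather than fail.
        return {"mode": "degraded"}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the computation reproducible offline against a committed history file.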

Seeded run (committed)

prompt_v2 (grounded) passes 4/4 vs prompt_v1 (baseline) 0/4: pass rate +100 pts, 0 regressions, 4 improvements, gate PASS. Every value computed from committed runs; nothing simulated or seeded.
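
The headline numbers follow from a per-example comparison of the two runs. A minimal sketch of that arithmetic, assuming each run is a mapping from example id to pass/fail: the function name `compare`, the example ids, and the gate rule used here (pass when there are no regressions and the pass rate does not drop) are illustrative assumptions, not the actual behavior of workbench.compare_runs or workbench.assess_gate.

```python
def compare(baseline, candidate):
    """Compare two runs given as {example_id: passed} dicts (hypothetical shape)."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no examples")

    # A regression passed in the baseline and fails in the candidate;
    # an improvement is the reverse.
    regressions = [e for e in shared if baseline[e] and not candidate[e]]
    improvements = [e for e in shared if not baseline[e] and candidate[e]]

    base_rate = sum(baseline[e] for e in shared) / len(shared)
    cand_rate = sum(candidate[e] for e in shared) / len(shared)
    delta_pts = round((cand_rate - base_rate) * 100, 1)

    return {
        "pass_rate_delta_pts": delta_pts,
        "regressions": regressions,
        "improvements": improvements,
        # Assumed gate rule, for illustration only.
        "gate": "PASS" if not regressions and delta_pts >= 0 else "FAIL",
    }


# Hypothetical example ids standing in for the four support_qa examples.
v1 = {"q1": False, "q2": False, "q3": False, "q4": False}
v2 = {"q1": True, "q2": True, "q3": True, "q4": True}
print(compare(v1, v2))
```

With a baseline at 0/4 and a candidate at 4/4, the delta is 100 percentage points. It is reported in points rather than as a relative percentage because a relative change from a 0% baseline is undefined.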

Honesty

mode: "live" reflects a real recurring computed workload (the benchmark), persisted durably and reproducible offline. The engine (workbench.py) is untouched.

Test plan

uv run python -m unittest discover -s tests — 25 pass (workbench + cli + stats + endpoints + runner)
vitest run — 36 pass
next build clean (9 routes)
Runner reproduces the seeded artifact deterministically
Post-merge: Vercel serves /api/stats (mode:"live") + /api/benchmark-latest; lights up /work/evalops-workbench on eleventh.dev

Held for your review (not auto-merged). Refs: outputs/plans/PLAN_C_PROOF_FIRST.md

Layers the Plan C public-proof surface on top of the existing local eval harness (workbench.py). The canonical engine stays the source of truth and gains durable, externally-verifiable public telemetry.

- benchmark_runner.py: runs prompt_v1 vs prompt_v2 over examples/support_qa through workbench.run_evaluation / compare_runs / assess_gate, and publishes a schema-conformed artifact (api/_benchmark_latest.json + bounded history)
- api/benchmark-latest.py: serves the latest run (stdlib, previous_run delta, pending envelope before first run)
- api/stats.py -> mode:"live" with benchmark-derived metrics (eval runs, pass rate, distinct regressions caught over 30d); honest degraded fallback
- .github/workflows/benchmark.yml: weekly + on-demand re-run that commits the refreshed artifact back and validates freshness
- dashboard overview + telemetry now read Tier-A and render the latest run (variant comparison, deltas, gate), preserving the /prototype link
- vercel.json: CORS + cache headers for /api/benchmark-latest

Seeded run: prompt_v2 passes 4/4 vs prompt_v1 0/4 (+100% pass rate, 0 regressions, gate pass). Every value is computed from committed runs; nothing simulated or seeded.

Tests: 25 unittest + 36 vitest; next build clean. The engine (workbench.py) is untouched.

Refs: outputs/plans/PLAN_C_PROOF_FIRST.md (Phase 2)

vercel · 2026-05-26T07:27:23Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
evalops-workbench	Ready	Preview, Comment	May 26, 2026 9:14am

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6737239be

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-26T07:30:57Z

+          git add api/_benchmark_latest.json api/_benchmark_history.json \
+                  examples/benchmark-v1/results examples/benchmark-v1/pinned-baseline.json


Stage benchmark outputs from the correct directory

The git add command in the benchmark workflow still references examples/benchmark-v1/..., but this commit writes outputs under examples/benchmark/... (archive/ and latest-report.md). On GitHub Actions, git add with nonexistent pathspecs exits non-zero, so the scheduled/manual benchmark job fails before commit/push and the public artifacts never refresh. Update this step to add the actual output paths produced by benchmark_runner.py.

Useful? React with 👍 / 👎.
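
One way to guard against this failure mode, sketched in Python rather than in the workflow's shell step, is to filter the candidate paths down to those that exist before calling `git add`. The helper names `existing_paths` and `stage`, and the idea of staging from a script at all, are assumptions for illustration. The fix the review actually suggests is simpler: list the paths that benchmark_runner.py really writes.

```python
import subprocess
from pathlib import Path


def existing_paths(candidates, root="."):
    """Return the candidate paths that exist under root, preserving order."""
    base = Path(root)
    return [p for p in candidates if (base / p).exists()]


def stage(candidates, root="."):
    """Stage whichever candidates exist; return the list that was staged."""
    paths = existing_paths(candidates, root)
    if paths:
        # "--" keeps git from reading a path as an option.
        subprocess.run(["git", "add", "--", *paths], cwd=root, check=True)
    return paths
```

Skipping missing paths silently can hide exactly the kind of stale-path bug described above, so a CI job may prefer to log or fail when `existing_paths` returns fewer entries than it was given.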

IgnazioDS mentioned this pull request May 26, 2026

Plan C: public eval benchmark + Tier-A telemetry #3

Closed

7 tasks

vercel Bot deployed to Preview May 26, 2026 07:28 View deployment

chatgpt-codex-connector Bot reviewed May 26, 2026


feat(stats): add workload:"benchmark" disclosure to the live envelope

072c65c

Distinguishes EvalOps' synthetic benchmark workload from production user traffic (NexusRAG) on the shared homepage telemetry grid. Fleet-wide field agreed for all Tier-A prototypes.

vercel Bot deployed to Preview May 26, 2026 08:01 View deployment

ci: fix quality workflow — drop npm cache, use npm install

a70ca56

package-lock.json is gitignored (scaffold default) and Vercel deploys with npm install, so cache:"npm" + npm ci failed ("lock file is not found"). Match the repo install strategy so CI is green.

vercel Bot deployed to Preview May 26, 2026 09:14 View deployment

IgnazioDS merged commit 7fea53b into main May 26, 2026
3 checks passed

IgnazioDS deleted the feat/plan-c-proof-layer branch May 26, 2026 09:17

IgnazioDS merged 3 commits into main from feat/plan-c-proof-layer
