Plan C: public eval benchmark + Tier-A telemetry by IgnazioDS · Pull Request #3 · IgnazioDS/evalops-workbench

IgnazioDS · 2026-05-26T06:57:33Z

Summary

Graduates EvalOps Workbench from a showcase shell (a CLI that printed project.json) into a working, dependency-free evaluation harness with a public, reproducible benchmark and honest Tier-A telemetry. This is the EvalOps pilot of Plan C (Proof-First).

The system now does what its README promised: load a dataset, run prompt/strategy variants, score with rubric functions, pin a baseline, and surface regressions.
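To illustrate the last step of that pipeline, here is a minimal sketch of a per-case regression diff between a pinned baseline and a candidate run. The helper name `diff_runs`, the score dictionaries, and the case ids are hypothetical and are not taken from this PR's code; the sketch only assumes that each run yields one numeric score per case id.

```python
from typing import Dict, List


def diff_runs(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, List[str]]:
    """Bucket case ids by how the candidate score moved (hypothetical helper)."""
    result: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    # Compare only cases present in both runs; sort so the output is deterministic.
    for case_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[case_id], candidate[case_id]
        if after > before:
            result["improved"].append(case_id)
        elif after < before:
            result["regressed"].append(case_id)
        else:
            result["unchanged"].append(case_id)
    return result


# Illustrative scores only, not the benchmark's data.
baseline_scores = {"case-1": 0.0, "case-2": 1.0, "case-3": 0.5}
candidate_scores = {"case-1": 1.0, "case-2": 0.0, "case-3": 0.5}
report = diff_runs(baseline_scores, candidate_scores)
print(report["regressed"])
```

A per-case diff like this is what lets an aggregate improvement and individual regressions show up in the same report instead of the average hiding the losses.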

What it ships

Eval engine (src/evalops_workbench/eval/, pure stdlib): dataset loader (JSONL/JSON/CSV), SQuAD-style scorers (exact match, token F1, contains-gold), three deterministic extractive-QA strategies, runner + per-case regression diff, pinned-baseline gate, run ledger.
Public fixture (examples/benchmark-v1/, CC0): 38 hand-authored cases over public-domain facts (26 standard + 12 adversarial). The candidate strategy lifts exact match 0.00 → 0.63 and token F1 +0.38, while silently regressing on 12 of 38 cases — the exact trade-off a regression-tracking harness exists to make visible.
Persistence without a backend: benchmark_runner writes a committed artifact (api/_benchmark_latest.json + bounded api/_benchmark_history.json); Vercel redeploys on push and the endpoints read it. No Upstash / KV / Postgres, no secrets.
/api/benchmark-latest (api/benchmark-latest.py): serves the latest run, schema-conformed with a previous_run delta; pending envelope before the first run.
/api/stats → mode: "live": Tier-A metrics (eval_runs_total, eval_runs_24h, last_pass_rate, rolling_pass_rate_7d, regressions_caught_30d, experiments_tracked) computed from the committed history. regressions_caught_30d unions distinct case ids so re-detecting the same regression never inflates it. Honest degraded fallback, never 5xx.
.github/workflows/benchmark.yml: weekly + workflow_dispatch re-verification that re-runs the (deterministic) benchmark, commits the refreshed artifact back, and validates freshness. Framed as reproducibility verification, not synthetic daily churn.
Dashboard: overview + telemetry now render the live benchmark (variant table, candidate-vs-baseline deltas, surfaced regressions, links to the report + fixture).
Project stage Researching → Prototype.
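
For readers unfamiliar with the SQuAD-style scorers listed above, the following stdlib-only sketch shows one common way to define exact match, token F1, and contains-gold. It follows the normalization order used by the public SQuAD evaluation script (lowercase, strip punctuation, drop English articles, collapse whitespace); the function names are assumptions, and the PR's actual scorers may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over normalized tokens."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def contains_gold(prediction: str, gold: str) -> float:
    """1.0 when the normalized gold answer is a substring of the prediction."""
    gold_norm = normalize(gold)
    # Guard against an empty gold string, which would match everything.
    return float(bool(gold_norm) and gold_norm in normalize(prediction))
```

Under these definitions a verbose answer such as "Paris France" against gold "Paris" scores 0 on exact match, 2/3 on token F1, and 1 on contains-gold, which is why the three metrics are usually reported together.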

Honesty

Every value is computed from real committed runs — nothing simulated, seeded, or incremented in memory. The system-under-test is deterministic so any third party reproduces the published numbers with no credentials and no cost. The harness is model-agnostic; a live-LLM target is an optional extension, not used by this benchmark.

Test plan

python -m unittest discover -s tests — 38 pass (engine, endpoints, Tier-A stats)
vitest run — 36 pass
tsc --noEmit clean
next build clean (8 routes)
Benchmark runner reproduces the seeded artifact deterministically
Post-merge: Vercel redeploy serves /api/stats (mode:"live") + /api/benchmark-latest
Post-merge: trigger benchmark.yml via workflow_dispatch to confirm the commit-back loop

Refs: outputs/plans/PLAN_C_PROOF_FIRST.md (Phase 2)

…ark-latest

Graduate EvalOps from a showcase shell to a working evaluation harness with a public, reproducible benchmark. Implements the project's published MVP contract.

- eval engine (stdlib, src/evalops_workbench/eval/): dataset loader, SQuAD-style scorers (exact match, token F1, contains-gold), three extractive-QA strategies, runner + per-case regression diff, pinned-baseline gate, run ledger
- examples/benchmark-v1: 38-case public fixture (CC0 original prose), 26 standard + 12 adversarial. Candidate lifts exact match 0.00 -> 0.63 and token F1 +0.38 while surfacing 12 per-case regressions a regression gate exists to catch
- benchmark_runner publishes a committed artifact (api/_benchmark_latest.json + api/_benchmark_history.json), so persistence needs no external store
- api/benchmark-latest.py serves the latest run (schema-conformed, previous_run)
- api/stats.py flipped to mode:"live" with benchmark-derived metrics (eval_runs, pass rate, distinct regressions caught); honest degraded fallback
- .github/workflows/benchmark.yml: weekly + on-demand reproducibility re-run that commits results back and validates artifact freshness
- dashboard renders the latest benchmark (variant comparison, deltas, regressions)
- project stage Researching -> Prototype

All metrics computed from committed runs; nothing simulated or seeded.

Tests: 38 unittest + 36 vitest green; next build + tsc clean.

Refs: outputs/plans/PLAN_C_PROOF_FIRST.md (Phase 2)

vercel · 2026-05-26T06:57:38Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
evalops-workbench	Ready	Preview, Comment	May 26, 2026 6:58am

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c6806a131


chatgpt-codex-connector · 2026-05-26T07:02:36Z


+def _within(record: dict, days: int, now: datetime) -> bool:
+    stamp = _parse_iso(record.get("generated_at"))
+    return stamp is not None and (now - stamp) <= timedelta(days=days)


Exclude future timestamps from time-window filters

The window predicate currently treats future-dated records as in-window because it only checks (now - stamp) <= window. If a benchmark record is written with a clock-skewed or manually edited future generated_at, it will be counted in eval_runs_24h, rolling_pass_rate_7d, and regressions_caught_30d, inflating live telemetry until real time catches up. Add a lower bound (stamp <= now) so only past records are eligible.


chatgpt-codex-connector · 2026-05-26T07:02:36Z

+    rolling_7d = (
+        round(sum(pass_rates_7d) / len(pass_rates_7d), 4)
+        if pass_rates_7d
+        else float(history[-1].get("pass_rate", 0.0))


Preserve 7-day metric semantics when no recent runs

When there are zero runs in the last 7 days, rolling_pass_rate_7d falls back to the most recent historical pass rate, which can be arbitrarily old. That makes the field report a non-7-day value and masks telemetry staleness during benchmark outages. In that scenario, this metric should be explicitly empty/zeroed (or otherwise marked unavailable) instead of reusing history[-1].
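
A minimal sketch of the "marked unavailable" option follows. It returns `None` when the 7-day window is empty instead of reusing `history[-1]`; the function name `rolling_pass_rate` is hypothetical, and whether the endpoint should emit `null`, zero, or a separate staleness flag is a design decision this sketch does not settle.

```python
import json
from typing import List, Optional


def rolling_pass_rate(pass_rates_7d: List[float]) -> Optional[float]:
    """Mean pass rate over the window, or None when the window has no runs."""
    if not pass_rates_7d:
        # No runs in the last 7 days: report the metric as unavailable.
        return None
    return round(sum(pass_rates_7d) / len(pass_rates_7d), 4)


payload = json.dumps({"rolling_pass_rate_7d": rolling_pass_rate([])})
print(payload)
```

Because `json.dumps` serializes `None` as `null`, a consumer can distinguish "no recent runs" from a genuine pass rate of 0.0.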


IgnazioDS · 2026-05-26T07:27:57Z

Superseded by #4: this branch was built against a stale local clone (before the workbench MVP landed on main) and reinvented the engine. #4 keeps workbench.py and adds only the public-proof layer on top.

vercel Bot deployed to Preview May 26, 2026 06:58

chatgpt-codex-connector Bot reviewed May 26, 2026


IgnazioDS marked this pull request as draft May 26, 2026 07:13

IgnazioDS mentioned this pull request May 26, 2026

Plan C: public proof layer on the workbench engine #4

Merged

5 tasks

IgnazioDS closed this May 26, 2026

IgnazioDS deleted the feat/plan-c-public-proof branch May 26, 2026 07:28

IgnazioDS wants to merge 1 commit into main from feat/plan-c-public-proof
