Plan C: public eval benchmark + Tier-A telemetry #3
…ark-latest

Graduate EvalOps from a showcase shell to a working evaluation harness with a public, reproducible benchmark. Implements the project's published MVP contract.

- eval engine (stdlib, src/evalops_workbench/eval/): dataset loader, SQuAD-style scorers (exact match, token F1, contains-gold), three extractive-QA strategies, runner + per-case regression diff, pinned-baseline gate, run ledger
- examples/benchmark-v1: 38-case public fixture (CC0 original prose), 26 standard + 12 adversarial. Candidate lifts exact match 0.00 -> 0.63 and token F1 +0.38 while surfacing 12 per-case regressions a regression gate exists to catch
- benchmark_runner publishes a committed artifact (api/_benchmark_latest.json + api/_benchmark_history.json), so persistence needs no external store
- api/benchmark-latest.py serves the latest run (schema-conformed, previous_run)
- api/stats.py flipped to mode:"live" with benchmark-derived metrics (eval_runs, pass rate, distinct regressions caught); honest degraded fallback
- .github/workflows/benchmark.yml: weekly + on-demand reproducibility re-run that commits results back and validates artifact freshness
- dashboard renders the latest benchmark (variant comparison, deltas, regressions)
- project stage Researching -> Prototype

All metrics computed from committed runs; nothing simulated or seeded.

Tests: 38 unittest + 36 vitest green; next build + tsc clean.

Refs: outputs/plans/PLAN_C_PROOF_FIRST.md (Phase 2)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8c6806a131
```python
def _within(record: dict, days: int, now: datetime) -> bool:
    stamp = _parse_iso(record.get("generated_at"))
    return stamp is not None and (now - stamp) <= timedelta(days=days)
```
Exclude future timestamps from time-window filters
The window predicate currently treats future-dated records as in-window because it only checks (now - stamp) <= window. If a benchmark record is written with a clock-skewed or manually edited future generated_at, it will be counted in eval_runs_24h, rolling_pass_rate_7d, and regressions_caught_30d, inflating live telemetry until real time catches up. Add a lower bound (stamp <= now) so only past records are eligible.
```python
rolling_7d = (
    round(sum(pass_rates_7d) / len(pass_rates_7d), 4)
    if pass_rates_7d
    else float(history[-1].get("pass_rate", 0.0))
)
```
Preserve 7-day metric semantics when no recent runs
When there are zero runs in the last 7 days, rolling_pass_rate_7d falls back to the most recent historical pass rate, which can be arbitrarily old. That makes the field report a non-7-day value and masks telemetry staleness during benchmark outages. In that scenario, this metric should be explicitly empty/zeroed (or otherwise marked unavailable) instead of reusing history[-1].
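A minimal sketch of the "marked unavailable" option follows. The function name `rolling_pass_rate` is hypothetical, and returning `None` (serialized as JSON `null`) is one possible convention rather than something the PR prescribes; zeroing the field would be the other option the comment mentions.

```python
import json
from typing import Optional, Sequence


def rolling_pass_rate(pass_rates_7d: Sequence[float]) -> Optional[float]:
    """Mean pass rate over the window, or None when the window is empty."""
    if not pass_rates_7d:
        # No runs in the last 7 days: report "unavailable" instead of
        # reusing an arbitrarily old value from history[-1].
        return None
    return round(sum(pass_rates_7d) / len(pass_rates_7d), 4)


# An empty window serializes as null, which a dashboard can render as "no data".
payload = json.dumps({"rolling_pass_rate_7d": rolling_pass_rate([])})
```

A consumer can then distinguish "no recent runs" from a genuine pass rate of 0.0, which a zeroed field would not allow.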
Summary
Graduates EvalOps Workbench from a showcase shell (a CLI that printed `project.json`) into a working, dependency-free evaluation harness with a public, reproducible benchmark and honest Tier-A telemetry. This is the EvalOps pilot of Plan C (Proof-First).

The system now does what its README promised: load a dataset, run prompt/strategy variants, score with rubric functions, pin a baseline, and surface regressions.
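For readers unfamiliar with the rubric functions named in this PR, here is a rough stdlib sketch of SQuAD-style scoring. It follows the widely used SQuAD normalization (lowercase, strip punctuation and English articles, collapse whitespace). The function names and the substring definition of contains-gold are assumptions for illustration, not the code in `src/evalops_workbench/eval/`.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def contains_gold(prediction: str, gold: str) -> float:
    """Assumed definition: the normalized gold answer is a substring of the prediction."""
    gold_norm = normalize(gold)
    return float(bool(gold_norm) and gold_norm in normalize(prediction))
```

A verbose answer such as "in Paris France" against the gold answer "Paris" fails exact match but still earns partial token F1 and passes contains-gold, which is why the three scores can move independently.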
What it ships
- `src/evalops_workbench/eval/` (pure stdlib): dataset loader (JSONL/JSON/CSV), SQuAD-style scorers (exact match, token F1, contains-gold), three deterministic extractive-QA strategies, runner + per-case regression diff, pinned-baseline gate, run ledger.
- `examples/benchmark-v1/` (CC0): 38 hand-authored cases over public-domain facts (26 standard + 12 adversarial). The candidate strategy lifts exact match 0.00 → 0.63 and token F1 +0.38, while silently regressing on 12 of 38 cases: the exact trade-off a regression-tracking harness exists to make visible.
- `benchmark_runner` writes a committed artifact (`api/_benchmark_latest.json` + bounded `api/_benchmark_history.json`); Vercel redeploys on push and the endpoints read it. No Upstash / KV / Postgres, no secrets.
- `/api/benchmark-latest` (`api/benchmark-latest.py`): serves the latest run, schema-conformed with a `previous_run` delta; `pending` envelope before the first run.
- `/api/stats` → `mode: "live"`: Tier-A metrics (`eval_runs_total`, `eval_runs_24h`, `last_pass_rate`, `rolling_pass_rate_7d`, `regressions_caught_30d`, `experiments_tracked`) computed from the committed history. `regressions_caught_30d` unions distinct case ids so re-detecting the same regression never inflates it. Honest `degraded` fallback, never 5xx.
- `.github/workflows/benchmark.yml`: weekly + `workflow_dispatch` re-verification that re-runs the (deterministic) benchmark, commits the refreshed artifact back, and validates freshness. Framed as reproducibility verification, not synthetic daily churn.

Honesty
Every value is computed from real committed runs — nothing simulated, seeded, or incremented in memory. The system-under-test is deterministic so any third party reproduces the published numbers with no credentials and no cost. The harness is model-agnostic; a live-LLM target is an optional extension, not used by this benchmark.
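To illustrate how a per-case regression diff and a distinct-id count could be derived purely from committed run records, here is a small sketch. The `regressed_case_ids` field and both function names are hypothetical; the PR does not show the artifact schema, so treat this as one plausible shape rather than the harness's actual code.

```python
from typing import Dict, Iterable, List, Set


def per_case_regressions(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> List[str]:
    """Case ids whose candidate score fell below the pinned baseline score."""
    return sorted(
        case_id
        for case_id, base_score in baseline.items()
        if case_id in candidate and candidate[case_id] < base_score
    )


def distinct_regressions(records: Iterable[dict]) -> int:
    """Count unique regressed case ids across runs (a set union, not a running sum)."""
    seen: Set[str] = set()
    for record in records:
        # Hypothetical field name; a missing field contributes nothing.
        seen.update(record.get("regressed_case_ids", []))
    return len(seen)
```

Because the count is a set union, a weekly re-run that re-detects the same regressed cases leaves the total unchanged, while the per-case diff still flags a case that got worse even when the aggregate score improved.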
Test plan
- `python -m unittest discover -s tests`: 38 pass (engine, endpoints, Tier-A stats)
- `vitest run`: 36 pass
- `tsc --noEmit` clean
- `next build` clean (8 routes)
- `/api/stats` (`mode:"live"`) + `/api/benchmark-latest`
- `benchmark.yml` via `workflow_dispatch` to confirm the commit-back loop

Refs: `outputs/plans/PLAN_C_PROOF_FIRST.md` (Phase 2)