Plan C: public eval benchmark + Tier-A telemetry #3

Closed
IgnazioDS wants to merge 1 commit into main from feat/plan-c-public-proof

@IgnazioDS
Owner

Summary

Graduates EvalOps Workbench from a showcase shell (a CLI that printed project.json) into a working, dependency-free evaluation harness with a public, reproducible benchmark and honest Tier-A telemetry. This is the EvalOps pilot of Plan C (Proof-First).

The system now does what its README promised: load a dataset, run prompt/strategy variants, score with rubric functions, pin a baseline, and surface regressions.

What it ships

  • Eval engine (src/evalops_workbench/eval/, pure stdlib): dataset loader (JSONL/JSON/CSV), SQuAD-style scorers (exact match, token F1, contains-gold), three deterministic extractive-QA strategies, runner + per-case regression diff, pinned-baseline gate, run ledger.
  • Public fixture (examples/benchmark-v1/, CC0): 38 hand-authored cases over public-domain facts (26 standard + 12 adversarial). The candidate strategy lifts exact match from 0.00 to 0.63 and token F1 by 0.38, while silently regressing on 12 of the 38 cases: exactly the trade-off a regression-tracking harness exists to make visible.
  • Persistence without a backend: benchmark_runner writes a committed artifact (api/_benchmark_latest.json + bounded api/_benchmark_history.json); Vercel redeploys on push and the endpoints read it. No Upstash / KV / Postgres, no secrets.
  • /api/benchmark-latest (api/benchmark-latest.py): serves the latest run, schema-conformed with a previous_run delta; pending envelope before the first run.
  • /api/stats (mode: "live"): Tier-A metrics (eval_runs_total, eval_runs_24h, last_pass_rate, rolling_pass_rate_7d, regressions_caught_30d, experiments_tracked) computed from the committed history. regressions_caught_30d unions distinct case ids so re-detecting the same regression never inflates it. Honest degraded fallback, never 5xx.
  • .github/workflows/benchmark.yml: weekly + workflow_dispatch re-verification that re-runs the (deterministic) benchmark, commits the refreshed artifact back, and validates freshness. Framed as reproducibility verification, not synthetic daily churn.
  • Dashboard: overview + telemetry now render the live benchmark (variant table, candidate-vs-baseline deltas, surfaced regressions, links to the report + fixture).
  • Project stage Researching → Prototype.
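The scorers and the per-case regression diff named in the list above can be sketched as follows. This is a minimal illustration, assuming the conventional SQuAD answer normalization (lowercase, strip punctuation, drop articles, collapse whitespace); the function names and the `regressed_cases` helper are hypothetical and are not taken from src/evalops_workbench/eval/.

```python
import re
import string
from collections import Counter

_ARTICLES = re.compile(r"\b(a|an|the)\b")


def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred = normalize(prediction).split()
    ref = normalize(gold).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def contains_gold(prediction: str, gold: str) -> float:
    gold_norm = normalize(gold)
    return float(bool(gold_norm) and gold_norm in normalize(prediction))


def regressed_cases(baseline: dict, candidate: dict) -> list:
    """Case ids whose candidate score is lower than the pinned baseline.

    Hypothetical helper: both arguments map case id -> score.
    """
    return sorted(
        case_id
        for case_id, base in baseline.items()
        if candidate.get(case_id, 0.0) < base
    )
```

A per-case diff like `regressed_cases` is what lets an aggregate improvement and individual regressions show up side by side: the mean can rise while specific case ids still drop below their baseline score.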

Honesty

Every value is computed from real committed runs — nothing simulated, seeded, or incremented in memory. The system-under-test is deterministic so any third party reproduces the published numbers with no credentials and no cost. The harness is model-agnostic; a live-LLM target is an optional extension, not used by this benchmark.
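As an illustration of deriving a metric from committed history rather than from an in-memory counter, a distinct-regression count might look like the sketch below. The record shape is an assumption: `generated_at` appears in api/stats.py, but `regressed_case_ids` is a hypothetical field name, and the timestamps are assumed to be ISO 8601 strings with an explicit UTC offset.

```python
from datetime import datetime, timedelta, timezone


def distinct_regressions(history: list, now: datetime, days: int = 30) -> int:
    """Count distinct regressed case ids across runs inside the window."""
    seen = set()
    for record in history:
        stamp = datetime.fromisoformat(record["generated_at"])
        age = now - stamp
        if timedelta(0) <= age <= timedelta(days=days):
            # Set union: a case caught in several runs is counted once.
            seen.update(record.get("regressed_case_ids", []))
    return len(seen)


if __name__ == "__main__":
    now = datetime(2026, 5, 26, tzinfo=timezone.utc)
    history = [
        {"generated_at": "2026-05-19T00:00:00+00:00", "regressed_case_ids": ["c1", "c2"]},
        {"generated_at": "2026-05-25T00:00:00+00:00", "regressed_case_ids": ["c2", "c3"]},
    ]
    print(distinct_regressions(history, now))
```

Because the result is a pure function of the committed records, re-running the same benchmark adds a history entry without changing the count.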

Test plan

  • python -m unittest discover -s tests — 38 pass (engine, endpoints, Tier-A stats)
  • vitest run — 36 pass
  • tsc --noEmit clean
  • next build clean (8 routes)
  • Benchmark runner reproduces the seeded artifact deterministically
  • Post-merge: Vercel redeploy serves /api/stats (mode:"live") + /api/benchmark-latest
  • Post-merge: trigger benchmark.yml via workflow_dispatch to confirm the commit-back loop

Refs: outputs/plans/PLAN_C_PROOF_FIRST.md (Phase 2)

…ark-latest

Graduate EvalOps from a showcase shell to a working evaluation harness with a
public, reproducible benchmark. Implements the project's published MVP contract.

- eval engine (stdlib, src/evalops_workbench/eval/): dataset loader, SQuAD-style
  scorers (exact match, token F1, contains-gold), three extractive-QA strategies,
  runner + per-case regression diff, pinned-baseline gate, run ledger
- examples/benchmark-v1: 38-case public fixture (CC0 original prose), 26 standard
  + 12 adversarial. Candidate lifts exact match 0.00 -> 0.63 and token F1 +0.38
  while surfacing 12 per-case regressions a regression gate exists to catch
- benchmark_runner publishes a committed artifact (api/_benchmark_latest.json +
  api/_benchmark_history.json), so persistence needs no external store
- api/benchmark-latest.py serves the latest run (schema-conformed, previous_run)
- api/stats.py flipped to mode:"live" with benchmark-derived metrics
  (eval_runs, pass rate, distinct regressions caught); honest degraded fallback
- .github/workflows/benchmark.yml: weekly + on-demand reproducibility re-run that
  commits results back and validates artifact freshness
- dashboard renders the latest benchmark (variant comparison, deltas, regressions)
- project stage Researching -> Prototype

All metrics computed from committed runs; nothing simulated or seeded.
Tests: 38 unittest + 36 vitest green; next build + tsc clean.

Refs: outputs/plans/PLAN_C_PROOF_FIRST.md (Phase 2)
@vercel

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| evalops-workbench | Ready | Preview, Comment | May 26, 2026 6:58am |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c6806a131

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread api/stats.py

def _within(record: dict, days: int, now: datetime) -> bool:
    stamp = _parse_iso(record.get("generated_at"))
    return stamp is not None and (now - stamp) <= timedelta(days=days)

P2: Exclude future timestamps from time-window filters

The window predicate currently treats future-dated records as in-window because it only checks (now - stamp) <= window. If a benchmark record is written with a clock-skewed or manually edited future generated_at, it will be counted in eval_runs_24h, rolling_pass_rate_7d, and regressions_caught_30d, inflating live telemetry until real time catches up. Add a lower bound (stamp <= now) so only past records are eligible.

Comment thread api/stats.py
    rolling_7d = (
        round(sum(pass_rates_7d) / len(pass_rates_7d), 4)
        if pass_rates_7d
        else float(history[-1].get("pass_rate", 0.0))
    )

P2: Preserve 7-day metric semantics when no recent runs

When there are zero runs in the last 7 days, rolling_pass_rate_7d falls back to the most recent historical pass rate, which can be arbitrarily old. That makes the field report a non-7-day value and masks telemetry staleness during benchmark outages. In that scenario, this metric should be explicitly empty/zeroed (or otherwise marked unavailable) instead of reusing history[-1].
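A minimal sketch of the "marked unavailable" option, assuming the metric may be reported as null in the JSON payload. The function name `rolling_pass_rate` is hypothetical; only `pass_rates_7d` comes from the quoted code.

```python
import json
from typing import Optional, Sequence


def rolling_pass_rate(pass_rates_7d: Sequence[float]) -> Optional[float]:
    """Mean pass rate over the window, or None when the window is empty."""
    if not pass_rates_7d:
        # No runs in the last 7 days: report "unavailable" instead of
        # reusing an arbitrarily old value from history[-1].
        return None
    return round(sum(pass_rates_7d) / len(pass_rates_7d), 4)


if __name__ == "__main__":
    # None serializes to JSON null, so consumers can tell stale from zero.
    print(json.dumps({"rolling_pass_rate_7d": rolling_pass_rate([])}))
    print(json.dumps({"rolling_pass_rate_7d": rolling_pass_rate([0.5, 0.75])}))
```

Returning None rather than 0.0 keeps "no data" distinguishable from "every case failed", at the cost of requiring consumers such as the dashboard to handle a null value.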

@IgnazioDS marked this pull request as draft May 26, 2026 07:13
@IgnazioDS
Owner Author

Superseded by #4: this branch was built against a stale local clone (before the workbench MVP landed on main) and reinvented the engine. #4 keeps workbench.py and adds only the public-proof layer on top.

@IgnazioDS closed this May 26, 2026
@IgnazioDS deleted the feat/plan-c-public-proof branch May 26, 2026 07:28