test(evals): add offline fidelity regression gate (recall-based, zero-model) by agaonker · Pull Request #1187 · chopratejas/headroom

agaonker · 2026-06-20T07:17:13Z

Description

Headroom's lossy compression drops rows/lines using statistical heuristics but never checks that meaning survived: a dropped "OOM killed worker 3" line can silently flip a model's answer with no signal that compression caused it. The repo already ships a quality-metric toolkit (headroom/evals/metrics.py) and a weekly-suite eval job, but neither gates the compression path on a PR.

This adds a per-PR fidelity regression gate: it compresses vendored golden tool-outputs through SmartCrusher's lossy path and asserts that the evidence answering each case's question survives. It is the first of a planned trio (the "offline gate" part of the fidelity work); query-aware retention and a hard token-budget API are documented follow-ups.

Closes #

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Performance improvement
Code refactoring (no functional changes)

Changes Made

- Blocking gate (tests/test_compression_fidelity_regression.py): compresses each golden case via smart_crush_tool_output(..., with_compaction=False) and scores with compute_information_recall. Two assertions:
  - Per-case critical recall == 1.0: every answer_evidence string (placed in error/anomaly rows, the documented SmartCrusher retention guarantee) must survive.
  - Aggregate recall ≥ committed baseline (baseline.json, tol 0.02): catches softer regressions.
- Vendored fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits cases.json (4 cases: OOM crash, payment exception, latency anomaly, CI failure) + baseline.json.
- Non-blocking weekly report (.github/workflows/eval.yml): one step in the existing weekly-suite job (schedule/manual only) reuses the existing evaluate_information_retention runner for a recall report on the production routing path.
- Pure reuse: scoring (evals/metrics.py), compressor (smart_crush_tool_output), and the weekly runner (evaluate_information_retention) all already existed.
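
For readers who want the shape of the two assertions without opening the test file, here is a minimal sketch. It does not call the real headroom APIs: `recall`, `toy_compress`, `critical_recall`, `aggregate_ok`, and the `CASE` layout are hypothetical stand-ins, and the substring-based scoring is an assumption about how a recall metric of this kind could work, not a description of compute_information_recall. Only the `answer_evidence` field name and the 0.02 tolerance come from the PR text.

```python
import json


def recall(compressed: str, evidence: list[str]) -> float:
    """Hypothetical scorer: fraction of evidence strings found verbatim."""
    if not evidence:
        return 1.0
    kept = sum(1 for item in evidence if item in compressed)
    return kept / len(evidence)


def toy_compress(rows: list[dict]) -> str:
    """Hypothetical lossy compressor: keep error rows plus first and last row."""
    last = len(rows) - 1
    kept = [r for i, r in enumerate(rows) if r["level"] == "error" or i in (0, last)]
    return json.dumps(kept)


# Illustrative case; the real cases live in cases.json.
CASE = {
    "id": "logs_oom",
    "rows": [{"level": "info", "msg": f"heartbeat ping {i}"} for i in range(50)]
    + [{"level": "error", "msg": "OOM killed worker 3"}]
    + [{"level": "info", "msg": f"heartbeat ping {i}"} for i in range(50, 60)],
    "answer_evidence": ["OOM killed worker 3"],
}


def critical_recall(case: dict) -> float:
    """Assertion 1 input: recall of the answer evidence after compression."""
    return recall(toy_compress(case["rows"]), case["answer_evidence"])


def aggregate_ok(scores: list[float], baseline: float, tol: float = 0.02) -> bool:
    """Assertion 2: mean recall must not fall below baseline minus tolerance."""
    return sum(scores) / len(scores) >= baseline - tol


# Per-case check: the error row survives, so critical recall is exactly 1.0.
assert critical_recall(CASE) == 1.0
# Aggregate check against an illustrative baseline (not the repo's baseline.json).
assert aggregate_ok([1.0, 0.95], baseline=0.99)
assert not aggregate_ok([1.0, 0.90], baseline=0.99)
```

The per-case check is strict equality because the evidence sits in rows the compressor is documented to retain; the aggregate check carries a tolerance so that small, intended shifts do not block a PR.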

Design notes

Zero new CI setup. The blocking gate runs in the existing [dev] test shard — no new workflow, no new deps, no model, no network, no secrets (verified under HF_HUB_OFFLINE=1). It deliberately uses small hand-made structured fixtures rather than the repo's HuggingFace dataset loaders, which would require a network download + ModernBERT and don't belong in a fast PR gate.
Scope: structured JSON tool-output (the dominant, deterministic, model-free case). Real-dataset (HotpotQA/BFCL) recall — which needs [all] + a local model — is a documented follow-up PR, and the weekly-suite job (which genuinely runs every Monday) is its natural home.
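
As an illustration of what a small, hand-made, deterministic fixture generator can look like, the sketch below builds one structured case and serializes it reproducibly. It is not the PR's _generate.py: `build_case`, `dump`, the `question` text, and the `rss_mb` column are hypothetical, and only the `answer_evidence` field name and the idea of placing evidence in an error row come from the description above.

```python
import json
import random


def build_case(seed: int = 0) -> dict:
    """Build one illustrative golden case with the evidence in an error row."""
    rng = random.Random(seed)  # local seeded RNG: same seed gives the same rows
    rows = [
        {"level": "info", "msg": f"heartbeat ping {i}", "rss_mb": rng.randint(200, 260)}
        for i in range(40)
    ]
    # Bury the evidence among the benign rows, where a naive truncation would lose it.
    rows.insert(25, {"level": "error", "msg": "OOM killed worker 3", "rss_mb": 4096})
    return {
        "id": "logs_oom",
        "question": "Why did worker 3 stop?",  # hypothetical wording
        "rows": rows,
        "answer_evidence": ["OOM killed worker 3"],
    }


def dump(cases: list[dict]) -> str:
    """Serialize with sorted keys and fixed indent so reruns are byte-identical."""
    return json.dumps(cases, indent=2, sort_keys=True) + "\n"


# Two independent runs produce the same bytes, so the committed file never churns.
assert dump([build_case()]) == dump([build_case()])
```

Because nothing here touches the network or a model, a generator in this style can run in the same environment as the gate itself.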

Testing

Unit tests pass (pytest)
Linting passes (ruff check .)
Type checking passes (mypy headroom)
New tests added for new functionality
Manual testing performed

Test Output

$ HF_HUB_OFFLINE=1 python -m pytest tests/test_compression_fidelity_regression.py -v
tests/test_compression_fidelity_regression.py::test_critical_evidence_survives_compression[logs_oom] PASSED
tests/test_compression_fidelity_regression.py::test_critical_evidence_survives_compression[payment_exception] PASSED
tests/test_compression_fidelity_regression.py::test_critical_evidence_survives_compression[latency_anomaly] PASSED
tests/test_compression_fidelity_regression.py::test_critical_evidence_survives_compression[ci_test_failures] PASSED
tests/test_compression_fidelity_regression.py::test_aggregate_recall_not_regressed PASSED
============================== 5 passed in 0.18s ===============================

Real Behavior Proof

Environment: local checkout of feat/fidelity-regression-gate, pip install -e ".[dev]", HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 (proves no model/network).
Exact command / steps: HF_HUB_OFFLINE=1 python -m pytest tests/test_compression_fidelity_regression.py -q → 5 passed in 0.14s.
Negative control (proves the gate has teeth): compressing logs_oom and probing for a benign row that compression legitimately drops returns recall = 0.00, lost = ['heartbeat ping 25']. The scorer therefore reports a loss whenever a probed string is dropped, so the gate would fire if critical evidence were dropped and is not trivially green.
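
The negative control can be reproduced in spirit with a few lines. This is a sketch under stated assumptions: `recall_with_lost` and `toy_compress` are hypothetical helpers, not the headroom scorer or SmartCrusher, and the log rows are invented to mirror the strings quoted above.

```python
def recall_with_lost(compressed: str, probes: list[str]) -> tuple[float, list[str]]:
    """Hypothetical scorer: return recall plus the probes that went missing."""
    lost = [p for p in probes if p not in compressed]
    score = 1.0 - len(lost) / len(probes) if probes else 1.0
    return score, lost


def toy_compress(rows: list[str]) -> str:
    """Hypothetical lossy compressor: keep the first two rows and any ERROR row."""
    kept = [r for i, r in enumerate(rows) if i < 2 or r.startswith("ERROR")]
    return "\n".join(kept)


rows = [f"heartbeat ping {i}" for i in range(50)] + ["ERROR OOM killed worker 3"]
compressed = toy_compress(rows)

# Positive case: the critical evidence survives, so nothing is reported lost.
print(recall_with_lost(compressed, ["OOM killed worker 3"]))

# Negative control: a benign row that was dropped must register as a loss.
print(recall_with_lost(compressed, ["heartbeat ping 25"]))
```

If the second call ever reported full recall, the scorer would be unable to detect a dropped row at all, and a green gate would mean nothing.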

Weekly (non-blocking) step verified locally:

Information retention: 50/50 cases >=0.9 recall, avg compression 65.7%

Not tested: real-dataset (HotpotQA/BFCL) recall and prose/ModernBERT compression — intentionally deferred to a follow-up PR targeting the weekly job.

Review Readiness

I have performed a self-review
This PR is ready for human review

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have updated the CHANGELOG.md if applicable

Additional Notes

CHANGELOG/version intentionally untouched: repo uses release-please.
Follow-up PR (planned): wire the real HotpotQA/BFCL loaders (headroom/evals/datasets.py) into the weekly-suite job for genuine benchmark-scale recall coverage (model-allowed, non-blocking). Further follow-ups from the same design: a live per-request fidelity guardrail, query-aware lossy retention, and a hard target_tokens budget API.

…-model)

Headroom's lossy compression drops rows/lines by heuristic but never checks that meaning survived: a dropped "OOM killed worker 3" line can flip a model's answer with no signal. This adds a per-PR gate that catches such regressions.

Blocking gate (tests/test_compression_fidelity_regression.py): compress vendored golden tool-outputs via SmartCrusher's lossy path, then assert the evidence that answers each case's question survives. Per-case critical recall must be 1.0 (error/anomaly rows, the documented retention guarantee); aggregate recall must hold vs a committed baseline. Pure-stdlib scoring (reuses headroom.evals.metrics.compute_information_recall) needs no model, no network, and no secrets, so it runs in the existing [dev] shard with zero added CI setup.

Fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits cases.json (4 cases) + baseline.json. Regenerate via the generator.

Non-blocking weekly report: one step in eval.yml's weekly-suite job reuses the existing evaluate_information_retention runner (synthetic cases, production routing path). Real-dataset (HotpotQA/BFCL) recall is a documented follow-up PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-20T07:17:23Z

PR governance

This PR does not yet satisfy the required template fields:

Fill in Real Behavior Proof → Environment.
Fill in Real Behavior Proof → Exact command / steps.
Fill in Real Behavior Proof → Observed result.
Fill in Real Behavior Proof → Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

codecov · 2026-06-20T07:24:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions bot added the label "status: needs author action" (Pull request body or readiness checklist still needs author updates) on Jun 20, 2026

agaonker mentioned this pull request on Jun 20, 2026: feat(evals): weekly HotpotQA answer-recall report on the prose path #1188 (Open, 21 tasks)

agaonker wants to merge 1 commit into chopratejas:main from agaonker:feat/fidelity-regression-gate
