test(evals): add offline fidelity regression gate (recall-based, zero-model)#1187
Open
agaonker wants to merge 1 commit into
…-model)

Headroom's lossy compression drops rows/lines by heuristic but never checks that meaning survived: a dropped "OOM killed worker 3" line can flip a model's answer with no signal. This adds a per-PR gate that catches such regressions.

Blocking gate (tests/test_compression_fidelity_regression.py): compress vendored golden tool-outputs via SmartCrusher's lossy path, then assert the evidence that answers each case's question survives. Per-case critical recall must be 1.0 (error/anomaly rows, the documented retention guarantee); aggregate recall must hold vs a committed baseline. Pure-stdlib scoring (reuses headroom.evals.metrics.compute_information_recall), with no model, no network, and no secrets, so it runs in the existing [dev] shard with zero added CI setup.

Fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits cases.json (4 cases) + baseline.json. Regenerate via the generator.

Non-blocking weekly report: one step in eval.yml's weekly-suite job reuses the existing evaluate_information_retention runner (synthetic cases, production routing path). Real-dataset (HotpotQA/BFCL) recall is a documented follow-up PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR governance
This PR does not yet satisfy the required template fields:
Please update the PR body, or move the PR back to draft while it is still in progress.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Description
Headroom's lossy compression drops rows/lines using statistical heuristics but never checks that meaning survived: a dropped "OOM killed worker 3" line can silently flip a model's answer with no signal that compression caused it. The repo already ships a quality-metric toolkit (headroom/evals/metrics.py) and a weekly-suite eval job, but neither gates the compression path on a PR.

This adds a per-PR fidelity regression gate: compress vendored golden tool-outputs through SmartCrusher's lossy path and assert the evidence that answers each case's question survives. It is the first of a planned trio (this is the "offline gate" half of the fidelity work); query-aware retention and a hard token-budget API are documented follow-ups.
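To make the idea concrete, here is a minimal, self-contained sketch of a recall-style check. It is an illustration only: `evidence_recall` and `toy_lossy_compress` are hypothetical stand-ins, not the signatures of `compute_information_recall` or `smart_crush_tool_output`, and the substring match is an assumed simplification of how evidence is scored.

```python
from __future__ import annotations


def evidence_recall(evidence: list[str], compressed: str) -> tuple[float, list[str]]:
    """Return (recall, lost): the share of evidence strings still present, and the missing ones."""
    lost = [item for item in evidence if item not in compressed]
    if not evidence:
        return 1.0, []
    return (len(evidence) - len(lost)) / len(evidence), lost


def toy_lossy_compress(lines: list[str]) -> str:
    """Hypothetical lossy compressor: keep only lines that look like errors."""
    keep = [ln for ln in lines if "killed" in ln.lower() or "error" in ln.lower()]
    return "\n".join(keep)


# A synthetic log: 50 benign heartbeat rows plus one critical row.
log = [f"heartbeat ping {i}" for i in range(50)]
log.insert(30, "OOM killed worker 3")
compressed = toy_lossy_compress(log)

# Critical evidence must survive compression.
critical_recall, critical_lost = evidence_recall(["OOM killed worker 3"], compressed)

# Probing for a benign row that was dropped shows the scorer reports losses.
benign_recall, benign_lost = evidence_recall(["heartbeat ping 25"], compressed)
```

In this toy setup the critical row is retained, while the benign probe comes back with zero recall and the dropped row listed as lost, which is the behavior a gate of this kind relies on.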
Closes #
Type of Change
Changes Made
- Blocking gate (tests/test_compression_fidelity_regression.py): compresses each golden case via smart_crush_tool_output(..., with_compaction=False) and scores with compute_information_recall. Two assertions:
  - Per-case critical recall must be 1.0: every answer_evidence string (placed in error/anomaly rows, the documented SmartCrusher retention guarantee) must survive.
  - Aggregate recall must hold vs a committed baseline (baseline.json, tol 0.02); this catches softer regressions.
- Fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits cases.json (4 cases: OOM crash, payment exception, latency anomaly, CI failure) + baseline.json.
- Non-blocking weekly report (.github/workflows/eval.yml): one step in the existing weekly-suite job (schedule/manual only) reuses the existing evaluate_information_retention runner for a recall report on the production routing path.
- The metric toolkit (evals/metrics.py), compressor (smart_crush_tool_output), and the weekly runner (evaluate_information_retention) all already existed.

Design notes
- Runs in the existing [dev] test shard: no new workflow, no new deps, no model, no network, no secrets (verified under HF_HUB_OFFLINE=1). It deliberately uses small hand-made structured fixtures rather than the repo's HuggingFace dataset loaders, which would require a network download + ModernBERT and don't belong in a fast PR gate.
- Real-dataset (HotpotQA/BFCL) recall, which needs [all] + a local model, is a documented follow-up PR, and the weekly-suite job (which genuinely runs every Monday) is its natural home.

Testing
- pytest
- ruff check .
- mypy headroom

Test Output
Real Behavior Proof
- feat/fidelity-regression-gate, pip install -e ".[dev]", HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 (proves no model/network).
- HF_HUB_OFFLINE=1 python -m pytest tests/test_compression_fidelity_regression.py -q → 5 passed in 0.14s.
- Compressing logs_oom and probing for a benign row that compression legitimately drops returns recall = 0.00, lost = ['heartbeat ping 25']; i.e. the gate fires when critical evidence is dropped, so it is not trivially green.

Review Readiness
Checklist
Additional Notes
Follow-up PR: bring the HuggingFace dataset loaders (headroom/evals/datasets.py) into the weekly-suite job for genuine benchmark-scale recall coverage (model-allowed, non-blocking). Further follow-ups from the same design: a live per-request fidelity guardrail, query-aware lossy retention, and a hard target_tokens budget API.
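For reviewers who want the gate's pass/fail rule in one place, the two assertions listed under Changes Made could be sketched roughly as below. This is a sketch under assumptions, not the test file's actual code: `gate_failures`, the case name `case_b`, and all recall numbers are hypothetical and purely illustrative; only the 0.02 tolerance and the `logs_oom` case name come from this PR.

```python
from __future__ import annotations

TOLERANCE = 0.02  # the "tol 0.02" quoted under Changes Made


def gate_failures(
    critical: dict[str, float],
    overall: dict[str, float],
    baseline: float,
    tol: float = TOLERANCE,
) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    # Assertion 1: every case must keep all of its critical evidence.
    failures = [
        f"{case}: critical recall {value:.2f} < 1.00"
        for case, value in sorted(critical.items())
        if value < 1.0
    ]
    # Assertion 2: mean recall may not fall more than `tol` below the baseline.
    aggregate = sum(overall.values()) / len(overall)
    if aggregate < baseline - tol:
        failures.append(
            f"aggregate recall {aggregate:.2f} < baseline {baseline:.2f} - {tol}"
        )
    return failures


# Illustrative numbers only (not measured results).
passing = gate_failures(
    critical={"logs_oom": 1.0, "case_b": 1.0},
    overall={"logs_oom": 0.95, "case_b": 0.97},
    baseline=0.96,
)
regressed = gate_failures(
    critical={"logs_oom": 0.0, "case_b": 1.0},
    overall={"logs_oom": 0.50, "case_b": 0.97},
    baseline=0.96,
)
```

Collecting failures into a list rather than asserting on the first one is a design choice of this sketch; it lets a single run report both a per-case loss and an aggregate drop.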