test(evals): add offline fidelity regression gate (recall-based, zero-model)#1187

Open
agaonker wants to merge 1 commit into chopratejas:main from agaonker:feat/fidelity-regression-gate

@agaonker

Contributor

Description

Headroom's lossy compression drops rows/lines using statistical heuristics but never checks that meaning survived: a dropped "OOM killed worker 3" line can silently flip a model's answer with no signal that compression caused it. The repo already ships a quality-metric toolkit (headroom/evals/metrics.py) and a weekly-suite eval job, but neither gates the compression path on a PR.

This adds a per-PR fidelity regression gate: compress vendored golden tool-outputs through SmartCrusher's lossy path and assert that the evidence answering each case's question survives. It is the first of three planned pieces of the fidelity work (the "offline gate"); query-aware retention and a hard token-budget API are documented follow-ups.

Closes #

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • Blocking gate (tests/test_compression_fidelity_regression.py): compresses each golden case via smart_crush_tool_output(..., with_compaction=False) and scores with compute_information_recall. Two assertions:
    • Per-case critical recall == 1.0 — every answer_evidence string (placed in error/anomaly rows, the documented SmartCrusher retention guarantee) must survive.
    • Aggregate recall ≥ committed baseline (baseline.json, tolerance 0.02), which catches softer regressions.
  • Vendored fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits cases.json (4 cases: OOM crash, payment exception, latency anomaly, CI failure) + baseline.json.
  • Non-blocking weekly report (.github/workflows/eval.yml): one step in the existing weekly-suite job (schedule/manual only) reuses the existing evaluate_information_retention runner for a recall report on the production routing path.
  • Pure reuse: scoring (evals/metrics.py), compressor (smart_crush_tool_output), and the weekly runner (evaluate_information_retention) all already existed.
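
A minimal sketch of the two assertions described above, for readers who want the shape of the gate without opening the test file. Everything in it is a hypothetical stand-in: `toy_lossy_compress` is not SmartCrusher, `toy_recall` is not `compute_information_recall`, the second case's evidence string is made up, and the `aggregate_recall` key and the "baseline minus tolerance" reading of the 0.02 tolerance are assumptions about baseline.json.

```python
import json

# Hypothetical stand-in for the lossy path: keep ERROR rows plus the first
# and last row, drop everything else.
def toy_lossy_compress(rows):
    keep = {0, len(rows) - 1}
    keep |= {i for i, row in enumerate(rows) if row["level"] == "ERROR"}
    return json.dumps([rows[i] for i in sorted(keep)])

# Hypothetical stand-in for the recall metric: fraction of evidence strings
# that still appear verbatim in the compressed text.
def toy_recall(evidence, compressed):
    if not evidence:
        return 1.0
    return sum(e in compressed for e in evidence) / len(evidence)

def make_case(case_id, evidence_line):
    rows = [{"level": "INFO", "msg": f"heartbeat ping {i}"} for i in range(50)]
    rows.insert(25, {"level": "ERROR", "msg": evidence_line})
    return {"id": case_id, "rows": rows, "answer_evidence": [evidence_line]}

CASES = [
    make_case("logs_oom", "OOM killed worker 3"),
    make_case("payment_exception", "PaymentDeclinedError: card expired"),  # made-up text
]
BASELINE = {"aggregate_recall": 1.0}  # assumed shape of baseline.json
TOLERANCE = 0.02

def check_gate(cases, baseline, tol=TOLERANCE):
    recalls = []
    for case in cases:
        compressed = toy_lossy_compress(case["rows"])
        recall = toy_recall(case["answer_evidence"], compressed)
        # Assertion 1: every critical evidence string must survive.
        assert recall == 1.0, f"{case['id']}: critical evidence lost"
        recalls.append(recall)
    aggregate = sum(recalls) / len(recalls)
    # Assertion 2: aggregate recall may not drop below baseline minus tolerance.
    assert aggregate >= baseline["aggregate_recall"] - tol
    return aggregate

if __name__ == "__main__":
    print(check_gate(CASES, BASELINE))
```

In this toy both assertions see the same evidence, so the aggregate check adds nothing; it only becomes informative once cases carry evidence that is not covered by the per-case guarantee.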

Design notes

  • Zero new CI setup. The blocking gate runs in the existing [dev] test shard — no new workflow, no new deps, no model, no network, no secrets (verified under HF_HUB_OFFLINE=1). It deliberately uses small hand-made structured fixtures rather than the repo's HuggingFace dataset loaders, which would require a network download + ModernBERT and don't belong in a fast PR gate.
  • Scope: structured JSON tool-output (the dominant, deterministic, model-free case). Real-dataset (HotpotQA/BFCL) recall — which needs [all] + a local model — is a documented follow-up PR, and the weekly-suite job (which genuinely runs every Monday) is its natural home.
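
For illustration, a deterministic fixture generator for small hand-made structured cases could look roughly like the sketch below. It is not the PR's _generate.py: the `SPECS` table, the question text, the row fields, and the `generate_cases`/`render` names are all hypothetical. The point is only that a locally seeded RNG plus `sort_keys=True` gives byte-identical JSON on every run, so a regenerated fixture produces no spurious diff.

```python
import json
import random

# Hypothetical case table: (case id, question, evidence line). Made-up content.
SPECS = [("logs_oom", "What killed the worker?", "OOM killed worker 3")]

def generate_cases(seed=0):
    rng = random.Random(seed)  # local RNG, so global random state cannot leak in
    cases = []
    for case_id, question, evidence in SPECS:
        rows = [
            {"level": "INFO", "msg": f"heartbeat ping {i}", "latency_ms": rng.randint(5, 20)}
            for i in range(30)
        ]
        # Place the evidence in an ERROR row at a seeded position.
        rows.insert(
            rng.randrange(1, len(rows)),
            {"level": "ERROR", "msg": evidence, "latency_ms": 0},
        )
        cases.append(
            {"id": case_id, "question": question, "rows": rows, "answer_evidence": [evidence]}
        )
    return cases

def render(cases):
    # Stable key order and a trailing newline keep the committed file diff-friendly.
    return json.dumps(cases, indent=2, sort_keys=True) + "\n"

if __name__ == "__main__":
    print(render(generate_cases()))
```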

Testing

  • Unit tests pass (pytest)
  • Linting passes (ruff check .)
  • Type checking passes (mypy headroom)
  • New tests added for new functionality
  • Manual testing performed

Test Output

$ HF_HUB_OFFLINE=1 python -m pytest tests/test_compression_fidelity_regression.py -v
tests/test_compression_fidelity_regression.py::test_critical_evidence_survives_compression[logs_oom] PASSED
tests/test_compression_fidelity_regression.py::test_critical_evidence_survives_compression[payment_exception] PASSED
tests/test_compression_fidelity_regression.py::test_critical_evidence_survives_compression[latency_anomaly] PASSED
tests/test_compression_fidelity_regression.py::test_critical_evidence_survives_compression[ci_test_failures] PASSED
tests/test_compression_fidelity_regression.py::test_aggregate_recall_not_regressed PASSED
============================== 5 passed in 0.18s ===============================

Real Behavior Proof

  • Environment: local checkout of feat/fidelity-regression-gate, pip install -e ".[dev]", HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 (proves no model/network).
  • Exact command / steps: HF_HUB_OFFLINE=1 python -m pytest tests/test_compression_fidelity_regression.py -q, which reports 5 passed in 0.14s.
  • Negative control (proves the gate has teeth): compressing logs_oom and probing for a benign row that compression legitimately drops returns recall = 0.00, lost = ['heartbeat ping 25']. The scorer therefore reports a loss whenever probed evidence is dropped, so the gate is not trivially green.
  • Weekly (non-blocking) step verified locally:
    Information retention: 50/50 cases >=0.9 recall, avg compression 65.7%
    
  • Not tested: real-dataset (HotpotQA/BFCL) recall and prose/ModernBERT compression — intentionally deferred to a follow-up PR targeting the weekly job.
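
The negative control above can be reproduced in spirit with a few lines. This is a sketch under stated assumptions: `recall_with_lost` is a hypothetical verbatim-substring scorer, not the repo's `compute_information_recall`, and the compressed text is written by hand rather than produced by SmartCrusher.

```python
def recall_with_lost(probes, compressed_text):
    """Return (recall, lost) using verbatim substring matching (assumed scoring rule)."""
    lost = [p for p in probes if p not in compressed_text]
    recall = 1.0 - len(lost) / len(probes) if probes else 1.0
    return recall, lost

# Hand-written stand-in for compressed output: the error row survives,
# the benign middle rows are gone.
compressed = "\n".join(["heartbeat ping 0", "OOM killed worker 3", "heartbeat ping 49"])

if __name__ == "__main__":
    print(recall_with_lost(["heartbeat ping 25"], compressed))    # dropped benign row
    print(recall_with_lost(["OOM killed worker 3"], compressed))  # critical evidence
```

One caveat of substring scoring: a probe such as "heartbeat ping 4" would match inside "heartbeat ping 49", so probe strings need to be distinctive enough to avoid false positives.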

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md if applicable

Additional Notes

  • CHANGELOG/version intentionally untouched: repo uses release-please.
  • Follow-up PR (planned): wire the real HotpotQA/BFCL loaders (headroom/evals/datasets.py) into the weekly-suite job for genuine benchmark-scale recall coverage (model-allowed, non-blocking). Further follow-ups from the same design: a live per-request fidelity guardrail, query-aware lossy retention, and a hard target_tokens budget API.

…-model)

Headroom's lossy compression drops rows/lines by heuristic but never checks
that meaning survived — a dropped "OOM killed worker 3" line can flip a model's
answer with no signal. This adds a per-PR gate that catches such regressions.

Blocking gate (tests/test_compression_fidelity_regression.py): compress vendored
golden tool-outputs via SmartCrusher's lossy path, then assert the evidence that
answers each case's question survives. Per-case critical recall must be 1.0
(error/anomaly rows, the documented retention guarantee); aggregate recall must
hold vs a committed baseline. Pure-stdlib scoring (reuses
headroom.evals.metrics.compute_information_recall) — no model, no network, no
secrets — so it runs in the existing [dev] shard with zero added CI setup.

Fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits
cases.json (4 cases) + baseline.json. Regenerate via the generator.

Non-blocking weekly report: one step in eval.yml's weekly-suite job reuses the
existing evaluate_information_retention runner (synthetic cases, production
routing path). Real-dataset (HotpotQA/BFCL) recall is a documented follow-up PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Contributor

PR governance

This PR does not yet satisfy the required template fields:

  • Fill in Real Behavior Proof: Environment.
  • Fill in Real Behavior Proof: Exact command / steps.
  • Fill in Real Behavior Proof: Observed result.
  • Fill in Real Behavior Proof: Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

@github-actions bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) on Jun 20, 2026
@codecov

codecov Bot commented Jun 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
