feat(evals): weekly HotpotQA answer-recall report on the prose path by agaonker · Pull Request #1188 · chopratejas/headroom

agaonker · 2026-06-20T07:30:59Z

Description

Follow-up to #1187 (the offline fidelity gate). That gate is hermetic and structured-only (JSON tool outputs via the Rust compressors), so it can block every PR with zero setup. This PR adds the piece that gate leaves uncovered: prose answer-recall on a real dataset (HotpotQA), run in the weekly job where models are allowed and compression routes through Kompress (ModernBERT).

Stacked on #1187. Until that merges, this PR's diff also shows #1187's commit; it reduces to just c71cc0cb once #1187 lands. Please review/merge #1187 first.

Closes #

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Performance improvement
Code refactoring (no functional changes)

Changes Made

CompressionOnlyRunner.evaluate_dataset_recall(suite): for each QA case, compress the supporting context via the production routing path (ContentRouter) and check that the ground_truth answer survives (compute_information_recall). Only probeable cases are counted, meaning the answer is literally present in the context and is non-trivial (yes/no and too-short answers are skipped), so the aggregate is meaningful rather than inflated by un-measurable cases.
.github/workflows/eval.yml: a non-blocking step in the existing weekly-suite job (schedule/manual only) drives it with load_hotpotqa(n=50). Defensive: a dataset download or model failure emits a ::warning:: annotation, and the step ends in || true, so it never fails the job.
Hermetic unit test (tests/test_dataset_recall_runner.py): exercises the method with synthetic JSON-array contexts (SmartCrusher / Rust — no model, no network), so it runs in the standard [dev] shard.
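
A minimal sketch of the probeable-case counting described above, not the actual implementation. The names QACase, is_probeable, dataset_recall, MIN_ANSWER_CHARS and the "total" key are hypothetical, a plain substring check stands in for compute_information_recall, and the compress callable stands in for the ContentRouter path.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Union

TRIVIAL_ANSWERS = {"yes", "no"}
MIN_ANSWER_CHARS = 3  # assumed threshold for "too short"; the real cutoff may differ


@dataclass
class QACase:
    context: str       # supporting context that gets compressed
    ground_truth: str  # expected answer string


def is_probeable(case: QACase) -> bool:
    """A case is measurable only if its answer is non-trivial and literally in the context."""
    answer = case.ground_truth.strip().lower()
    if answer in TRIVIAL_ANSWERS or len(answer) < MIN_ANSWER_CHARS:
        return False
    return answer in case.context.lower()


def dataset_recall(
    cases: Iterable[QACase], compress: Callable[[str], str]
) -> Dict[str, Union[int, float]]:
    """Compress each probeable context and count the answers that survive."""
    probeable = [c for c in cases if is_probeable(c)]
    passed = sum(
        1
        for c in probeable
        if c.ground_truth.strip().lower() in compress(c.context).lower()
    )
    rate = passed / len(probeable) if probeable else 0.0
    return {"total": len(probeable), "passed": passed, "accuracy_rate": rate}


cases = [
    # Probeable: the answer sits in an error row of a JSON-array context.
    QACase('[{"level": "info", "msg": "started"}, '
           '{"level": "error", "msg": "OOM killed worker 3"}]',
           "OOM killed worker 3"),
    # Skipped: trivial yes/no answer.
    QACase('[{"status": "ok"}]', "yes"),
    # Skipped: answer does not appear in the context.
    QACase('[{"city": "Berlin"}]', "Paris"),
]

# Identity "compressor" used purely for illustration.
result = dataset_recall(cases, compress=lambda text: text)
print(result)  # {'total': 1, 'passed': 1, 'accuracy_rate': 1.0}
```

Filtering before scoring keeps the denominator honest: a case whose answer never appeared in the context would otherwise count as a compression failure.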

Scope notes

Prose path only. BFCL / tool-schema integrity is already covered by the existing evaluate_tool_schema_compaction eval (which runs in the PR smoke test), so this targets the previously uncovered prose recall path. Natural Questions (NQ) is an easy further extension using the same method plus load_natural_questions.
Why weekly, not per-PR. Real datasets need a network download plus the ModernBERT model. The weekly-suite job already installs [all] and actually runs every Monday (verified: 5 consecutive successful scheduled runs), so it is the right home for this step and keeps PR CI fast and hermetic.
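
The non-blocking behavior of the weekly step could be sketched on the Python side as follows. This is an illustration under assumptions, not the code in eval.yml: run_non_blocking and failing_report are hypothetical names, and the only real convention used is the GitHub Actions `::warning::` workflow command.

```python
import io
import sys
from typing import Callable, TextIO


def run_non_blocking(report: Callable[[], str], out: TextIO = sys.stdout) -> int:
    """Run a report step; on any failure emit a warning annotation and still return 0."""
    try:
        out.write(report() + "\n")
    except Exception as exc:  # e.g. dataset download or model load failure
        # Flatten whitespace so the message stays on the single annotation line.
        message = " ".join(str(exc).split())
        out.write(f"::warning::dataset recall report skipped: {message}\n")
    return 0


def failing_report() -> str:
    # Hypothetical stand-in for a HotpotQA download or ModernBERT load that fails.
    raise RuntimeError("dataset download failed")


buffer = io.StringIO()
exit_code = run_non_blocking(failing_report, out=buffer)
print(exit_code, buffer.getvalue(), end="")
```

Returning 0 unconditionally plays the same role as the trailing || true in the workflow step: the failure stays visible as an annotation without turning the weekly job red.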

Testing

Unit tests pass (pytest)
Linting passes (ruff check .)
Type checking passes (mypy headroom)
New tests added for new functionality
Manual testing performed

Test Output

$ HF_HUB_OFFLINE=1 python -m pytest tests/test_dataset_recall_runner.py -q
..                                                                       [100%]
2 passed in 0.20s

Real Behavior Proof

Environment: local checkout of feat/weekly-dataset-recall, pip install -e ".[dev]", HF_HUB_OFFLINE=1 (proves the unit test needs no model/network).
Exact command / steps: HF_HUB_OFFLINE=1 python -m pytest tests/test_dataset_recall_runner.py -q → 2 passed in 0.20s.
Observed behavior of the new method: with a synthetic suite of 3 cases (one probeable answer in an error row, one trivial yes, one absent answer), evaluate_dataset_recall correctly counts only the 1 probeable case, passed=1, accuracy_rate=1.0, benchmark="dataset_recall:synthetic".
YAML validated: the new step parses and lands in the weekly-suite job under the schedule || workflow_dispatch guard (confirmed via yaml.safe_load).
Not run here: the live HotpotQA download + ModernBERT compression — exercised only by the weekly job (or workflow_dispatch), by design.

Review Readiness

I have performed a self-review
This PR is ready for human review

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have updated the CHANGELOG.md if applicable

Additional Notes

CHANGELOG/version intentionally untouched: repo uses release-please.
The weekly job can be triggered on demand via workflow_dispatch to see the HotpotQA recall numbers without waiting for Monday.

…-model)

Headroom's lossy compression drops rows/lines by heuristic but never checks that meaning survived: a dropped "OOM killed worker 3" line can flip a model's answer with no signal. This adds a per-PR gate that catches such regressions.

Blocking gate (tests/test_compression_fidelity_regression.py): compress vendored golden tool-outputs via SmartCrusher's lossy path, then assert the evidence that answers each case's question survives. Per-case critical recall must be 1.0 (error/anomaly rows, the documented retention guarantee); aggregate recall must hold vs a committed baseline. Pure-stdlib scoring (reuses headroom.evals.metrics.compute_information_recall), with no model, no network, and no secrets, so it runs in the existing [dev] shard with zero added CI setup.

Fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits cases.json (4 cases) + baseline.json. Regenerate via the generator.

Non-blocking weekly report: one step in eval.yml's weekly-suite job reuses the existing evaluate_information_retention runner (synthetic cases, production routing path). Real-dataset (HotpotQA/BFCL) recall is a documented follow-up PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Follow-up to the offline fidelity gate (chopratejas#1187). That gate is hermetic and structured-only; this adds genuine prose coverage in the model-allowed weekly job, where compression routes through Kompress (ModernBERT).

Adds CompressionOnlyRunner.evaluate_dataset_recall(suite): for each QA case, compress the supporting context via the production routing path and check the ground-truth answer survives (compute_information_recall). Counts only probeable cases (answer literally present, non-trivial), so the aggregate is meaningful.

Wires a non-blocking step into eval.yml's weekly-suite that drives it with load_hotpotqa(n=50); dataset/model failures warn rather than fail the job. A hermetic unit test exercises the method with synthetic JSON-array contexts (SmartCrusher/Rust, no model) so it runs in the [dev] shard.

BFCL/tool-schema integrity is already covered by evaluate_tool_schema_compaction; this targets the previously-uncovered prose recall path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-20T07:31:10Z

PR governance

This PR does not yet satisfy the required template fields:

Fill in Real Behavior Proof → Environment.
Fill in Real Behavior Proof → Exact command / steps.
Fill in Real Behavior Proof → Observed result.
Fill in Real Behavior Proof → Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

codecov · 2026-06-20T07:38:32Z

Codecov Report

❌ Patch coverage is 89.18919% with 4 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
headroom/evals/runners/compression_only.py	89.18%	4 Missing ⚠️


agaonker and others added 2 commits June 20, 2026 00:11

github-actions bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) · Jun 20, 2026

agaonker wants to merge 2 commits into chopratejas:main from agaonker:feat/weekly-dataset-recall
