feat(evals): weekly HotpotQA answer-recall report on the prose path#1188
Open
agaonker wants to merge 2 commits into
…-model) Headroom's lossy compression drops rows/lines by heuristic but never checks that meaning survived — a dropped "OOM killed worker 3" line can flip a model's answer with no signal. This adds a per-PR gate that catches such regressions. Blocking gate (tests/test_compression_fidelity_regression.py): compress vendored golden tool-outputs via SmartCrusher's lossy path, then assert the evidence that answers each case's question survives. Per-case critical recall must be 1.0 (error/anomaly rows, the documented retention guarantee); aggregate recall must hold vs a committed baseline. Pure-stdlib scoring (reuses headroom.evals.metrics.compute_information_recall) — no model, no network, no secrets — so it runs in the existing [dev] shard with zero added CI setup. Fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits cases.json (4 cases) + baseline.json. Regenerate via the generator. Non-blocking weekly report: one step in eval.yml's weekly-suite job reuses the existing evaluate_information_retention runner (synthetic cases, production routing path). Real-dataset (HotpotQA/BFCL) recall is a documented follow-up PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up to the offline fidelity gate (chopratejas#1187). That gate is hermetic and structured-only; this adds genuine prose coverage in the model-allowed weekly job, where compression routes through Kompress (ModernBERT). Adds CompressionOnlyRunner.evaluate_dataset_recall(suite): for each QA case, compress the supporting context via the production routing path and check the ground-truth answer survives (compute_information_recall). Counts only probeable cases (answer literally present, non-trivial), so the aggregate is meaningful. Wires a non-blocking step into eval.yml's weekly-suite that drives it with load_hotpotqa(n=50); dataset/model failures warn rather than fail the job. A hermetic unit test exercises the method with synthetic JSON-array contexts (SmartCrusher/Rust, no model) so it runs in the [dev] shard. BFCL/tool-schema integrity is already covered by evaluate_tool_schema_compaction; this targets the previously-uncovered prose recall path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Description
Follow-up to #1187 (the offline fidelity gate). That gate is hermetic and structured-only (JSON tool outputs via Rust compressors) so it can block every PR with zero setup. This PR adds the genuinely-uncovered piece: prose answer-recall on a real dataset (HotpotQA) in the model-allowed weekly job, where compression routes through Kompress (ModernBERT).
Closes #
Type of Change
Changes Made
- CompressionOnlyRunner.evaluate_dataset_recall(suite): for each QA case, compress the supporting context via the production routing path (ContentRouter) and check that the ground_truth answer survives (compute_information_recall). Counts only probeable cases (answer literally present in the context and non-trivial; skips yes/no and too-short answers), so the aggregate is meaningful rather than inflated by unmeasurable cases.
- .github/workflows/eval.yml: a non-blocking step in the existing weekly-suite job (schedule/manual only) drives it with load_hotpotqa(n=50). Defensive: a dataset download or model failure emits ::warning:: and is guarded by || true, never failing the job.
- Hermetic unit test (tests/test_dataset_recall_runner.py): exercises the method with synthetic JSON-array contexts (SmartCrusher / Rust; no model, no network), so it runs in the standard [dev] shard.

Scope notes
- BFCL/tool-schema integrity is already covered by the evaluate_tool_schema_compaction eval (which runs in the PR smoke-test), so this targets the previously-uncovered prose recall path. NQ is an easy further extension using the same method + load_natural_questions.
- The weekly-suite job already installs [all] and genuinely runs every Monday (verified: 5 consecutive successful scheduled runs), so it's the correct home, keeping PR CI fast and hermetic.

Testing
- pytest
- ruff check .
- mypy headroom

Test Output
Real Behavior Proof
- feat/weekly-dataset-recall, pip install -e ".[dev]", HF_HUB_OFFLINE=1 (proves the unit test needs no model/network).
- HF_HUB_OFFLINE=1 python -m pytest tests/test_dataset_recall_runner.py -q → 2 passed in 0.20s.
- […] yes, one absent answer), evaluate_dataset_recall correctly counts only the 1 probeable case, passed=1, accuracy_rate=1.0, benchmark="dataset_recall:synthetic".
- […] weekly-suite job under the schedule || workflow_dispatch guard (confirmed via yaml.safe_load).
- […] workflow_dispatch), by design.

Review Readiness
Checklist
Additional Notes
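For reviewers, the probeable-case filter described under Changes Made can be sketched roughly as below. This is a minimal stand-in, not the code in this PR: QACase, is_probeable, dataset_recall, TRIVIAL_ANSWERS and MIN_ANSWER_CHARS are hypothetical names, the length threshold is an assumed value, and a plain substring check stands in for compute_information_recall. The compressor is passed in as a callable so the sketch needs no model or network.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumptions for illustration only; the real thresholds may differ.
TRIVIAL_ANSWERS = {"yes", "no"}
MIN_ANSWER_CHARS = 3


@dataclass
class QACase:
    """Hypothetical QA case: supporting context plus ground-truth answer."""
    context: str
    ground_truth: str


def is_probeable(case: QACase) -> bool:
    """A case is measurable only if the answer is non-trivial and
    literally present in the uncompressed context."""
    answer = case.ground_truth.strip().lower()
    if answer in TRIVIAL_ANSWERS or len(answer) < MIN_ANSWER_CHARS:
        return False
    return answer in case.context.lower()


def dataset_recall(cases: Iterable[QACase],
                   compress: Callable[[str], str]) -> dict:
    """Compress each probeable context and count surviving answers."""
    probeable = [c for c in cases if is_probeable(c)]
    passed = sum(
        1 for c in probeable
        if c.ground_truth.strip().lower() in compress(c.context).lower()
    )
    rate = passed / len(probeable) if probeable else 0.0
    return {"total": len(probeable), "passed": passed, "accuracy_rate": rate}


# Three synthetic cases: one probeable, one yes/no, one with an absent answer.
cases = [
    QACase("Marie Curie was born in Warsaw in 1867.", "Warsaw"),
    QACase("Both films were released in 1999.", "yes"),
    QACase("The river flows through several countries.", "Danube"),
]

# Identity "compressor": nothing is dropped, so the one probeable case passes.
print(dataset_recall(cases, lambda text: text))
```

Excluding the yes/no and absent-answer cases from the denominator is what keeps the rate from being inflated or deflated by cases where survival of the answer cannot be observed at all.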