feat(evals): weekly HotpotQA answer-recall report on the prose path #1188

Open
agaonker wants to merge 2 commits into chopratejas:main from agaonker:feat/weekly-dataset-recall

@agaonker

Contributor

Description

Follow-up to #1187 (the offline fidelity gate). That gate is hermetic and structured-only (JSON tool outputs via Rust compressors) so it can block every PR with zero setup. This PR adds the genuinely-uncovered piece: prose answer-recall on a real dataset (HotpotQA) in the model-allowed weekly job, where compression routes through Kompress (ModernBERT).

Stacked on #1187. Until that merges, this PR's diff shows its commit too; it reduces to just c71cc0cb once #1187 lands. Please review/merge #1187 first.

Closes #

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • CompressionOnlyRunner.evaluate_dataset_recall(suite): for each QA case, compress the supporting context via the production routing path (ContentRouter) and check the ground_truth answer survives (compute_information_recall). Counts only probeable cases — answer literally present in the context and non-trivial (skips yes/no, too-short) — so the aggregate is meaningful rather than inflated by un-measurable cases.
  • .github/workflows/eval.yml: a non-blocking step in the existing weekly-suite job (schedule/manual only) drives it with load_hotpotqa(n=50). Defensive: a dataset download or model failure emits a ::warning:: annotation and the step ends in || true, so it never fails the job.
  • Hermetic unit test (tests/test_dataset_recall_runner.py): exercises the method with synthetic JSON-array contexts (SmartCrusher / Rust — no model, no network), so it runs in the standard [dev] shard.
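
The probeable-case filter and aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `QACase`, `is_probeable`, `substring_recall`, `dataset_recall`, the `min_len` threshold, and the `compress` callable are all hypothetical stand-ins. The real method routes through ContentRouter and scores with compute_information_recall, whose signatures are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

TRIVIAL_ANSWERS = {"yes", "no"}  # assumption: answers treated as un-measurable


@dataclass
class QACase:
    """Hypothetical QA case: supporting context plus the expected answer."""
    context: str
    ground_truth: str


def is_probeable(case: QACase, min_len: int = 3) -> bool:
    """True if the answer is non-trivial and literally present in the context."""
    answer = case.ground_truth.strip().lower()
    if answer in TRIVIAL_ANSWERS or len(answer) < min_len:
        return False
    return answer in case.context.lower()


def substring_recall(answer: str, compressed: str) -> float:
    """Stand-in scorer: 1.0 if the answer survives verbatim, else 0.0."""
    return 1.0 if answer.strip().lower() in compressed.lower() else 0.0


def dataset_recall(cases: Iterable[QACase], compress: Callable[[str], str]) -> dict:
    """Compress each probeable case's context and count surviving answers."""
    probeable = [c for c in cases if is_probeable(c)]
    passed = sum(
        substring_recall(c.ground_truth, compress(c.context)) == 1.0
        for c in probeable
    )
    total = len(probeable)
    return {
        "total": total,
        "passed": passed,
        # Un-probeable cases are excluded from the denominator entirely.
        "accuracy_rate": passed / total if total else 0.0,
    }
```

Excluding un-probeable cases from the denominator is what keeps the rate meaningful: a yes/no answer or an answer absent from the context would otherwise count as a pass or a fail regardless of what the compressor did.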

Scope notes

  • Prose path only. BFCL / tool-schema integrity is already covered by the existing evaluate_tool_schema_compaction eval (which runs in the PR smoke-test), so this targets the previously-uncovered prose recall path. NQ is an easy further extension using the same method + load_natural_questions.
  • Why weekly, not per-PR. Real datasets need a network download + the ModernBERT model. The weekly-suite job already installs [all] and genuinely runs every Monday (verified: 5 consecutive successful scheduled runs), so it's the correct home — keeping PR CI fast and hermetic.
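
For reviewers who want the non-blocking behaviour spelled out: the workflow step does this in shell with ::warning:: and || true. The same idea, expressed in Python as a sketch only, looks like the following. `run_non_blocking` and `hotpotqa_report` are hypothetical names, and the report function is a placeholder that simulates a failed download rather than calling the real loader.

```python
from typing import Callable


def run_non_blocking(report: Callable[[], str]) -> int:
    """Run a report; on any failure print a GitHub Actions warning and return 0."""
    try:
        print(report())
    except Exception as exc:  # broad on purpose: download, model, or import errors
        # "::warning::<message>" is the GitHub Actions workflow command that
        # surfaces a warning annotation on the run.
        print(f"::warning::dataset recall report skipped: {exc!r}")
    return 0  # always succeed, so this step can never fail the weekly job


def hotpotqa_report() -> str:
    # Placeholder for the real work (load_hotpotqa(n=50) followed by
    # evaluate_dataset_recall); raises to simulate an unavailable dataset.
    raise RuntimeError("dataset unavailable")


exit_code = run_non_blocking(hotpotqa_report)
```

The trade-off is deliberate: a flaky network or model download shows up as a warning in the run summary instead of a red weekly job, at the cost of a silent gap in the recall numbers for that week.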

Testing

  • Unit tests pass (pytest)
  • Linting passes (ruff check .)
  • Type checking passes (mypy headroom)
  • New tests added for new functionality
  • Manual testing performed

Test Output

$ HF_HUB_OFFLINE=1 python -m pytest tests/test_dataset_recall_runner.py -q
..                                                                       [100%]
2 passed in 0.20s

Real Behavior Proof

  • Environment: local checkout of feat/weekly-dataset-recall, pip install -e ".[dev]", HF_HUB_OFFLINE=1 (proves the unit test needs no model/network).
  • Exact command / steps: HF_HUB_OFFLINE=1 python -m pytest tests/test_dataset_recall_runner.py -q, which reports 2 passed in 0.20s.
  • Observed behavior of the new method: with a synthetic suite of 3 cases (one probeable answer in an error row, one trivial yes, one absent answer), evaluate_dataset_recall correctly counts only the 1 probeable case, passed=1, accuracy_rate=1.0, benchmark="dataset_recall:synthetic".
  • YAML validated: the new step parses and lands in the weekly-suite job under the schedule || workflow_dispatch guard (confirmed via yaml.safe_load).
  • Not run here: the live HotpotQA download + ModernBERT compression — exercised only by the weekly job (or workflow_dispatch), by design.

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md if applicable

Additional Notes

  • CHANGELOG/version intentionally untouched: repo uses release-please.
  • The weekly job can be triggered on demand via workflow_dispatch to see the HotpotQA recall numbers without waiting for Monday.

agaonker and others added 2 commits June 20, 2026 00:11
…-model)

Headroom's lossy compression drops rows/lines by heuristic but never checks
that meaning survived — a dropped "OOM killed worker 3" line can flip a model's
answer with no signal. This adds a per-PR gate that catches such regressions.

Blocking gate (tests/test_compression_fidelity_regression.py): compress vendored
golden tool-outputs via SmartCrusher's lossy path, then assert the evidence that
answers each case's question survives. Per-case critical recall must be 1.0
(error/anomaly rows, the documented retention guarantee); aggregate recall must
hold vs a committed baseline. Pure-stdlib scoring (reuses
headroom.evals.metrics.compute_information_recall) — no model, no network, no
secrets — so it runs in the existing [dev] shard with zero added CI setup.
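
A rough sketch of the two assertions this gate makes, under stated assumptions: `check_fidelity_gate`, its parameters, and the `tolerance` knob are hypothetical and only illustrate the per-case and aggregate rules described above, not the actual test in tests/test_compression_fidelity_regression.py.

```python
from typing import Dict, List


def check_fidelity_gate(
    critical_recalls: Dict[str, float],
    aggregate_recall: float,
    baseline_recall: float,
    tolerance: float = 0.0,  # assumption: allowed drop below the baseline
) -> List[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    # Rule 1: every case's critical evidence must survive in full.
    for case_id, recall in critical_recalls.items():
        if recall < 1.0:
            failures.append(f"{case_id}: critical recall {recall:.2f} < 1.00")
    # Rule 2: aggregate recall must not regress against the committed baseline.
    if aggregate_recall < baseline_recall - tolerance:
        failures.append(
            f"aggregate recall {aggregate_recall:.2f} below baseline {baseline_recall:.2f}"
        )
    return failures
```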

Fixtures (tests/fixtures/fidelity_golden/): deterministic _generate.py emits
cases.json (4 cases) + baseline.json. Regenerate via the generator.

Non-blocking weekly report: one step in eval.yml's weekly-suite job reuses the
existing evaluate_information_retention runner (synthetic cases, production
routing path). Real-dataset (HotpotQA/BFCL) recall is a documented follow-up PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up to the offline fidelity gate (chopratejas#1187). That gate is hermetic and
structured-only; this adds genuine prose coverage in the model-allowed weekly
job, where compression routes through Kompress (ModernBERT).

Adds CompressionOnlyRunner.evaluate_dataset_recall(suite): for each QA case,
compress the supporting context via the production routing path and check the
ground-truth answer survives (compute_information_recall). Counts only probeable
cases (answer literally present, non-trivial), so the aggregate is meaningful.

Wires a non-blocking step into eval.yml's weekly-suite that drives it with
load_hotpotqa(n=50); dataset/model failures warn rather than fail the job.
A hermetic unit test exercises the method with synthetic JSON-array contexts
(SmartCrusher/Rust, no model) so it runs in the [dev] shard.

BFCL/tool-schema integrity is already covered by evaluate_tool_schema_compaction;
this targets the previously-uncovered prose recall path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Contributor

PR governance

This PR does not yet satisfy the required template fields:

  • Fill in Real Behavior Proof: Environment.
  • Fill in Real Behavior Proof: Exact command / steps.
  • Fill in Real Behavior Proof: Observed result.
  • Fill in Real Behavior Proof: Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

@github-actions github-actions Bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) Jun 20, 2026
@codecov

codecov Bot commented Jun 20, 2026

Codecov Report

❌ Patch coverage is 89.18919% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| headroom/evals/runners/compression_only.py | 89.18% | 4 Missing ⚠️ |
