ds4-eval: fix four answer-grader false negatives + golden self-tests #319
rinaldofesta wants to merge 3 commits
The known limitation of the multiple-choice fix (commit 2) is now tracked separately for discussion in #321: the extractor still grabs a distractor stated before the chosen letter.
Two model-free correctness fixes in the answer grader, locked by new
--self-test-extractors cases (no model/GPU needed):
- find_integer_answer: the answer-marker path took the first integer in
the window, so an answer line that shows its arithmetic
("Answer: m+n = 256+37 = 293") was graded as the first summand (256)
instead of the stated total (293) -> false negative. Reachable on the
many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound
to the answer line and, when it shows arithmetic, reads the value after
the last '='. "Final answer: 082" -> "82" and the loose fallback are
preserved.
- regrade_trace_file: every case with a MODEL_OUTPUT block was graded,
including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their
partial output inflated passed/failed and raised spurious "changed"
drift. Grading is now factored into regrade_case_outcome(), which only
grades PASSED/FAILED (and legacy status-less) traces; others are
reported via a new not_graded counter.
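The find_integer_answer change described above can be sketched as follows. This is a minimal illustration, not the ds4-eval source: the names after_answer_marker and sketch_integer_answer are hypothetical, the marker is assumed to be the literal text "answer:" (matched case-insensitively), only non-negative integers are handled, and the loose fallback is omitted.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive search for "answer:".
 * Returns a pointer just past the marker, or NULL if it is absent. */
static const char *after_answer_marker(const char *s) {
    static const char marker[] = "answer:";
    const size_t n = sizeof(marker) - 1;
    for (; *s; s++) {
        size_t i = 0;
        while (i < n && tolower((unsigned char)s[i]) == marker[i]) i++;
        if (i == n) return s + n;
    }
    return NULL;
}

/* Sketch only. Writes the answer digits (leading zeros stripped) to out
 * and returns 1, or returns 0 if no integer is found on the answer line. */
int sketch_integer_answer(const char *text, char *out, size_t outsz) {
    const char *p = after_answer_marker(text);
    if (!p || outsz < 2) return 0;

    /* Bound the scan to the answer line. */
    const char *end = p;
    while (*end && *end != '\n') end++;

    /* If the line shows arithmetic, start reading after the last '='. */
    for (const char *q = p; q < end; q++)
        if (*q == '=') p = q + 1;

    /* Take the first run of digits in the remaining span. */
    while (p < end && !isdigit((unsigned char)*p)) p++;
    if (p == end) return 0;
    const char *d = p;
    while (p < end && isdigit((unsigned char)*p)) p++;

    /* Strip leading zeros but keep a lone "0". */
    while (d + 1 < p && *d == '0') d++;

    size_t len = (size_t)(p - d);
    if (len >= outsz) return 0;
    memcpy(out, d, len);
    out[len] = '\0';
    return 1;
}
```

With this sketch, "Answer: m+n = 256+37 = 293" yields "293" rather than "256", and "Final answer: 082" still yields "82".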
Tested: make (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes; regrade of a 3-case trace
(1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0
(was passed=2 failed=1 changed=1).
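The status gate behind those counts can be sketched as below. The struct and function names are hypothetical and a legacy status-less trace is assumed to appear as a NULL or empty status string; the real regrade_case_outcome() may be shaped differently.

```c
#include <string.h>

/* Hypothetical totals mirroring the summary fields named above. */
struct regrade_totals {
    int passed;
    int failed;
    int not_graded;
};

/* Only completed runs are graded: PASSED, FAILED, or a legacy trace
 * with no recorded status (assumed here to be NULL or ""). */
static int status_is_gradable(const char *status) {
    if (!status || !*status) return 1;
    return strcmp(status, "PASSED") == 0 || strcmp(status, "FAILED") == 0;
}

/* Sketch only. regraded_correct is the result of re-running the grader
 * on the stored MODEL_OUTPUT; it is ignored for interrupted cases. */
void sketch_regrade_case(struct regrade_totals *t, const char *status,
                         int regraded_correct) {
    if (!status_is_gradable(status)) {
        t->not_graded++;   /* STOPPED, SKIPPED, SWITCHED, ERROR, ... */
        return;
    }
    if (regraded_correct)
        t->passed++;
    else
        t->failed++;
}
```

Feeding one PASSED case and two STOPPED cases through this sketch gives passed=1 and not_graded=2, matching the trace result quoted above; partial output from the interrupted cases never reaches the pass/fail totals.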
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
find_answer_letter returned the first boundary-isolated in-range capital
after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or
article was graded as the choice:
"Answer: I think it is C" -> I (should be C)
"Answer: I'll go with C." -> I (should be C)
"Answer: A careful reading ... C" -> A (should be C)
24 embedded cases are 10-choice, so this is reachable. The forward scan
now skips a standalone capital that begins a same-line word or a
contraction; the reverse scan still recovers it when it is the only
candidate ("Answer: A is correct"), and a genuine standalone answer
("Answer: I.") is unchanged.
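A rough sketch of that forward scan with its reverse fallback follows. It is an illustration under assumptions, not the PR's code: sketch_answer_letter is a hypothetical name, the caller is assumed to pass the text just after "Answer:", and the exact skip conditions (apostrophe plus letter for a contraction, space plus lowercase letter for a following word) are guesses at what the commit describes.

```c
#include <ctype.h>

/* Sketch only. line points just past the "Answer:" marker; max_letter is
 * the last valid choice ('J' for a 10-choice case). Returns the chosen
 * letter, or 0 if the line holds no candidate. */
char sketch_answer_letter(const char *line, char max_letter) {
    char skipped = 0;   /* last candidate the forward scan passed over */

    for (const char *p = line; *p && *p != '\n'; p++) {
        if (*p < 'A' || *p > max_letter) continue;

        /* Boundary-isolated: no letter or digit on either side. */
        if (p > line && isalnum((unsigned char)p[-1])) continue;
        if (isalnum((unsigned char)p[1])) continue;

        /* "I'll", "I'm": capital, apostrophe, letter. */
        int contraction = p[1] == '\'' && isalpha((unsigned char)p[2]);
        /* "I think", "A careful": capital, space, lowercase word. */
        int starts_phrase = p[1] == ' ' && islower((unsigned char)p[2]);

        if (contraction || starts_phrase) {
            skipped = *p;   /* remember it for the fallback */
            continue;
        }
        return *p;
    }

    /* Fallback: nothing else qualified, so recover the last skipped
     * capital ("A is correct" grades A). */
    return skipped;
}
```

Under these assumptions, " I think it is C" and " I'll go with C." both grade C, " A is correct" grades A through the fallback, and " I." grades I because nothing marks it as the start of a phrase.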
Known limitation: a distractor explicitly rejected *before* the chosen
letter on the same line ("... rules out C, leaving D") is still misread;
that needs sentence-level parsing and is left for discussion.
Locked by new --self-test-extractors cases (model-free). All prior
self-tests and the integer/regrade cases continue to pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves the residual multiple-choice false negative tracked in
antirez#321 (follow-up to antirez#319). find_answer_letter returned the
first boundary-isolated in-range capital on the answer line, so when the
model rejects a distractor before stating its pick the rejected letter
won:
"Answer: It is not B, the answer is D" -> B (should be D)
"Answer: rules out C, leaving D" -> C (should be D)
The forward scan now skips an in-range capital that is the object of an
explicit, adjacent rejection -- "not B", "isn't B", "rule(s/d) out C",
"eliminate/reject/except E". The cue set is deliberately small and
high-precision: it rewrites only clear elimination phrasing and makes no
attempt at general selection-vs-rejection sentence parsing, which is the
principled limit noted in antirez#321. It never looks before the answer
marker or across a newline, so a leading pick is accepted before any
later rejected distractor is inspected -- "Answer: D, not B" still grades
D, the regression that blocks the naive "take the last letter" fix.
Locked by new --self-test-extractors cases (model-free, no model/GPU),
including the "D, not B" guard. All prior extractor, integer, and regrade
self-tests continue to pass.
Tested: make ds4-eval (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes on both the Metal and CPU
binaries.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d043fdd to b2533d1
Folded the #321 resolution in as commit 3.
The ds4-eval answer grader has four reachable bugs that make it under-report model accuracy on the embedded eval set: it marks correct answers wrong. Each is fixed with a minimal change and locked by new cases in the existing model-free --self-test-extractors path (no model or GPU needed). The work is split into three commits so that the unambiguous fixes are separate from the multiple-choice fixes, which carry a design trade-off.
Closes #321.
Commit 1: integer + regrade (unambiguous)
find_integer_answer took the first integer in the answer-marker window, so an answer line that shows its arithmetic was graded as the first summand/factor: "Answer: m+n = 256+37 = 293" was graded as 256 instead of 293. Reachable on the many embedded AIME2025 "Find $m+n$ / $a+b+c$" cases. The scan is now bound to the answer line and, when the line shows arithmetic, reads the value after the last =. "Final answer: 082" -> "82" and the loose fallback are preserved.
regrade_trace_file graded every case carrying a MODEL_OUTPUT block, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious changed drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; the rest are reported via a new not_graded counter.
Commit 2: multiple choice, leading pronoun/article
find_answer_letter returned the first boundary-isolated in-range capital after Answer:, so on 10-choice (A–J) cases a leading English pronoun/article became the answer: "Answer: I think it is C" -> I, "Answer: I'll go with C." -> I, and "Answer: A careful reading ... C" -> A (all three should be C). 24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged.
Commit 3: multiple choice, distractor rejected before the pick (closes #321)
The residual case tracked in #321: when the model rejects a distractor before stating its pick on the same line, the first-valid-letter rule grabbed the rejected letter. For example, "Answer: It is not B, the answer is D" was graded B and "Answer: rules out C, leaving D" was graded C (both should be D).
The forward scan now skips an in-range capital that is the object of an explicit, adjacent rejection: not B, isn't B, rule(s/d) out C, eliminate/reject/except E. The cue set is deliberately small and high-precision: it matches only clear elimination phrasing and makes no attempt at general selection-vs-rejection sentence parsing, which is the principled limit noted in #321. It never looks before the answer marker or across a newline, so a leading pick is accepted before any later rejected distractor is inspected: "Answer: D, not B" still grades D.
That "D, not B" case is locked as a self-test; it is the regression that rules out the naive "take the last letter on the line" fix.
Testing
Each new self-test fails on main and passes with its fix (red → green). The four new multiple-choice cases include the "D, not B" guard. A crafted 3-case regrade trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 after the fix (was passed=2 failed=1 changed=1). No inference-path code is touched, so there is no speed impact. Tested on macOS / Metal (IQ2_XXS DeepSeek-V4-Flash); --self-test-extractors was also verified on the CPU build.
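For readers who want to see why the "D, not B" guard holds, the adjacent-rejection rule from commit 3 can be sketched as below. This is a hedged illustration only: the function names and the cue table are assumptions built from the phrases listed in this description, the caller is assumed to pass the text just after "Answer:", and the pronoun/article handling from commit 2 is left out to keep the sketch short.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative cue list built from the phrases named in this PR;
 * the table in the actual patch may differ. */
static const char *const rejection_cues[] = {
    "not", "isn't", "rule out", "rules out", "ruled out",
    "eliminate", "reject", "except",
};

/* True if the word(s) ending just before cand, within [line, cand),
 * match one of the cues. Never looks before line. */
static int preceded_by_rejection(const char *line, const char *cand) {
    if (cand == line || cand[-1] != ' ') return 0;
    const char *end = cand - 1;   /* the separating space */
    const size_t ncues = sizeof rejection_cues / sizeof rejection_cues[0];

    for (size_t i = 0; i < ncues; i++) {
        size_t n = strlen(rejection_cues[i]);
        if ((size_t)(end - line) < n) continue;
        const char *start = end - n;
        /* The cue must begin at a word boundary ("cannot" is not "not"). */
        if (start > line && isalnum((unsigned char)start[-1])) continue;
        size_t k = 0;
        while (k < n && tolower((unsigned char)start[k]) == rejection_cues[i][k])
            k++;
        if (k == n) return 1;
    }
    return 0;
}

/* Sketch only. Returns the first boundary-isolated in-range capital on
 * the line that is not the object of an adjacent rejection, or 0. */
char sketch_pick_letter(const char *line, char max_letter) {
    for (const char *p = line; *p && *p != '\n'; p++) {
        if (*p < 'A' || *p > max_letter) continue;
        if (p > line && isalnum((unsigned char)p[-1])) continue;
        if (isalnum((unsigned char)p[1])) continue;
        if (preceded_by_rejection(line, p)) continue;
        return *p;
    }
    return 0;
}
```

Because the scan runs left to right and only inspects the text immediately before each candidate, " D, not B" returns D as soon as D is seen, while " It is not B, the answer is D" and " rules out C, leaving D" skip the rejected letter and return D.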