ds4-eval: fix four answer-grader false negatives + golden self-tests by rinaldofesta · Pull Request #319 · antirez/ds4

rinaldofesta · 2026-06-01T14:37:14Z

The ds4-eval answer grader has four reachable bugs that make it under-report model accuracy — it marks correct answers wrong — on the embedded eval set. Each is fixed with a minimal change and locked by new cases in the existing model-free --self-test-extractors path (no model or GPU needed). Split into three commits so the unambiguous fixes are separate from the multiple-choice fixes that carry a design trade-off.

Closes #321.

Commit 1 — integer + regrade (unambiguous)

find_integer_answer took the first integer in the answer-marker window, so an answer line that shows its arithmetic was graded as the first summand/factor:

"Answer: m+n = 256+37 = 293"     ->  256   (should be 293)
"Answer: a+b+c = 12+25+25 = 62"  ->  12    (should be 62)
"Answer: 2*7 + 3*6 + ... = 81"   ->  2     (should be 81)

Reachable on the many embedded AIME2025 "Find $m+n$ / $a+b+c$" cases. The scan is now bound to the answer line and, when the line shows arithmetic, reads the value after the last =. "Final answer: 082" -> "82" and the loose fallback are preserved.
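The line-bound, last-`=` rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in ds4_eval.c: `sketch_after_marker` and `sketch_integer_answer` are hypothetical names, the marker match is a simplified case-insensitive search for "answer:", only non-negative integers are handled, and the loose fallback is omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive search for "answer:". Returns a
 * pointer just past the marker, or NULL when the text has no marker. */
static const char *sketch_after_marker(const char *text) {
    static const char marker[] = "answer:";
    const size_t n = sizeof(marker) - 1;
    for (const char *p = text; *p; p++) {
        size_t i = 0;
        while (i < n && p[i] && tolower((unsigned char)p[i]) == marker[i]) i++;
        if (i == n) return p + n;
    }
    return NULL;
}

/* Sketch only (assumed behavior). Writes the normalized integer to out and
 * returns 1, or returns 0 when no integer is found on the answer line. */
static int sketch_integer_answer(const char *text, char *out, size_t out_len) {
    assert(text != NULL && out != NULL && out_len > 1);
    const char *start = sketch_after_marker(text);
    if (!start) return 0;

    /* Bind the scan to the answer line. */
    const char *end = start + strcspn(start, "\r\n");

    /* If the line shows arithmetic, read the value after the last '='. */
    for (const char *p = end; p > start; p--) {
        if (p[-1] == '=') { start = p; break; }
    }

    /* Take the first digit run in the remaining window. */
    const char *d = start;
    while (d < end && !isdigit((unsigned char)*d)) d++;
    if (d == end) return 0;
    const char *e = d;
    while (e < end && isdigit((unsigned char)*e)) e++;

    /* Strip leading zeros ("082" -> "82") but keep a lone "0". */
    while (e - d > 1 && *d == '0') d++;

    size_t len = (size_t)(e - d);
    if (len >= out_len) return 0;
    memcpy(out, d, len);
    out[len] = '\0';
    return 1;
}
```

Scanning backward for the last `=` means a line with no arithmetic takes the ordinary first-integer path, which is why "Final answer: 082" still normalizes to "82" in this sketch.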

regrade_trace_file graded every case carrying a MODEL_OUTPUT block, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious changed drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; the rest are reported via a new not_graded counter.
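A minimal sketch of the status gate, assuming the trace status is available as a string. The status names come from the description above; `sketch_counts`, `sketch_is_gradable`, and `sketch_tally` are hypothetical, the real regrade_case_outcome() signature is not shown in this PR text, and the `changed` counter is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tally; the real counters live in ds4_eval.c. */
typedef struct {
    int passed;
    int failed;
    int not_graded;
} sketch_counts;

/* Only completed traces are regraded. A NULL or empty status stands in for
 * a legacy trace that carries no status (an assumption of this sketch). */
static int sketch_is_gradable(const char *status) {
    if (status == NULL || *status == '\0') return 1;
    return strcmp(status, "PASSED") == 0 || strcmp(status, "FAILED") == 0;
}

/* now_correct is the result of re-running the grader on MODEL_OUTPUT.
 * Interrupted traces never reach passed/failed. */
static void sketch_tally(sketch_counts *c, const char *status, int now_correct) {
    assert(c != NULL);
    if (!sketch_is_gradable(status)) {
        c->not_graded++;
        return;
    }
    if (now_correct) c->passed++;
    else c->failed++;
}
```

With one PASSED and two STOPPED traces, this tally yields passed=1 and not_graded=2 regardless of what the partial output of the interrupted cases would have graded to.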

Commit 2 — multiple choice, leading pronoun/article

find_answer_letter returned the first boundary-isolated in-range capital after Answer:, so on 10-choice (A–J) cases a leading English pronoun/article became the answer:

"Answer: I think it is C"          ->  I   (should be C)
"Answer: I'll go with C."          ->  I   (should be C)
"Answer: A careful reading ... C"  ->  A   (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged.
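One possible reading of that skip rule, sketched under assumptions: a boundary-isolated capital is treated as English when it is followed by a space and a lowercase letter ("I think", "A careful") or by an apostrophe and a letter ("I'll"). The `sketch_` names are hypothetical, the input is the text after the answer marker, and the reverse pass here simply returns the last isolated in-range capital on the line, which may differ from the actual implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A capital is "isolated" when neither neighbor is alphanumeric. */
static int sketch_isolated(const char *line, size_t i, size_t len) {
    int left_ok = (i == 0) || !isalnum((unsigned char)line[i - 1]);
    int right_ok = (i + 1 >= len) || !isalnum((unsigned char)line[i + 1]);
    return left_ok && right_ok;
}

/* Assumed skip rule: "X lowercase..." or "X'll"-style contraction. */
static int sketch_reads_as_word(const char *line, size_t i, size_t len) {
    if (i + 2 >= len) return 0;
    if (line[i + 1] == ' ' && islower((unsigned char)line[i + 2])) return 1;
    if (line[i + 1] == '\'' && isalpha((unsigned char)line[i + 2])) return 1;
    return 0;
}

/* after_marker: text following "Answer:". max_letter: 'D' for 4-choice,
 * 'J' for 10-choice. Returns the chosen letter, or 0 when none is found. */
static char sketch_answer_letter(const char *after_marker, char max_letter) {
    assert(after_marker != NULL && max_letter >= 'A' && max_letter <= 'Z');
    size_t len = strcspn(after_marker, "\r\n"); /* stay on the answer line */

    /* Forward scan: first isolated capital that does not read as a word. */
    for (size_t i = 0; i < len; i++) {
        char c = after_marker[i];
        if (c < 'A' || c > max_letter) continue;
        if (!sketch_isolated(after_marker, i, len)) continue;
        if (sketch_reads_as_word(after_marker, i, len)) continue;
        return c;
    }

    /* Reverse scan: recover a skipped capital when nothing else matched,
     * as in "Answer: A is correct". */
    for (size_t i = len; i-- > 0;) {
        char c = after_marker[i];
        if (c >= 'A' && c <= max_letter && sketch_isolated(after_marker, i, len))
            return c;
    }
    return 0;
}
```

In this sketch "I." is not skipped because the capital is followed by punctuation rather than a lowercase word, so a genuine standalone answer still wins in the forward pass.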

Commit 3 — multiple choice, distractor rejected before the pick (closes #321)

The residual case tracked in #321: when the model rejects a distractor before stating its pick on the same line, the first-valid-letter rule grabbed the rejected letter.

"Answer: It is not B, the answer is D"   ->  B   (should be D)
"Answer: rules out C, leaving D"         ->  C   (should be D)

The forward scan now skips an in-range capital that is the object of an explicit, adjacent rejection: not B, isn't B, rule(s/d) out C, eliminate/reject/except E. The cue set is deliberately small and high-precision: it matches only clear elimination phrasing and makes no attempt at general selection-vs-rejection sentence parsing, which is the principled limit noted in #321. It never looks before the answer marker or across a newline, so a leading pick is accepted before any later rejected distractor is inspected:

"Answer: D, not B"   ->  D   (still correct)

That "D, not B" case is locked as a self-test — it is the regression that rules out the naive "take the last letter on the line" fix.
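The adjacent-rejection check can be sketched as a look-behind from each candidate letter. This is an assumption-laden illustration: the cue list is copied from the phrases named in this description, the exact matching rules in ds4_eval.c are not shown here, the `sketch_` names are hypothetical, and the pronoun/article skip and reverse scan from commit 2 are left out to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 when the text immediately before line[i] is a rejection cue
 * that starts on a word boundary (so "knot B" does not match "not "). */
static int sketch_preceded_by_rejection(const char *line, size_t i) {
    static const char *const cues[] = {
        "not ", "isn't ", "rule out ", "rules out ", "ruled out ",
        "eliminate ", "reject ", "except "
    };
    for (size_t k = 0; k < sizeof cues / sizeof cues[0]; k++) {
        size_t n = strlen(cues[k]);
        if (i < n) continue; /* not enough text before the letter */
        if (i > n && isalnum((unsigned char)line[i - n - 1])) continue;
        size_t j = 0;
        while (j < n && tolower((unsigned char)line[i - n + j]) == cues[k][j]) j++;
        if (j == n) return 1;
    }
    return 0;
}

/* after_marker: text following "Answer:". Returns the first isolated
 * in-range capital that is not the object of a rejection cue, or 0. */
static char sketch_pick_letter(const char *after_marker, char max_letter) {
    assert(after_marker != NULL && max_letter >= 'A' && max_letter <= 'Z');
    size_t len = strcspn(after_marker, "\r\n"); /* never cross a newline */
    for (size_t i = 0; i < len; i++) {
        char c = after_marker[i];
        if (c < 'A' || c > max_letter) continue;
        if (i > 0 && isalnum((unsigned char)after_marker[i - 1])) continue;
        if (i + 1 < len && isalnum((unsigned char)after_marker[i + 1])) continue;
        if (sketch_preceded_by_rejection(after_marker, i)) continue;
        return c;
    }
    return 0;
}
```

Because the check only looks backward from a candidate, a leading pick such as "D, not B" is returned before the rejected letter is ever examined, which is the property the "D, not B" self-test guards.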

Testing

make                              # Metal: ds4_eval.c compiles clean with -Wall -Wextra
make cpu                          # builds clean (-Wall -Wextra)
./ds4-eval --self-test-extractors # passes (extractor + integer + regrade cases)

Each new self-test fails on main and passes with its fix (red → green). The four new multiple-choice cases include the "D, not B" guard. A crafted 3-case regrade trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 after the fix (was passed=2 failed=1 changed=1). No inference-path code is touched, so there is no speed impact. Tested on macOS / Metal (IQ2_XXS DeepSeek-V4-Flash); --self-test-extractors also verified on the CPU build.

rinaldofesta · 2026-06-01T14:43:19Z

The known limitation of the multiple-choice fix (commit 2) is now tracked separately for discussion: #321 — the extractor still grabs a distractor stated before the chosen letter ("Answer: rules out C, leaving D" → C). Kept out of this PR because every naive fix trades one failure mode for another; happy to fold a resolution in here if you'd prefer.

Two model-free correctness fixes in the answer grader, locked by new --self-test-extractors cases (no model/GPU needed):

- find_integer_answer: the answer-marker path took the first integer in the window, so an answer line that shows its arithmetic ("Answer: m+n = 256+37 = 293") was graded as the first summand (256) instead of the stated total (293) -> false negative. Reachable on the many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound to the answer line and, when it shows arithmetic, reads the value after the last '='. "Final answer: 082" -> "82" and the loose fallback are preserved.
- regrade_trace_file: every case with a MODEL_OUTPUT block was graded, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious "changed" drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; others are reported via a new not_graded counter.

Tested: make (Metal) and make cpu build clean (-Wall -Wextra); ./ds4-eval --self-test-extractors passes; regrade of a 3-case trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 (was passed=2 failed=1 changed=1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

find_answer_letter returned the first boundary-isolated in-range capital after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or article was graded as the choice:

"Answer: I think it is C"          -> I (should be C)
"Answer: I'll go with C."          -> I (should be C)
"Answer: A careful reading ... C"  -> A (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged.

Known limitation: a distractor explicitly rejected *before* the chosen letter on the same line ("... rules out C, leaving D") is still misread; that needs sentence-level parsing and is left for discussion.

Locked by new --self-test-extractors cases (model-free). All prior self-tests and the integer/regrade cases continue to pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Resolves the residual multiple-choice false negative tracked in antirez#321 (follow-up to antirez#319).

find_answer_letter returned the first boundary-isolated in-range capital on the answer line, so when the model rejects a distractor before stating its pick the rejected letter won:

"Answer: It is not B, the answer is D"  -> B (should be D)
"Answer: rules out C, leaving D"        -> C (should be D)

The forward scan now skips an in-range capital that is the object of an explicit, adjacent rejection -- "not B", "isn't B", "rule(s/d) out C", "eliminate/reject/except E". The cue set is deliberately small and high-precision: it rewrites only clear elimination phrasing and makes no attempt at general selection-vs-rejection sentence parsing, which is the principled limit noted in antirez#321. It never looks before the answer marker or across a newline, so a leading pick is accepted before any later rejected distractor is inspected -- "Answer: D, not B" still grades D, the regression that blocks the naive "take the last letter" fix.

Locked by new --self-test-extractors cases (model-free, no model/GPU), including the "D, not B" guard. All prior extractor, integer, and regrade self-tests continue to pass.

Tested: make ds4-eval (Metal) and make cpu build clean (-Wall -Wextra); ./ds4-eval --self-test-extractors passes on both the Metal and CPU binaries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

rinaldofesta · 2026-06-08T05:59:00Z

Folded the #321 resolution in as commit 3 (b2533d1) and rebased the branch on latest main (clean — the new ds4_eval.c changes are in option parsing / main(), no overlap with the extractor). The forward scan now skips an in-range capital that is the object of an explicit adjacent rejection (not B, isn't B, rule(s/d) out C); "Answer: D, not B" -> D is locked as a self-test so the leading pick still wins. Cue set kept intentionally small — no general sentence parsing. ./ds4-eval --self-test-extractors passes (Metal + CPU). Happy to drop the commit back out if you'd rather keep the lexical extractor strictly negation-free.

rinaldofesta mentioned this pull request Jun 1, 2026

ds4-eval: answer-letter extractor mis-grades a distractor stated before the chosen option (follow-up to #319) #321

Open

rinaldofesta and others added 3 commits June 8, 2026 07:57

rinaldofesta force-pushed the fix/eval-grader-false-negatives branch from d043fdd to b2533d1 Compare June 8, 2026 05:57

rinaldofesta changed the title ~~ds4-eval: fix three answer-grader false negatives + golden self-tests~~ ds4-eval: fix four answer-grader false negatives + golden self-tests Jun 8, 2026

rinaldofesta wants to merge 3 commits into antirez:main from rinaldofesta:fix/eval-grader-false-negatives
