
ds4-eval: fix four answer-grader false negatives + golden self-tests #319

Open: rinaldofesta wants to merge 3 commits into antirez:main from rinaldofesta:fix/eval-grader-false-negatives


@rinaldofesta rinaldofesta commented Jun 1, 2026

The ds4-eval answer grader has four reachable bugs that make it under-report model accuracy — it marks correct answers wrong — on the embedded eval set. Each is fixed with a minimal change and locked by new cases in the existing model-free --self-test-extractors path (no model or GPU needed). Split into three commits so the unambiguous fixes are separate from the multiple-choice fixes that carry a design trade-off.

Closes #321.

Commit 1 — integer + regrade (unambiguous)

find_integer_answer took the first integer in the answer-marker window, so an answer line that shows its arithmetic was graded as the first summand/factor:

"Answer: m+n = 256+37 = 293"     ->  256   (should be 293)
"Answer: a+b+c = 12+25+25 = 62"  ->  12    (should be 62)
"Answer: 2*7 + 3*6 + ... = 81"   ->  2     (should be 81)

Reachable on the many embedded AIME2025 "Find $m+n$ / $a+b+c$" cases. The scan is now bound to the answer line and, when the line shows arithmetic, reads the value after the last =. "Final answer: 082" -> "82" and the loose fallback are preserved.
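A minimal sketch of the "value after the last =" rule described above, not the actual find_integer_answer code. The helper name `sketch_answer_integer` and its calling convention (a pointer just past the answer marker) are assumptions made for illustration; the sketch ignores signs and omits the loose fallback.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the real find_integer_answer(): `p` points just
 * past an answer marker such as "Answer:". Returns 1 and stores the integer
 * in *out, or 0 if no digits are found on the answer line. */
int sketch_answer_integer(const char *p, long *out) {
    /* Bound the scan to the answer line. */
    const char *end = p + strcspn(p, "\n");

    /* If the line shows arithmetic, start reading after the last '='. */
    const char *start = p;
    for (const char *q = p; q < end; q++) {
        if (*q == '=') start = q + 1;
    }

    /* Take the first run of digits from the chosen start. */
    for (const char *q = start; q < end; q++) {
        if (isdigit((unsigned char)*q)) {
            *out = strtol(q, NULL, 10);   /* "082" parses as 82 */
            return 1;
        }
    }
    return 0;
}

/* Golden cases in the spirit of --self-test-extractors. */
void sketch_integer_self_test(void) {
    long v = 0;
    assert(sketch_answer_integer(" m+n = 256+37 = 293", &v) && v == 293);
    assert(sketch_answer_integer(" 082", &v) && v == 82);
}
```

Starting after the last `=` keeps the plain "Final answer: 082" form working, because a line with no `=` is scanned from its beginning.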

regrade_trace_file graded every case carrying a MODEL_OUTPUT block, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious changed drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; the rest are reported via a new not_graded counter.
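The status filter can be sketched roughly as follows. This is an illustration under assumptions, not the body of regrade_case_outcome(): the struct, the function names, and the representation of a status-less trace as NULL or an empty string are all hypothetical, and the changed counter is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counters; the real regrade summary may carry more fields. */
typedef struct {
    int passed;
    int failed;
    int not_graded;
} sketch_regrade_counts;

/* Only completed runs are gradable: PASSED, FAILED, or a legacy trace with
 * no status (modeled here as NULL or an empty string). */
int sketch_status_is_gradable(const char *status) {
    if (status == NULL || status[0] == '\0') return 1;
    return strcmp(status, "PASSED") == 0 || strcmp(status, "FAILED") == 0;
}

/* `regraded_ok` is the result of re-running the grader on MODEL_OUTPUT.
 * It is ignored for interrupted cases, which only bump not_graded. */
void sketch_regrade_case(sketch_regrade_counts *c, const char *status,
                         int regraded_ok) {
    if (!sketch_status_is_gradable(status)) {
        c->not_graded++;
        return;
    }
    if (regraded_ok) c->passed++;
    else c->failed++;
}

/* Mirrors the crafted 3-case trace: 1 PASSED, 2 STOPPED. */
void sketch_regrade_self_test(void) {
    sketch_regrade_counts c = {0, 0, 0};
    sketch_regrade_case(&c, "PASSED", 1);
    sketch_regrade_case(&c, "STOPPED", 1);
    sketch_regrade_case(&c, "STOPPED", 0);
    assert(c.passed == 1 && c.failed == 0 && c.not_graded == 2);
}
```

Checking the status before looking at the output means partial output from an interrupted run can never reach the pass/fail counters.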

Commit 2 — multiple choice, leading pronoun/article

find_answer_letter returned the first boundary-isolated in-range capital after Answer:, so on 10-choice (A–J) cases a leading English pronoun/article became the answer:

"Answer: I think it is C"          ->  I   (should be C)
"Answer: I'll go with C."          ->  I   (should be C)
"Answer: A careful reading ... C"  ->  A   (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged.
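One way to picture the skip rule is the sketch below. It is not the real find_answer_letter: the name `sketch_answer_letter` is hypothetical, the "begins a word" test is approximated as "followed by spaces and a lowercase letter", and the reverse scan is stood in for by remembering the last skipped candidate as a fallback.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch, not the real find_answer_letter(). `p` points just
 * past "Answer:"; choices run from 'A' to `max_letter` (e.g. 'J').
 * Returns the chosen letter, or 0 if the line has no candidate. */
char sketch_answer_letter(const char *p, char max_letter) {
    size_t len = strcspn(p, "\n");           /* stay on the answer line */
    char fallback = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)p[i];
        if (c < 'A' || c > (unsigned char)max_letter) continue;

        /* Boundary-isolated: no letter or digit on either side. */
        if (i > 0 && isalnum((unsigned char)p[i - 1])) continue;
        if (i + 1 < len && isalnum((unsigned char)p[i + 1])) continue;

        /* Contraction such as "I'll": apostrophe followed by a letter. */
        int contraction = i + 2 < len && p[i + 1] == '\'' &&
                          isalpha((unsigned char)p[i + 2]);

        /* Opener such as "I think" or "A careful": spaces, then lowercase. */
        size_t j = i + 1;
        while (j < len && p[j] == ' ') j++;
        int opener = j > i + 1 && j < len && islower((unsigned char)p[j]);

        if (contraction || opener) {
            fallback = (char)c;              /* used only if nothing else fits */
            continue;
        }
        return (char)c;
    }
    return fallback;
}

/* Golden cases taken from the examples in this description. */
void sketch_letter_self_test(void) {
    assert(sketch_answer_letter(" I think it is C", 'J') == 'C');
    assert(sketch_answer_letter(" A is correct", 'J') == 'A');
    assert(sketch_answer_letter(" I.", 'J') == 'I');
}
```

The fallback is what keeps "Answer: A is correct" grading as A: the letter is skipped on the forward pass but returned when no better candidate appears on the line.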

Commit 3 — multiple choice, distractor rejected before the pick (closes #321)

The residual case tracked in #321: when the model rejects a distractor before stating its pick on the same line, the first-valid-letter rule grabbed the rejected letter.

"Answer: It is not B, the answer is D"   ->  B   (should be D)
"Answer: rules out C, leaving D"         ->  C   (should be D)

The forward scan now skips an in-range capital that is the object of an explicit, adjacent rejection: not B, isn't B, rule(s/d) out C, eliminate/reject/except E. The cue set is deliberately small and high-precision: it matches only clear elimination phrasing and makes no attempt at general selection-vs-rejection sentence parsing, which is the principled limit noted in #321. The scan never looks before the answer marker or across a newline, so a leading pick is accepted before any later rejected distractor is inspected:

"Answer: D, not B"   ->  D   (still correct)

That "D, not B" case is locked as a self-test — it is the regression that rules out the naive "take the last letter on the line" fix.
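A rough sketch of the adjacent-rejection check, assuming a small fixed cue table and a case-insensitive match on the text immediately before the candidate letter. The names `SKETCH_REJECT_CUES`, `sketch_is_rejected`, and `sketch_pick_letter` are hypothetical, and the pronoun/article rule from commit 2 is left out to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical cue list modeled on the phrases named in this description. */
static const char *const SKETCH_REJECT_CUES[] = {
    "not", "isn't", "rule out", "rules out", "ruled out",
    "eliminate", "reject", "except",
};

/* True if line[pos] is directly preceded by "<cue> ", with the cue starting
 * at a word boundary, e.g. "... is not " right before 'B'. */
static int sketch_is_rejected(const char *line, size_t pos) {
    if (pos == 0 || line[pos - 1] != ' ') return 0;
    size_t end = pos - 1;                    /* index one past the cue */
    size_t n = sizeof SKETCH_REJECT_CUES / sizeof SKETCH_REJECT_CUES[0];
    for (size_t k = 0; k < n; k++) {
        const char *cue = SKETCH_REJECT_CUES[k];
        size_t cl = strlen(cue);
        if (cl > end) continue;              /* not enough text before pos */
        size_t s = end - cl;
        if (s > 0 && isalpha((unsigned char)line[s - 1])) continue;
        size_t m = 0;
        while (m < cl && tolower((unsigned char)line[s + m]) == cue[m]) m++;
        if (m == cl) return 1;
    }
    return 0;
}

/* Forward scan over the answer line only; `p` points just past "Answer:". */
char sketch_pick_letter(const char *p, char max_letter) {
    size_t len = strcspn(p, "\n");
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)p[i];
        if (c < 'A' || c > (unsigned char)max_letter) continue;
        if (i > 0 && isalnum((unsigned char)p[i - 1])) continue;
        if (i + 1 < len && isalnum((unsigned char)p[i + 1])) continue;
        if (sketch_is_rejected(p, i)) continue;   /* skip "not B" etc. */
        return (char)c;
    }
    return 0;
}

/* Includes the "D, not B" guard: the leading pick wins. */
void sketch_pick_self_test(void) {
    assert(sketch_pick_letter(" rules out C, leaving D", 'J') == 'D');
    assert(sketch_pick_letter(" D, not B", 'J') == 'D');
}
```

Because the scan runs forward and returns at the first accepted letter, "D, not B" yields D before the rejected B is ever examined, which is the behavior a "take the last letter" rule would break.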

Testing

make                              # Metal: ds4_eval.c compiles clean with -Wall -Wextra
make cpu                          # builds clean (-Wall -Wextra)
./ds4-eval --self-test-extractors # passes (extractor + integer + regrade cases)

Each new self-test fails on main and passes with its fix (red → green). The four new multiple-choice cases include the "D, not B" guard. A crafted 3-case regrade trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 after the fix (was passed=2 failed=1 changed=1). No inference-path code is touched, so there is no speed impact. Tested on macOS / Metal (IQ2_XXS DeepSeek-V4-Flash); --self-test-extractors also verified on the CPU build.

@rinaldofesta
Author

The known limitation of the multiple-choice fix (commit 2) is now tracked separately for discussion in #321: the extractor still grabs a distractor stated before the chosen letter ("Answer: rules out C, leaving D" -> C). I kept it out of this PR because every naive fix trades one failure mode for another; happy to fold a resolution in here if you'd prefer.

rinaldofesta and others added 3 commits June 8, 2026 07:57
Two model-free correctness fixes in the answer grader, locked by new
--self-test-extractors cases (no model/GPU needed):

- find_integer_answer: the answer-marker path took the first integer in
  the window, so an answer line that shows its arithmetic
  ("Answer: m+n = 256+37 = 293") was graded as the first summand (256)
  instead of the stated total (293) -> false negative. Reachable on the
  many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound
  to the answer line and, when it shows arithmetic, reads the value after
  the last '='. "Final answer: 082" -> "82" and the loose fallback are
  preserved.

- regrade_trace_file: every case with a MODEL_OUTPUT block was graded,
  including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their
  partial output inflated passed/failed and raised spurious "changed"
  drift. Grading is now factored into regrade_case_outcome(), which only
  grades PASSED/FAILED (and legacy status-less) traces; others are
  reported via a new not_graded counter.

Tested: make (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes; regrade of a 3-case trace
(1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0
(was passed=2 failed=1 changed=1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

find_answer_letter returned the first boundary-isolated in-range capital
after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or
article was graded as the choice:

    "Answer: I think it is C"        -> I   (should be C)
    "Answer: I'll go with C."        -> I   (should be C)
    "Answer: A careful reading ... C" -> A   (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan
now skips a standalone capital that begins a same-line word or a
contraction; the reverse scan still recovers it when it is the only
candidate ("Answer: A is correct"), and a genuine standalone answer
("Answer: I.") is unchanged.

Known limitation: a distractor explicitly rejected *before* the chosen
letter on the same line ("... rules out C, leaving D") is still misread;
that needs sentence-level parsing and is left for discussion.

Locked by new --self-test-extractors cases (model-free). All prior
self-tests and the integer/regrade cases continue to pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Resolves the residual multiple-choice false negative tracked in antirez#321
(follow-up to antirez#319). find_answer_letter returned the first
boundary-isolated in-range capital on the answer line, so when the model
rejects a distractor before stating its pick the rejected letter won:

    "Answer: It is not B, the answer is D"   -> B   (should be D)
    "Answer: rules out C, leaving D"         -> C   (should be D)

The forward scan now skips an in-range capital that is the object of an
explicit, adjacent rejection -- "not B", "isn't B", "rule(s/d) out C",
"eliminate/reject/except E". The cue set is deliberately small and
high-precision: it matches only clear elimination phrasing and makes no
attempt at general selection-vs-rejection sentence parsing, which is the
principled limit noted in antirez#321. It never looks before the answer marker
or across a newline, so a leading pick is accepted before any later
rejected distractor is inspected -- "Answer: D, not B" still grades D,
the regression that blocks the naive "take the last letter" fix.

Locked by new --self-test-extractors cases (model-free, no model/GPU),
including the "D, not B" guard. All prior extractor, integer, and
regrade self-tests continue to pass.

Tested: make ds4-eval (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes on both the Metal and CPU
binaries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@rinaldofesta rinaldofesta force-pushed the fix/eval-grader-false-negatives branch from d043fdd to b2533d1 on June 8, 2026 05:57
@rinaldofesta rinaldofesta changed the title from "ds4-eval: fix three answer-grader false negatives + golden self-tests" to "ds4-eval: fix four answer-grader false negatives + golden self-tests" on Jun 8, 2026
@rinaldofesta
Author

Folded the #321 resolution in as commit 3 (b2533d1) and rebased the branch on latest main (clean — the new ds4_eval.c changes are in option parsing / main(), no overlap with the extractor). The forward scan now skips an in-range capital that is the object of an explicit adjacent rejection (not B, isn't B, rule(s/d) out C); "Answer: D, not B" -> D is locked as a self-test so the leading pick still wins. Cue set kept intentionally small — no general sentence parsing. ./ds4-eval --self-test-extractors passes (Metal + CPU). Happy to drop the commit back out if you'd rather keep the lexical extractor strictly negation-free.



Development

Successfully merging this pull request may close these issues.

ds4-eval: answer-letter extractor mis-grades a distractor stated before the chosen option (follow-up to #319)
