Fix/bench retry bugs#2384

Closed
kali wants to merge 4 commits into main from fix/bench-retry-bugs

kali commented Jun 17, 2026

Collaborator

No description provided.

kali added 4 commits June 17, 2026 09:53
The report flagged reds against the latest nightly value, while the bundle's retry
compared against the median shipped in expectations. When the two differed, a red
could appear that was never retried (breaking retry==red). Single-source the baseline
in bench_common.reference_value (median of recent non-null values) and use it for both
the report's reference and the shipped expectation. The median also makes the reference
robust to a noisy last nightly.
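As a rough illustration of the single-sourced baseline described above, here is a minimal Python sketch. Only the name `reference_value` and the "median of recent non-null" rule come from the commit message; the signature, the `window` parameter, and its default are assumptions made for the example.

```python
from statistics import median
from typing import Optional, Sequence


def reference_value(history: Sequence[Optional[float]], window: int = 7) -> Optional[float]:
    """Median of the most recent non-null values, or None if there are none.

    `window` (how many recent values count) is a hypothetical parameter.
    """
    # Drop missing nightlies first, then keep only the newest `window` values.
    recent = [v for v in history if v is not None][-window:]
    return median(recent) if recent else None


# Oldest to newest; the last nightly is a noisy outlier.
nightlies = [1.0, 1.1, None, 0.9, 1.0, 5.0]
baseline = reference_value(nightlies)  # 1.0, not the noisy 5.0
```

If both the report and the expectations generator call the same function, the retry and the red are judged against the same number by construction.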
A superseding PR push cancels the bench via concurrency, but the report job ran on
always() with no results and posted a vacuous '0 metrics / no regressions' over the
real comment. Gate the report on !cancelled() instead of always(), have bench-report
skip writing a comment when there are no comparable metrics, and have the post step
skip when no comment file exists.
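The two script-side guards described above (the `!cancelled()` gate itself lives in the workflow and is not shown) might look like the following sketch. All function names and the comment layout are hypothetical; the real bench-report output is the richer comment posted by the bot.

```python
from pathlib import Path
from typing import Mapping, Optional


def render_comment(deltas: Mapping[str, float]) -> Optional[str]:
    """Return a comment body, or None when nothing was comparable."""
    if not deltas:
        return None  # no comparable metrics: say nothing rather than "no regressions"
    lines = [f"{name}: {delta:+.1%}" for name, delta in sorted(deltas.items())]
    return "\n".join(lines) + "\n"


def write_comment(deltas: Mapping[str, float], path: Path) -> bool:
    """Write the comment file only if there is something to report."""
    body = render_comment(deltas)
    if body is None:
        return False
    path.write_text(body, encoding="utf-8")
    return True


def should_post(path: Path) -> bool:
    """The post step skips when the report produced no comment file."""
    return path.is_file()
```

With both guards in place, a cancelled or empty run leaves the previous comment untouched.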
…iagnosing mac no-retry)

Temporary: before the bench loop, print the EXPECTATIONS path, its -s test, its line
count, and the pwd, and log (once) whether bench_run takes the single-shot or retry path.
The goal is to pin down why the macOS runner logs 0 retries despite a present, correct
expectations file. Revert once answered.
The cuda runner free-boosts from idle (210 MHz) up to its max, and the boost state
varies session-to-session, adding ~16% variance to GPU evaltime that the in-run retry
can't damp (the retries share the same session). Lock the graphics clock for the bench
and reset it on exit via a trap. Best-effort: it needs privilege, like the CPU governor
pin, and is a no-op without nvidia-smi. Defaults to the max supported clock; set
BENCH_GPU_CLOCK if that throttles.
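The commit does this in shell with a trap; the same lock-then-always-reset pattern can be sketched in Python as a context manager. This is an illustration under assumptions, not the PR's code: `run_quiet`, `locked_gpu_clock`, and the `have_smi` switch are hypothetical names, while the `nvidia-smi` options used (`--query-gpu=clocks.max.graphics`, `--lock-gpu-clocks`, `--reset-gpu-clocks`) are standard ones.

```python
import os
import shutil
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Optional, Sequence

Runner = Callable[[Sequence[str]], Optional[str]]


def run_quiet(cmd: Sequence[str]) -> Optional[str]:
    """Run a command and return its stdout, or None on any failure."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # e.g. missing privilege: stay best-effort
    return done.stdout


@contextmanager
def locked_gpu_clock(run: Runner = run_quiet,
                     have_smi: Optional[bool] = None) -> Iterator[bool]:
    """Lock the graphics clock for the duration of the block; yields True if locked."""
    if have_smi is None:
        have_smi = shutil.which("nvidia-smi") is not None
    if not have_smi:
        yield False  # no-op without nvidia-smi
        return
    clock = os.environ.get("BENCH_GPU_CLOCK")
    if not clock:
        # Default to the max supported graphics clock (MHz).
        out = run(["nvidia-smi", "--query-gpu=clocks.max.graphics",
                   "--format=csv,noheader,nounits"])
        clock = out.split()[0] if out and out.split() else None
    locked = clock is not None and run(
        ["nvidia-smi", f"--lock-gpu-clocks={clock},{clock}"]) is not None
    try:
        yield locked
    finally:
        if locked:
            run(["nvidia-smi", "--reset-gpu-clocks"])  # plays the role of the trap
```

The bench loop would run inside `with locked_gpu_clock(): ...`, so the clock is reset even if the bench raises.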
@github-actions


🔴 Bench vs main — 3 speed regression(s)

Reference: main nightly, latest 2026-06-17 (0d old) · PR f9cf81572 · ran on apple-m1-max, i9-11900kb_rtx-4060, jetson-orin-nx · 1089 metrics compared

Speed — evaltime · prefill · decode

| Δ | metric | device | main → PR |
|---|---|---|---|
| 🔴 +38.3% | speaker_id (evaltime · pulse8) | apple-m1-max | 0.067 ms → 0.0926 ms |
| 🔴 +14.9% | parakeet_tdt_600m_v3_f32f32_decoder_pass (evaltime · cuda) | jetson-orin-nx | 0.989 ms → 1.14 ms |
| 🔴 +8.6% | parakeet_tdt_600m_v3_f32f32_preprocessor_1s (evaltime · cuda) | i9-11900kb_rtx-4060 | 3.26 ms → 3.54 ms |

🟢 17 improvement(s)

| Δ | metric | device | main → PR |
|---|---|---|---|
| 🟢 -7.8% | voicecom_fake_quant (evaltime · 2sec) | i9-11900kb_rtx-4060 | 5.21 ms → 4.81 ms |
| 🟢 -6.0% | voicecom_float (evaltime · 2sec) | i9-11900kb_rtx-4060 | 4.64 ms → 4.36 ms |
| 🟢 -70.9% | parakeet_tdt_600m_v3_f32f32_preprocessor_1s (load+optimize · metal) | apple-m1-max | 254 ms → 74 ms |
| 🟢 -69.8% | parakeet_tdt_600m_v3_f32f32_joint_pass (load+optimize · metal) | apple-m1-max | 262 ms → 79 ms |
| 🟢 -65.8% | parakeet_tdt_600m_v3_f32f32_decoder_pass (load+optimize · metal) | apple-m1-max | 272 ms → 93 ms |
| 🟢 -45.1% | parakeet_tdt_600m_v3_f32f32_preprocessor_1s (RSS @ ready · metal) | apple-m1-max | 116 MB → 63.4 MB |
| 🟢 -31.4% | parakeet_tdt_600m_v3_f32f32_decoder_pass (RSS @ ready · metal) | apple-m1-max | 165 MB → 113 MB |
| 🟢 -20.1% | openelm_270M_q40ef16_541 (load+optimize · metal) | apple-m1-max | 820 ms → 655 ms |
| 🟢 -20.0% | llama_3_2_1B_instruct_q40ef16_541 (load+optimize · metal) | apple-m1-max | 908 ms → 726 ms |
| 🟢 -18.9% | openelm_270M_q40ef16_516 (load+optimize · metal) | apple-m1-max | 927 ms → 751 ms |
| 🟢 -17.1% | openelm_270M_q40ef16_541 (RSS @ ready · metal) | apple-m1-max | 332 MB → 275 MB |
| 🟢 -16.8% | openelm_270M_q40ef16_516 (RSS @ ready · metal) | apple-m1-max | 333 MB → 277 MB |
| 🟢 -12.0% | qwen3_1_7B_q40ef16_541 (load+optimize · metal) | apple-m1-max | 1.51 s → 1.33 s |
| 🟢 -10.5% | llama_3_2_3B_q40ef32_516 (load+optimize · metal) | apple-m1-max | 1.75 s → 1.57 s |
| 🟢 -7.2% | llama_3_1_8B_instruct_q40ef16_541 (load+optimize · metal) | apple-m1-max | 2.55 s → 2.37 s |
| 🟢 -6.6% | llama_3_2_1B_instruct_q40ef16_541 (RSS @ ready · metal) | apple-m1-max | 837 MB → 782 MB |
| 🟢 -6.6% | llama_3_2_1B_q40ef32_516 (RSS @ ready · metal) | apple-m1-max | 838 MB → 782 MB |

lower is better except prefill/decode (tok/s) · adaptive thresholds (max(floor, k×noise) vs the series' own history) · single-shot vs nightly reference · full table → run summary
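
The adaptive threshold in the footnote above, max(floor, k×noise) measured against the series' own history, can be sketched as follows. Only that formula and the "lower is better except prefill/decode" rule come from the report; the noise estimator (median absolute deviation relative to the median), the `floor` and `k` defaults, and all function names are assumptions for illustration.

```python
from statistics import median
from typing import Sequence


def relative_noise(history: Sequence[float]) -> float:
    """Assumed noise estimate: median absolute deviation over the median."""
    center = median(history)
    if center == 0:
        return 0.0
    return median(abs(v - center) for v in history) / abs(center)


def threshold(history: Sequence[float], floor: float = 0.05, k: float = 3.0) -> float:
    """max(floor, k * noise); falls back to the floor with no history."""
    if not history:
        return floor
    return max(floor, k * relative_noise(history))


def is_regression(reference: float, value: float, history: Sequence[float],
                  higher_is_better: bool = False) -> bool:
    """Flag a red when the relative change in the bad direction exceeds the threshold."""
    delta = (value - reference) / reference
    if higher_is_better:  # prefill/decode are tok/s, so a drop is the regression
        delta = -delta
    return delta > threshold(history)


quiet = [1.0, 1.0, 1.0, 1.0, 1.0]   # stable series: threshold stays at the floor
noisy = [1.0, 1.2, 0.8, 1.0, 1.2]   # noisy series: threshold widens
```

Under these assumptions a +10% change is a red on the quiet series but not on the noisy one, which is the point of scaling the threshold by each series' own noise.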

kali commented Jun 17, 2026

Collaborator Author

you've been RIIRed.

kali closed this Jun 17, 2026
kali added a commit that referenced this pull request Jun 17, 2026
Carry the still-relevant fix/bench-retry-bugs fixes into the Rust suite
(that PR edits the shell/python files this branch deletes):
- single-source the comparison baseline as the median of recent non-null
  (bench_common::reference_value), used by both bench-expectations and the
  report, so retry and the PR red judge the same number (retry==red);
- report writes no comment when there are no comparable metrics, and the
  workflow gates the report on !cancelled() + skips posting a missing file,
  so a superseded run can't overwrite a real comment with 'no regressions';
- pin the GPU graphics clock for the run, reset on drop (cuda free-boost
  variance), alongside the CPU governor pin.

Also build/clippy/test the bench-suite feature in the crates workflow (it
is not a default feature, so nothing else compiled it), and add unit tests
for the threshold math.