Fix/bench retry bugs by kali · Pull Request #2384 · sonos/tract

kali · 2026-06-17T11:35:59Z

No description provided.

The report flagged reds against the latest nightly value, while the bundle's retry compared against the median shipped in expectations. When the two differed, a red could appear that was never retried (breaking retry==red). Single-source the baseline in bench_common.reference_value (median of recent non-null values) and use it for both the report's reference and the shipped expectation. The median also makes the reference robust to a noisy last nightly.
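To illustrate the single-sourced baseline, here is a minimal sketch of what a "median of recent non-null" reference could look like. The signature, the `window` parameter and its default of 7 are assumptions for illustration; only the name `reference_value` and the median-of-recent-non-null idea come from the commit message.

```python
from statistics import median
from typing import Optional, Sequence


def reference_value(history: Sequence[Optional[float]], window: int = 7) -> Optional[float]:
    """Median of the most recent non-null samples, or None if there are none.

    `history` is ordered oldest to newest. `window` (an assumed default) caps
    how many non-null samples are taken into account.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = [v for v in history if v is not None][-window:]
    return median(recent) if recent else None


# A noisy last nightly (1.60) barely moves the reference: the median of the
# five non-null samples is 1.01, whereas "latest nightly" would give 1.60.
nightlies = [1.00, 1.02, None, 0.99, 1.01, 1.60]
reference = reference_value(nightlies)
```

If both the report and the expectations file call the same function, the retry and the PR red cannot disagree about the baseline, which is the retry==red property described above.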

A superseding PR push cancels the bench via concurrency, but the report job ran on always() with no results and posted a vacuous '0 metrics / no regressions' over the real comment. Gate the report on !cancelled() instead of always(), have bench-report skip writing a comment when there are no comparable metrics, and have the post step skip when no comment file exists.
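The report-side half of this fix can be sketched as follows. The function names, the shape of `comparisons` and the comment layout are all hypothetical; the point is only that an empty comparison set produces no file, so the post step has nothing to overwrite the real comment with.

```python
from pathlib import Path
from typing import Mapping, Optional, Tuple


def write_report_comment(
    comparisons: Mapping[str, Tuple[Optional[float], Optional[float]]],
    out_path: Path,
) -> bool:
    """Write the comment file only if at least one metric is comparable.

    `comparisons` maps a metric name to (reference, measured). A metric is
    comparable when both values are present and the reference is non-zero.
    Returns True if a file was written.
    """
    comparable = {
        name: (ref, new)
        for name, (ref, new) in comparisons.items()
        if ref is not None and new is not None and ref != 0
    }
    if not comparable:
        # Cancelled or empty run: write nothing rather than "no regressions".
        return False
    lines = [f"{len(comparable)} metrics compared"]
    for name, (ref, new) in sorted(comparable.items()):
        delta = (new - ref) / ref * 100.0
        lines.append(f"{delta:+.1f}%\t{name}\t{ref} -> {new}")
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True


def should_post(out_path: Path) -> bool:
    """Post-step guard: skip posting when the report produced no comment file."""
    return out_path.is_file()
```

The workflow-level gate (`!cancelled()` instead of `always()`) is a separate safeguard; this sketch only covers the case where the report job runs anyway but has no results to compare.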

…iagnosing mac no-retry) Temporary: print EXPECTATIONS path/-s/line-count/pwd before the bench loop, and whether bench_run takes the single-shot or retry path (once). To pin why the macOS runner logs 0 retries despite a present, correct expectations file. Revert once answered.

The cuda runner free-boosts from idle (210MHz) up to its max, and the boost state varies session-to-session, adding ~16% variance to GPU evaltime that the in-run retry can't damp (shared session). Lock the graphics clock for the bench and reset it on exit via a trap. Best-effort (needs privilege, like the CPU governor pin; no-op without nvidia-smi). Defaults to the max supported clock; set BENCH_GPU_CLOCK if that throttles.
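The lock-then-reset pattern could look roughly like the sketch below. It is written in Python with `try`/`finally` rather than the shell `trap` the commit describes, and the helper names are hypothetical. The `nvidia-smi` options used (`--query-gpu=clocks.max.graphics`, `--lock-gpu-clocks`, `--reset-gpu-clocks`) should be checked against `nvidia-smi --help` on the target driver; treating `BENCH_GPU_CLOCK` as a value in MHz is an assumption.

```python
import os
import shutil
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List, Optional

Runner = Callable[[List[str]], Optional[str]]


def _run(cmd: List[str]) -> Optional[str]:
    """Run a command and return its stdout, or None on any failure (best-effort)."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return done.stdout.strip()


@contextmanager
def locked_gpu_clock(
    run: Runner = _run,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Iterator[Optional[int]]:
    """Lock the graphics clock for the duration of the block, then reset it.

    Yields the locked clock in MHz, or None when nothing was locked.
    """
    if which("nvidia-smi") is None:
        yield None  # no-op without nvidia-smi
        return
    mhz = os.environ.get("BENCH_GPU_CLOCK")  # assumed to be a value in MHz
    if not mhz:
        # Default to the max supported graphics clock reported by the driver.
        out = run(["nvidia-smi", "--query-gpu=clocks.max.graphics",
                   "--format=csv,noheader,nounits"])
        mhz = out.splitlines()[0].strip() if out else None
    if not mhz or not mhz.isdigit():
        yield None
        return
    # Locking needs privilege; a failure here just leaves the clock unlocked.
    locked = run(["nvidia-smi", f"--lock-gpu-clocks={mhz},{mhz}"]) is not None
    try:
        yield int(mhz) if locked else None
    finally:
        if locked:
            run(["nvidia-smi", "--reset-gpu-clocks"])
```

Wrapping the bench loop in `with locked_gpu_clock():` means the reset runs even if the bench raises, which is the same guarantee the `trap` gives the shell version.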

github-actions · 2026-06-17T12:25:00Z

🔴 Bench vs main — 3 speed regression(s)

Reference: main nightly, latest 2026-06-17 (0d old) · PR f9cf81572 · ran on apple-m1-max, i9-11900kb_rtx-4060, jetson-orin-nx · 1089 metrics compared

Speed — evaltime · prefill · decode

Δ	metric	device	main → PR
🔴 +38.3%	speaker_id _{evaltime · pulse8}	`apple-m1-max`	0.067 ms → 0.0926 ms
🔴 +14.9%	parakeet_tdt_600m_v3_f32f32_decoder_pass _{evaltime · cuda}	`jetson-orin-nx`	0.989 ms → 1.14 ms
🔴 +8.6%	parakeet_tdt_600m_v3_f32f32_preprocessor_1s _{evaltime · cuda}	`i9-11900kb_rtx-4060`	3.26 ms → 3.54 ms

🟢 17 improvement(s)

Δ	metric	device	main → PR
🟢 -7.8%	voicecom_fake_quant _{evaltime · 2sec}	`i9-11900kb_rtx-4060`	5.21 ms → 4.81 ms
🟢 -6.0%	voicecom_float _{evaltime · 2sec}	`i9-11900kb_rtx-4060`	4.64 ms → 4.36 ms
🟢 -70.9%	parakeet_tdt_600m_v3_f32f32_preprocessor_1s _{load+optimize · metal}	`apple-m1-max`	254 ms → 74 ms
🟢 -69.8%	parakeet_tdt_600m_v3_f32f32_joint_pass _{load+optimize · metal}	`apple-m1-max`	262 ms → 79 ms
🟢 -65.8%	parakeet_tdt_600m_v3_f32f32_decoder_pass _{load+optimize · metal}	`apple-m1-max`	272 ms → 93 ms
🟢 -45.1%	parakeet_tdt_600m_v3_f32f32_preprocessor_1s _{RSS @ ready · metal}	`apple-m1-max`	116 MB → 63.4 MB
🟢 -31.4%	parakeet_tdt_600m_v3_f32f32_decoder_pass _{RSS @ ready · metal}	`apple-m1-max`	165 MB → 113 MB
🟢 -20.1%	openelm_270M_q40ef16_541 _{load+optimize · metal}	`apple-m1-max`	820 ms → 655 ms
🟢 -20.0%	llama_3_2_1B_instruct_q40ef16_541 _{load+optimize · metal}	`apple-m1-max`	908 ms → 726 ms
🟢 -18.9%	openelm_270M_q40ef16_516 _{load+optimize · metal}	`apple-m1-max`	927 ms → 751 ms
🟢 -17.1%	openelm_270M_q40ef16_541 _{RSS @ ready · metal}	`apple-m1-max`	332 MB → 275 MB
🟢 -16.8%	openelm_270M_q40ef16_516 _{RSS @ ready · metal}	`apple-m1-max`	333 MB → 277 MB
🟢 -12.0%	qwen3_1_7B_q40ef16_541 _{load+optimize · metal}	`apple-m1-max`	1.51 s → 1.33 s
🟢 -10.5%	llama_3_2_3B_q40ef32_516 _{load+optimize · metal}	`apple-m1-max`	1.75 s → 1.57 s
🟢 -7.2%	llama_3_1_8B_instruct_q40ef16_541 _{load+optimize · metal}	`apple-m1-max`	2.55 s → 2.37 s
🟢 -6.6%	llama_3_2_1B_instruct_q40ef16_541 _{RSS @ ready · metal}	`apple-m1-max`	837 MB → 782 MB
🟢 -6.6%	llama_3_2_1B_q40ef32_516 _{RSS @ ready · metal}	`apple-m1-max`	838 MB → 782 MB

lower is better except prefill/decode (tok/s) · adaptive thresholds (max(floor, k×noise) vs the series' own history) · single-shot vs nightly reference · full table → run summary
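As a rough illustration of the `max(floor, k×noise)` rule in the footer above, the sketch below classifies one measurement against its reference. The noise estimator (median absolute deviation relative to the median), the 5% floor and `k = 3` are assumptions chosen for the example, not the values the bench suite uses.

```python
from statistics import median
from typing import Sequence


def relative_noise(history: Sequence[float]) -> float:
    """Median absolute deviation of the history, relative to its median."""
    if len(history) < 2:
        return 0.0
    center = median(history)
    if center == 0:
        return 0.0
    return median([abs(v - center) for v in history]) / abs(center)


def regression_threshold(history: Sequence[float], floor: float = 0.05, k: float = 3.0) -> float:
    """Adaptive threshold: never below `floor`, widened for noisy series."""
    return max(floor, k * relative_noise(history))


def classify(
    reference: float,
    measured: float,
    history: Sequence[float],
    higher_is_better: bool = False,
) -> str:
    """Return "red", "green" or "neutral" for one metric."""
    delta = (measured - reference) / reference
    if higher_is_better:  # e.g. prefill/decode in tok/s
        delta = -delta
    limit = regression_threshold(history)
    if delta > limit:
        return "red"
    if delta < -limit:
        return "green"
    return "neutral"


quiet = [1.0, 1.0, 1.0, 1.0]
noisy = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8]
verdict_quiet = classify(1.0, 1.10, quiet)  # +10% against a flat series
verdict_noisy = classify(1.0, 1.10, noisy)  # same +10% against a noisy series
```

With these assumed parameters the same +10% change is flagged on the flat series and ignored on the noisy one, which is the behaviour an adaptive threshold is meant to provide.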

kali · 2026-06-17T14:57:21Z

you've been RIIRed.

Carry the still-relevant fix/bench-retry-bugs fixes into the Rust suite (that PR edits the shell/python files this branch deletes):

- single-source the comparison baseline as the median of recent non-null (bench_common::reference_value), used by both bench-expectations and the report, so retry and the PR red judge the same number (retry==red);
- report writes no comment when there are no comparable metrics, and the workflow gates the report on !cancelled() + skips posting a missing file, so a superseded run can't overwrite a real comment with 'no regressions';
- pin the GPU graphics clock for the run, reset on drop (cuda free-boost variance), alongside the CPU governor pin.

Also build/clippy/test the bench-suite feature in the crates workflow (it is not a default feature, so nothing else compiled it), and add unit tests for the threshold math.

kali added 4 commits June 17, 2026 09:53

kali closed this Jun 17, 2026

kali wants to merge 4 commits into main from fix/bench-retry-bugs
