core: don't recompute ReduceMax scalar after the SIMD max kernel by czoli1976 · Pull Request #2367 · sonos/tract

czoli1976 · 2026-06-13T13:19:49Z

What

max_t (the f32 ReduceMax reducer in core/src/ops/nn/reduce.rs) called the vectorized max_f32 linalg kernel, discarded its result, then unconditionally recomputed the max with a scalar partial-ord fold over the same slice:

if T::datum_type() == f32::datum_type() && let Some(slice) = v.as_slice() {
    let slice = unsafe { transmute::<&[T], &[f32]>(slice) };
    (tract_linalg::ops().max_f32)().run(slice).unwrap();   // result thrown away
}
v.fold(T::min_value(), |acc, &v| if acc > v { acc } else { v })   // recomputed scalar

So every f32 ReduceMax did the reduction twice and was effectively scalar-bound — the "optimized" path was strictly slower than having no kernel at all.

The fix returns the SIMD kernel's result for the f32 contiguous case, and falls through to the scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty slices.
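The dispatch described above can be sketched in isolation. This is a minimal illustration, not the code in core/src/ops/nn/reduce.rs: `View`, `simd_max_stub` and `reduce_max_f32` are hypothetical names, the stub is a plain chunked loop standing in for the real `max_f32` linalg kernel, and contiguity is modelled as "stride equals 1".

```rust
/// Hypothetical 1-D view: `stride == 1` means contiguous. Stride must be >= 1.
struct View<'a> {
    data: &'a [f32],
    stride: usize,
}

impl<'a> View<'a> {
    /// Returns the backing slice only when the view is contiguous.
    fn as_slice(&self) -> Option<&'a [f32]> {
        if self.stride == 1 { Some(self.data) } else { None }
    }
}

/// Stand-in for the vectorized kernel (assumption: not the tract_linalg code).
/// Keeps one running max per lane, then folds the lanes and the tail.
fn simd_max_stub(slice: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [f32::MIN; LANES];
    let mut chunks = slice.chunks_exact(LANES);
    for chunk in &mut chunks {
        for (a, &x) in acc.iter_mut().zip(chunk) {
            if x > *a {
                *a = x;
            }
        }
    }
    let mut best = f32::MIN;
    // The remainder holds the elements past the last full chunk.
    for &x in acc.iter().chain(chunks.remainder()) {
        if x > best {
            best = x;
        }
    }
    best
}

/// Fixed control flow: return the kernel result for a non-empty contiguous
/// view, and use the scalar fold only for strided or empty input.
fn reduce_max_f32(view: &View<'_>) -> f32 {
    if let Some(slice) = view.as_slice() {
        if !slice.is_empty() {
            return simd_max_stub(slice);
        }
    }
    view.data
        .iter()
        .step_by(view.stride)
        .fold(f32::MIN, |acc, &v| if acc > v { acc } else { v })
}
```

The early `return` is the whole point of the change: in the buggy version the kernel ran, its value was dropped, and the fold at the bottom always executed.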

Benchmark

M-series, f32 max over the trailing (contiguous) axis, via the added core/examples/reduce_max_bench.rs:

shape	before	after	speedup
1024 × 4096	2.44 ms (6.9 GB/s)	0.32 ms (51.8 GB/s)	7.5×
4096 × 1024	2.46 ms (6.8 GB/s)	0.42 ms (40.1 GB/s)	5.9×
256 × 65536	9.44 ms (7.1 GB/s)	1.04 ms (64.9 GB/s)	9.1×

Identical results. Benefits ReduceMax, MaxPool, and the softmax max pre-pass.
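For readers checking the table, the bandwidth and speedup columns can be approximately reproduced from the shapes and timings. The sketch below assumes 1 GB = 1e9 bytes and counts only the f32 input bytes read; both helper names are hypothetical and not part of reduce_max_bench.rs. Because the published timings are rounded, recomputed values differ slightly from the table (for example, 2.44 / 0.32 gives about 7.6 rather than 7.5).

```rust
/// Effective read bandwidth in GB/s, assuming 1 GB = 1e9 bytes and that
/// only the rows * cols f32 input elements are counted.
fn throughput_gb_per_s(rows: usize, cols: usize, millis: f64) -> f64 {
    let bytes = (rows * cols * std::mem::size_of::<f32>()) as f64;
    bytes / (millis * 1e-3) / 1e9
}

/// Ratio of the two wall-clock timings.
fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}
```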

Testing

New unit test reduce_max_f32_contiguous_and_strided covers both branches (contiguous incl. non-multiple-of-SIMD-width tail, strided, single-element).
Full tract-core suite passes (248).

Files

core/src/ops/nn/reduce.rs — use the SIMD result; + test
core/examples/reduce_max_bench.rs — benchmark

🤖 Generated with Claude Code

`max_t` (the f32 ReduceMax reducer) called the vectorized `max_f32` linalg kernel, *discarded its result*, then unconditionally recomputed the max with a scalar partial-ord fold over the same slice. ReduceMax therefore did the reduction twice and was effectively scalar-bound (the "optimized" path was strictly slower than having no kernel at all).

Return the SIMD kernel's result for the f32 contiguous case; fall through to the scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty slices.

Adds a correctness test covering both branches (contiguous + tail, strided, single-element).

Benchmark (M-series, f32 max over the trailing axis, via the added reduce_max_bench example):

shape	before	after	speedup
1024 x 4096	2.44 ms / 6.9GB/s	0.32 ms / 52GB/s	7.5x
4096 x 1024	2.46 ms / 6.8GB/s	0.42 ms / 40GB/s	5.9x
256 x 65536	9.44 ms / 7.1GB/s	1.04 ms / 65GB/s	9.1x

Identical results. Benefits ReduceMax, MaxPool and the softmax max pre-pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

kali · 2026-06-17T18:05:10Z

Rebased!

czoli1976 mentioned this pull request Jun 13, 2026

linalg,core: SIMD ReduceMin (mirror the max reducer) #2368

Open

czoli1976 force-pushed the perf/reduce-max-no-double branch from af0dc3a to 760b977 Compare June 13, 2026 14:12

kali force-pushed the perf/reduce-max-no-double branch from 760b977 to 0c9ae85 Compare June 17, 2026 18:05

czoli1976 wants to merge 1 commit into sonos:main from czoli1976:perf/reduce-max-no-double
