core: don't recompute ReduceMax scalar after the SIMD max kernel #2367

Open
czoli1976 wants to merge 1 commit into sonos:main from czoli1976:perf/reduce-max-no-double

@czoli1976

Contributor

What

max_t (the f32 ReduceMax reducer in core/src/ops/nn/reduce.rs) called the vectorized max_f32 linalg kernel, discarded its result, then unconditionally recomputed the max with a scalar partial-ord fold over the same slice:

if T::datum_type() == f32::datum_type() && let Some(slice) = v.as_slice() {
    let slice = unsafe { transmute::<&[T], &[f32]>(slice) };
    (tract_linalg::ops().max_f32)().run(slice).unwrap();   // result thrown away
}
v.fold(T::min_value(), |acc, &v| if acc > v { acc } else { v })   // recomputed scalar

So every f32 ReduceMax did the reduction twice and was effectively scalar-bound — the "optimized" path was strictly slower than having no kernel at all.

The fix returns the SIMD kernel's result for the f32 contiguous case, and falls through to the scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty slices.
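The dispatch described above can be sketched in isolation. This is a minimal illustration, not the code in reduce.rs: `simd_max_f32` is a hypothetical stand-in for the `max_f32` linalg kernel, `reduce_max` and its `stride` parameter are invented for the example, and the non-f32 dtype case is not modeled.

```rust
/// Hypothetical stand-in for the vectorized kernel; the real one lives in
/// tract_linalg and is not reproduced here.
fn simd_max_f32(xs: &[f32]) -> f32 {
    xs.iter().copied().fold(f32::MIN, f32::max)
}

/// Scalar fallback: the same partial-ord fold the original code used.
fn scalar_max<'a>(values: impl Iterator<Item = &'a f32>) -> f32 {
    values.fold(f32::MIN, |acc, &v| if acc > v { acc } else { v })
}

/// Illustrative reducer. `stride` (assumed >= 1) models a strided view over
/// `data`; `stride == 1` means the view is contiguous.
fn reduce_max(data: &[f32], stride: usize) -> f32 {
    // Fast path: contiguous and non-empty, so return the kernel's result
    // directly instead of discarding it and folding again.
    if stride == 1 && !data.is_empty() {
        return simd_max_f32(data);
    }
    // Fallback: strided or empty input goes through the scalar fold.
    scalar_max(data.iter().step_by(stride))
}
```

The point of the early `return` is that the slice is traversed once on the fast path; the scalar fold only runs when the kernel cannot be used.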

Benchmark

M-series, f32 max over the trailing (contiguous) axis, via the added core/examples/reduce_max_bench.rs:

shape          before                 after                   speedup
1024 × 4096    2.44 ms (6.9 GB/s)     0.32 ms (51.8 GB/s)     7.5×
4096 × 1024    2.46 ms (6.8 GB/s)     0.42 ms (40.1 GB/s)     5.9×
256 × 65536    9.44 ms (7.1 GB/s)     1.04 ms (64.9 GB/s)     9.1×

Identical results. Benefits ReduceMax, MaxPool, and the softmax max pre-pass.
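For readers checking the table, the GB/s column appears to be input bytes read divided by elapsed time. The helper below is a hypothetical sketch of that arithmetic, assuming 4-byte f32 elements and 1 GB = 10^9 bytes; it is not taken from reduce_max_bench.rs. The published figures were presumably computed from unrounded timings, so values derived from the rounded milliseconds can differ in the last digit.

```rust
/// Hypothetical helper: throughput in GB/s for reading a rows x cols f32
/// tensor once in `millis` milliseconds (1 GB taken as 1e9 bytes).
fn throughput_gb_s(rows: usize, cols: usize, millis: f64) -> f64 {
    let bytes = (rows * cols * std::mem::size_of::<f32>()) as f64;
    let seconds = millis * 1e-3;
    bytes / seconds / 1e9
}
```

For the first row, 1024 × 4096 × 4 bytes is about 16.8 MB, and 16.8 MB in 2.44 ms comes to roughly 6.9 GB/s, matching the "before" column.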

Testing

  • New unit test reduce_max_f32_contiguous_and_strided covers both branches (contiguous incl. non-multiple-of-SIMD-width tail, strided, single-element).
  • Full tract-core suite passes (248).
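The cases listed above (contiguous with a tail that is not a multiple of the vector width, strided, single-element) can be illustrated with a self-contained sketch. This is not the `reduce_max_f32_contiguous_and_strided` test itself: `chunked_max` is a hypothetical 4-lane scalar imitation of a SIMD kernel, and `reference_max` and `check_cases` are names invented for the example.

```rust
/// Hypothetical 4-lane chunked max that imitates a SIMD kernel in plain Rust:
/// full chunks update per-lane maxima, then the leftover tail is folded in.
fn chunked_max(xs: &[f32]) -> f32 {
    const LANES: usize = 4;
    let mut lanes = [f32::MIN; LANES];
    let mut chunks = xs.chunks_exact(LANES);
    for chunk in &mut chunks {
        for i in 0..LANES {
            if chunk[i] > lanes[i] {
                lanes[i] = chunk[i];
            }
        }
    }
    // Horizontal reduction across lanes, then the tail elements.
    let mut max = lanes.iter().copied().fold(f32::MIN, f32::max);
    for &v in chunks.remainder() {
        if v > max {
            max = v;
        }
    }
    max
}

/// Scalar reference: the partial-ord fold used as the fallback path.
fn reference_max<'a>(values: impl Iterator<Item = &'a f32>) -> f32 {
    values.fold(f32::MIN, |acc, &v| if acc > v { acc } else { v })
}

/// Exercises the three cases named in the bullet list.
fn check_cases() {
    // Contiguous, length 7: one full chunk plus a 3-element tail holding the max.
    let a = [1.0f32, -2.0, 3.0, 0.5, 4.0, 2.0, 9.0];
    assert_eq!(chunked_max(&a), reference_max(a.iter()));
    // Strided view (every second element): only the scalar reference applies.
    assert_eq!(reference_max(a.iter().step_by(2)), 9.0);
    // Single element: no full chunk, so the tail loop alone decides the result.
    assert_eq!(chunked_max(&[42.0]), 42.0);
}
```

Placing the maximum in the tail is deliberate: a kernel that mishandles the remainder would pass a test whose maximum sits inside a full chunk.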

Files

  • core/src/ops/nn/reduce.rs — use the SIMD result; + test
  • core/examples/reduce_max_bench.rs — benchmark

🤖 Generated with Claude Code

`max_t` (the f32 ReduceMax reducer) called the vectorized `max_f32` linalg
kernel, *discarded its result*, then unconditionally recomputed the max with a
scalar partial-ord fold over the same slice — so ReduceMax did the reduction
twice and was effectively scalar-bound (the "optimized" path was strictly slower
than having no kernel at all).

Return the SIMD kernel's result for the f32 contiguous case; fall through to the
scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty
slices. Adds a correctness test covering both branches (contiguous + tail,
strided, single-element).

Benchmark (M-series, f32 max over the trailing axis, via the added
reduce_max_bench example):

  shape          before            after            speedup
  1024 x 4096    2.44 ms / 6.9GB/s 0.32 ms / 52GB/s  7.5x
  4096 x 1024    2.46 ms / 6.8GB/s 0.42 ms / 40GB/s  5.9x
  256  x 65536   9.44 ms / 7.1GB/s 1.04 ms / 65GB/s  9.1x

Identical results. Benefits ReduceMax, MaxPool and the softmax max pre-pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@kali

kali commented Jun 17, 2026

Collaborator

Rebased!

@kali kali force-pushed the perf/reduce-max-no-double branch from 760b977 to 0c9ae85 Compare June 17, 2026 18:05