linalg,core: SIMD ReduceMin (mirror the max reducer)#2368
Open
czoli1976 wants to merge 2 commits into
`max_t` (the f32 ReduceMax reducer) called the vectorized `max_f32` linalg kernel, *discarded its result*, then unconditionally recomputed the max with a scalar partial-ord fold over the same slice. ReduceMax therefore did the reduction twice and was effectively scalar-bound (the "optimized" path was strictly slower than having no kernel at all).

Return the SIMD kernel's result for the f32 contiguous case; fall through to the scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty slices.

Adds a correctness test covering both branches (contiguous + tail, strided, single-element).

Benchmark (M-series, f32 max over the trailing axis, via the added reduce_max_bench example):

    shape         before               after               speedup
    1024 x 4096   2.44 ms / 6.9 GB/s   0.32 ms / 52 GB/s   7.5x
    4096 x 1024   2.46 ms / 6.8 GB/s   0.42 ms / 40 GB/s   5.9x
    256 x 65536   9.44 ms / 7.1 GB/s   1.04 ms / 65 GB/s   9.1x

Identical results. Benefits ReduceMax, MaxPool and the softmax max pre-pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
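The dispatch described in this commit can be sketched as follows. This is a minimal illustration, not the code in the PR: `simd_like_max`, `scalar_max` and `reduce_max` are hypothetical names, the "kernel" is plain Rust with four independent lanes standing in for the real `max_f32`, and seeding the fold with negative infinity is an assumption of the sketch.

```rust
/// Hypothetical stand-in for a vectorized kernel such as `max_f32`:
/// four independent lanes over the body, then a scalar pass over the tail.
fn simd_like_max(xs: &[f32]) -> f32 {
    let mut lanes = [f32::NEG_INFINITY; 4];
    let mut chunks = xs.chunks_exact(4);
    for c in chunks.by_ref() {
        for (lane, &x) in lanes.iter_mut().zip(c) {
            *lane = (*lane).max(x);
        }
    }
    // Combine the lanes, then fold in the (at most 3) leftover elements.
    let body = lanes.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    chunks.remainder().iter().copied().fold(body, f32::max)
}

/// Scalar fallback: handles any stride >= 1 and an empty selection.
/// (Assumption of this sketch: an empty reduction yields negative infinity.)
fn scalar_max(data: &[f32], stride: usize, len: usize) -> f32 {
    data.iter()
        .step_by(stride)
        .take(len)
        .copied()
        .fold(f32::NEG_INFINITY, f32::max)
}

/// Dispatch: return the kernel's result for contiguous, non-empty input,
/// and only fall through to the scalar fold otherwise.
fn reduce_max(data: &[f32], stride: usize, len: usize) -> f32 {
    if stride == 1 && len > 0 {
        simd_like_max(&data[..len])
    } else {
        scalar_max(data, stride, len)
    }
}
```

The point of the fix is the early return in `reduce_max`: the kernel's value is used directly, so the contiguous case never reaches the scalar fold.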
`min_t` had no SIMD path: it ran a scalar branchy partial-ord fold while `max_t` uses a hand-vectorized `max_f32` kernel. Mirror the max reducer for min:

- generic `SMin4` (fallback), arm64 NEON `arm64simd_min_f32_16n` (fminv), and x86 AVX2 `x86_64_fma_min_f32_32n` (vminps), with the same structure/wiring as their max counterparts, registered as `Ops::min_f32`.
- `min_t` now routes the contiguous f32 case through `min_f32` and falls back to the scalar fold for non-f32 / strided / empty slices (same shape as `max_t`).
- `min_frame_tests!` macro (mirror of `max_frame_tests!`) validates each kernel against the reference; plus a core reduce-min correctness test.

Note: the generic-framework reducer is actually *slower* than the scalar fold (measured ~3 vs ~8 GB/s), so the win comes from the hand-written NEON/AVX2 kernels, not the generic path. This matches how max behaves (generic is a correctness fallback only).

Benchmark (M-series, f32 min over the trailing axis, scalar fold vs NEON, via the added reduce_min_bench example):

    shape         scalar    NEON      speedup
    1024 x 4096   2.05 ms   0.34 ms   6.0x (49 GB/s)
    4096 x 1024   1.84 ms   0.38 ms   4.8x (44 GB/s)
    256 x 65536   7.79 ms   1.05 ms   7.4x (64 GB/s)

(x86 AVX2 kernel mirrors the validated max kernel; correctness covered by the min frame test in CI, perf parity by construction.)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
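To illustrate the "validate each kernel against the reference" idea, here is a small sketch of a lane-based min reducer checked against a scalar fold over many lengths, so that both the chunked body and the tail are exercised. The names `min4`, `reference_min` and `check_against_reference` are hypothetical, the 4-lane width is only illustrative (the real kernels work on wider blocks, as their `_16n` / `_32n` names suggest), and the sketch does not consider NaN inputs.

```rust
/// Hypothetical 4-lane min reducer: independent lanes over the body,
/// then a scalar pass over the tail.
fn min4(xs: &[f32]) -> f32 {
    let mut lanes = [f32::INFINITY; 4];
    let mut chunks = xs.chunks_exact(4);
    for c in chunks.by_ref() {
        for (lane, &x) in lanes.iter_mut().zip(c) {
            *lane = (*lane).min(x);
        }
    }
    let body = lanes.iter().copied().fold(f32::INFINITY, f32::min);
    chunks.remainder().iter().copied().fold(body, f32::min)
}

/// Scalar reference the kernel is compared against.
fn reference_min(xs: &[f32]) -> f32 {
    xs.iter().copied().fold(f32::INFINITY, f32::min)
}

/// Compare kernel and reference for every length in 0..=max_len,
/// using deterministic, NaN-free test data.
fn check_against_reference(max_len: usize) -> bool {
    (0..=max_len).all(|n| {
        let xs: Vec<f32> = (0..n)
            .map(|i| ((i * 37 + 11) % 101) as f32 - 50.0)
            .collect();
        min4(&xs) == reference_min(&xs)
    })
}
```

Sweeping every length rather than a few round sizes is what catches tail-handling bugs, which is the usual failure mode when a max kernel is copied and adapted for min.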
412ad5e to
602c714
Compare
What

`min_t` (the f32 `ReduceMin` reducer) had no SIMD path: it ran a scalar branchy partial-ord fold, while `max_t` uses a hand-vectorized `max_f32` kernel. This mirrors the max reducer for min:

- Generic `SMin4` (fallback), arm64 NEON `arm64simd_min_f32_16n` (`fmin`/`fminv`), and x86 AVX2 `x86_64_fma_min_f32_32n` (`vminps`), with the same structure/wiring as their max counterparts, registered as `Ops::min_f32`.
- `min_t` routes the contiguous f32 case through `min_f32`, falling back to the scalar fold for non-f32 / strided / empty slices (same shape as `max_t`).
- `min_frame_tests!` macro (mirror of `max_frame_tests!`) validates each kernel against the reference; plus a core reduce-min correctness test.

Honest note on the generic path

While implementing this I measured that the generic-framework reducer is actually slower than the scalar fold (~3 vs ~8 GB/s): the per-row framework overhead outweighs its 4-wide inner loop. So the win comes entirely from the hand-written NEON/AVX2 kernels, not the generic path. This matches how `max` already behaves: its generic `SMax4` is a correctness fallback, and the speed comes from the arm64/x86 kernels. (On purely generic/wasm builds, `min_f32` uses the generic reducer, same as `max_f32`.)

Benchmark

M-series, f32 min over the trailing (contiguous) axis, scalar fold vs NEON, via the added `core/examples/reduce_min_bench.rs`:

| shape | scalar | NEON | speedup |
| --- | --- | --- | --- |
| 1024 x 4096 | 2.05 ms | 0.34 ms (49 GB/s) | 6.0x |
| 4096 x 1024 | 1.84 ms | 0.38 ms (44 GB/s) | 4.8x |
| 256 x 65536 | 7.79 ms | 1.05 ms (64 GB/s) | 7.4x |

Benefits `ReduceMin` and `MinPool`.

Testing

- `min_frame_tests!` runs against the generic + NEON kernels (and the x86 kernel on x86 CI), each checked vs the reference.
- `reduce_min_f32_contiguous_and_strided` (contiguous incl. tail, strided).
- `tract-core` + `tract-linalg` suites pass on arm64.

The x86 AVX2 kernel mirrors the already-validated max kernel (`vmaxps` → `vminps`); I can't perf-test it on this Apple-Silicon host, but its correctness is covered by the min frame test in x86 CI and perf parity follows by construction.

🤖 Generated with Claude Code
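For readers who want to check the benchmark figures reported in this PR, the arithmetic can be sketched as below. The helper names are hypothetical, and two conventions are assumptions of the sketch (though they do reproduce the reported numbers): only the f32 input bytes are counted, and 1 GB is taken as 10^9 bytes.

```rust
/// Bytes read when reducing a `rows x cols` f32 tensor once (4 bytes per element).
fn bytes_read(rows: u64, cols: u64) -> u64 {
    rows * cols * 4
}

/// Throughput in GB/s (10^9 bytes per second) for a run taking `millis` milliseconds.
fn throughput_gb_s(rows: u64, cols: u64, millis: f64) -> f64 {
    bytes_read(rows, cols) as f64 / (millis * 1e-3) / 1e9
}

/// Speedup of `after_ms` relative to `before_ms`.
fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}
```

For example, the 1024 x 4096 case reads 16,777,216 bytes; at 0.34 ms that is about 49 GB/s, and 2.05 ms / 0.34 ms is about 6.0x, in line with the figures quoted for the NEON kernel.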