core: don't recompute ReduceMax scalar after the SIMD max kernel#2367
Open
czoli1976 wants to merge 1 commit into
Open
core: don't recompute ReduceMax scalar after the SIMD max kernel#2367czoli1976 wants to merge 1 commit into
czoli1976 wants to merge 1 commit into
Conversation
af0dc3a to
760b977
Compare
`max_t` (the f32 ReduceMax reducer) called the vectorized `max_f32` linalg kernel, *discarded its result*, then unconditionally recomputed the max with a scalar partial-ord fold over the same slice — so ReduceMax did the reduction twice and was effectively scalar-bound (the "optimized" path was strictly slower than having no kernel at all). Return the SIMD kernel's result for the f32 contiguous case; fall through to the scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty slices. Adds a correctness test covering both branches (contiguous + tail, strided, single-element). Benchmark (M-series, f32 max over the trailing axis, via the added reduce_max_bench example): shape before after speedup 1024 x 4096 2.44 ms / 6.9GB/s 0.32 ms / 52GB/s 7.5x 4096 x 1024 2.46 ms / 6.8GB/s 0.42 ms / 40GB/s 5.9x 256 x 65536 9.44 ms / 7.1GB/s 1.04 ms / 65GB/s 9.1x Identical results. Benefits ReduceMax, MaxPool and the softmax max pre-pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Collaborator
|
Rebased! |
760b977 to
0c9ae85
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
max_t(the f32ReduceMaxreducer incore/src/ops/nn/reduce.rs) called the vectorizedmax_f32linalg kernel, discarded its result, then unconditionally recomputed the max with a scalar partial-ord fold over the same slice:So every f32
ReduceMaxdid the reduction twice and was effectively scalar-bound — the "optimized" path was strictly slower than having no kernel at all.The fix returns the SIMD kernel's result for the f32 contiguous case, and falls through to the scalar fold only for non-f32 dtypes, non-contiguous (strided) slices, or empty slices.
Benchmark
M-series, f32 max over the trailing (contiguous) axis, via the added
core/examples/reduce_max_bench.rs:Identical results. Benefits
ReduceMax,MaxPool, and the softmax max pre-pass.Testing
reduce_max_f32_contiguous_and_stridedcovers both branches (contiguous incl. non-multiple-of-SIMD-width tail, strided, single-element).tract-coresuite passes (248).Files
core/src/ops/nn/reduce.rs— use the SIMD result; + testcore/examples/reduce_max_bench.rs— benchmark🤖 Generated with Claude Code