perf: 5.5x faster `fast_blur` with u32 accumulators by art049 · Pull Request #2846 · image-rs/image

art049 · 2026-03-12T22:11:44Z

Following #2809
This was already reviewed a bit here

Changes

Use u32 integer accumulators for u8 fast blur: the box blur hot path used f32 accumulators for all pixel types. For u8 images (the dominant case), every pixel went through to_f32/from_f32 conversions and a software roundf. This is replaced with u32 integer arithmetic.
Replace roundf with fast truncation in FloatNearest::to_u8/to_u16 on non-optimized architectures.
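
The first change can be illustrated with a minimal sketch of a one-dimensional box blur over a single-channel u8 row that keeps its running sum in a u32. The function name `box_blur_row_u8`, the clamped edge handling, and the round-half-up division are assumptions made for illustration; this is not the code from the PR.

```rust
/// Box blur of one single-channel row using a sliding `u32` sum.
/// Hypothetical helper: edges are handled by clamping indices.
fn box_blur_row_u8(src: &[u8], radius: usize) -> Vec<u8> {
    let n = src.len();
    if n == 0 {
        return Vec::new();
    }
    let r = radius as isize;
    let kernel = (2 * radius + 1) as u32;
    // Read a pixel as u32, clamping out-of-range indices to the row edges.
    let at = |i: isize| -> u32 { src[i.clamp(0, n as isize - 1) as usize] as u32 };

    // Sum of the window centered on pixel 0.
    let mut sum: u32 = (-r..=r).map(|i| at(i)).sum();
    let mut out = Vec::with_capacity(n);
    for x in 0..n as isize {
        // Integer division with rounding to nearest; no float conversion.
        out.push(((sum + kernel / 2) / kernel) as u8);
        // Slide the window: add the entering pixel, then drop the leaving one.
        sum += at(x + r + 1);
        sum -= at(x - r);
    }
    out
}

fn main() {
    let row = [0u8, 0, 255, 0, 0];
    println!("{:?}", box_blur_row_u8(&row, 1));
}
```

Because the sum of at most `2 * radius + 1` values of at most 255 fits comfortably in a u32 for any realistic radius, no intermediate float is needed.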

Results

Walltime on x86 on a build without sse4.1
Simulation results from CodSpeed

Walltime

Benchmark	Baseline	Optimized	Speedup
fast blur: sigma 3.0	41.8 ms	7.7 ms	×5.5
fast blur: sigma 7.0	42.1 ms	7.7 ms	×5.4
fast blur: sigma 50.0	42.5 ms	8.4 ms	×5.0
fast_blur	41.9 ms	8.5 ms	×4.9
gaussian blur: sigma 3.0	24.1 ms	20.6 ms	+17%
gaussian blur: sigma 7.0	39.6 ms	37.5 ms	+5% (noise)
gaussian blur: sigma 50.0	364.1 ms	379.1 ms	-4% (noise)

Simulation

Benchmark	Baseline	Optimized	Speedup
fast blur: sigma 3.0	254.3 ms	45.3 ms	×5.6
fast blur: sigma 7.0	254.8 ms	45.6 ms	×5.6
fast blur: sigma 50.0	260.6 ms	50.3 ms	×5.2
fast_blur	260.6 ms	50.3 ms	×5.2
gaussian blur: sigma 3.0	159.8 ms	137.1 ms	+17%
gaussian blur: sigma 7.0	270.1 ms	247.4 ms	+8%
gaussian blur: sigma 50.0	1.38 s	1.36 s	unchanged

Use (v + 0.5) as u32 instead of .round() (which calls libm roundf). Under CodSpeed simulation, roundf costs hundreds of instructions per call; the +0.5 truncation trick is equivalent for the non-negative range produced by gaussian blur and avoids the libm call entirely.
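
As a rough illustration of the trick in this commit message, the sketch below compares `(v + 0.5) as u32` with `.round()` on a few non-negative values. The function name `round_by_truncation` is hypothetical. The last lines show one input, the largest f32 below 0.5, where the addition itself rounds up to 1.0, so the two approaches are not bit-for-bit identical on every float.

```rust
/// Round a non-negative `f32` by adding 0.5 and truncating.
fn round_by_truncation(v: f32) -> u32 {
    (v + 0.5) as u32
}

fn main() {
    // Typical values agree with `f32::round` (ties away from zero).
    for v in [0.0_f32, 0.4, 0.5, 1.5, 2.5, 254.49, 254.5, 255.0] {
        assert_eq!(round_by_truncation(v), v.round() as u32);
    }

    // Edge case: the largest f32 below 0.5. The sum `v + 0.5` is not
    // representable and rounds to 1.0 before the truncation happens.
    let just_below_half = f32::from_bits(0.5_f32.to_bits() - 1);
    println!(
        "truncation: {}, round: {}",
        round_by_truncation(just_below_half),
        just_below_half.round() as u32
    );
}
```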

The box blur hot path used f32 accumulators for all pixel types. For u8 images (the dominant case), this meant every pixel went through to_f32/from_f32 conversions and software roundf. Replace with u32 integer arithmetic for u8 pixels, dispatched at compile time by pixel size and channel count (1-4). ~1.64x wall-clock speedup on all fast blur benchmarks. wip
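
One way to picture "dispatched at compile time by channel count" is a const generic parameter for the number of channels, with a small runtime match that selects the monomorphized version. The names `window_sums` and `dispatch` are hypothetical and the sketch only sums a window; it is not the PR's implementation.

```rust
/// Per-channel `u32` sums over interleaved `u8` pixels.
/// `CN` is the channel count, fixed at compile time.
fn window_sums<const CN: usize>(pixels: &[u8]) -> [u32; CN] {
    let mut sums = [0u32; CN];
    for px in pixels.chunks_exact(CN) {
        for c in 0..CN {
            sums[c] += px[c] as u32;
        }
    }
    sums
}

/// Pick the specialized version for 1 to 4 channels at runtime.
fn dispatch(pixels: &[u8], channels: usize) -> Vec<u32> {
    match channels {
        1 => window_sums::<1>(pixels).to_vec(),
        2 => window_sums::<2>(pixels).to_vec(),
        3 => window_sums::<3>(pixels).to_vec(),
        4 => window_sums::<4>(pixels).to_vec(),
        _ => panic!("unsupported channel count: {channels}"),
    }
}

fn main() {
    // Two RGB pixels: (1, 2, 3) and (4, 5, 6).
    println!("{:?}", dispatch(&[1, 2, 3, 4, 5, 6], 3));
}
```

With `CN` known at compile time, the inner loop has a fixed trip count and the accumulator array can live in registers.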

Precompute ceil(2^32 / kernel_size) once per pass, then use a multiply-shift to normalize accumulators instead of integer division.
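
The reciprocal trick can be sketched as follows. The function names `reciprocal` and `normalize` are hypothetical, and the sketch computes a plain floor division; the PR may fold rounding into the same step. The result matches integer division as long as `sum * (recip * kernel_size - 2^32)` stays below 2^32, which holds for sums of u8 values over the kernel sizes checked here.

```rust
/// ceil(2^32 / kernel_size), computed once per pass.
fn reciprocal(kernel_size: u32) -> u64 {
    assert!(kernel_size > 0);
    ((1u64 << 32) + u64::from(kernel_size) - 1) / u64::from(kernel_size)
}

/// floor(sum / kernel_size) via one multiply and one shift.
fn normalize(sum: u32, recip: u64) -> u32 {
    ((u64::from(sum) * recip) >> 32) as u32
}

fn main() {
    // Check against real division for every sum a u8 window can produce.
    for kernel_size in [1u32, 3, 7, 15, 301] {
        let recip = reciprocal(kernel_size);
        for sum in 0..=255 * kernel_size {
            assert_eq!(normalize(sum, recip), sum / kernel_size);
        }
    }
    println!("multiply-shift matches integer division");
}
```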

Introduce fast_round_f32 that delegates to hardware rounding (roundss/frintn) on SSE 4.1 and aarch64, and falls back to the mantissa snapping trick ((x + 2^23) - 2^23) elsewhere to avoid the costly libm roundf call.

Clarify that this function only works for positive (non-negative) inputs and must not be used for signed integer pixel types.
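
A small sketch of the mantissa snapping fallback and its input range, using the hypothetical name `snap_round`. Adding 2^23 pushes the value into a range where consecutive f32 values are exactly 1 apart, so the addition itself performs the rounding. The last assertion shows why the input must stay below 2^23.

```rust
/// Round a non-negative `f32` below 2^23 to an integer value.
fn snap_round(x: f32) -> f32 {
    const MAGIC: f32 = (1u32 << 23) as f32; // 8_388_608.0
    (x + MAGIC) - MAGIC
}

fn main() {
    assert_eq!(snap_round(0.4), 0.0);
    assert_eq!(snap_round(1.6), 2.0);
    assert_eq!(snap_round(200.25), 200.0);

    // Out of range: 2^23 + 1 is already an integer, but the sum lands
    // where f32 spacing is 2, so the result is off by one.
    assert_eq!(snap_round(8_388_609.0), 8_388_608.0);
    println!("ok");
}
```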

- Move BlurAccum, U8Weight, and rounding_saturating_mul into the sealed module alongside PrimitiveSealed
- Add BlurAccum as a supertrait of Primitive, removing the need for explicit where bounds and #[expect(private_bounds)] on fast_blur
- Implement BlurAccum for all Primitive types (u8 with u32 accumulators, all others with f32)
- Rename acc_zero/acc_scale/acc_to_store to ZERO/scale/to_store
- Simplify rounding_saturating_mul bounds to just T: Primitive

awxkee · 2026-03-12T23:43:05Z

src/math/utils.rs

+#[inline(always)]
+pub(crate) fn fast_round_positive_f32(x: f32) -> f32 {
+    const MAGIC: f32 = (1u32 << 23) as f32; // 8_388_608.0
+    (x + MAGIC) - MAGIC


We typically want rounding to nearest everywhere, and this won't round '0.5' to the nearest. This is relatively important: if we're unlucky and the alpha channel is blurred to 254.5, this will turn an opaque image into a transparent one, which isn't good at all.

This behaviour can be broken by adding EPS to x.
In a perfect world we could do f32::from_bits(x.to_bits() + 1), but the world is not perfect: floating-point and integer operations use different ports, so this addition will cost almost twice as much once the data transfer between units is accounted for.
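
To make the bit-increment idea concrete, here is a hedged sketch with the hypothetical names `snap_round` and `snap_round_nudged`. It bumps a finite, non-negative input up by one ULP before snapping, so exact X.5 ties land just above the tie and round up. As the comment itself notes, this has not been checked exhaustively, and the sketch makes no claim about inputs other than the ones shown.

```rust
/// Mantissa snapping; exact ties go to the even neighbour.
fn snap_round(x: f32) -> f32 {
    const MAGIC: f32 = (1u32 << 23) as f32;
    (x + MAGIC) - MAGIC
}

/// Same, but nudge the input up by one ULP first.
/// Assumes `x` is finite and non-negative.
fn snap_round_nudged(x: f32) -> f32 {
    let nudged = f32::from_bits(x.to_bits() + 1);
    snap_round(nudged)
}

fn main() {
    for x in [0.5_f32, 1.2, 1.5, 2.5, 254.5] {
        println!("{x}: plain {} nudged {}", snap_round(x), snap_round_nudged(x));
    }
}
```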

This also doesn't round negatives properly because this MAGIC is wrong; it should be:

const MAGIC: f32 = ((1u32 << 23) + (1u32 << 22)) as f32;
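
The difference between the two constants can be seen on a single negative input. In this sketch (function names are hypothetical), adding 2^23 to a small negative number lands just below 2^23, where f32 spacing is 0.5, so the result snaps to a multiple of 0.5 instead of an integer. With 2^23 + 2^22 the sum stays in the range where spacing is exactly 1.

```rust
/// Snapping with MAGIC = 2^23.
fn snap_2_23(x: f32) -> f32 {
    const MAGIC: f32 = (1u32 << 23) as f32;
    (x + MAGIC) - MAGIC
}

/// Snapping with MAGIC = 2^23 + 2^22, as suggested in the review.
fn snap_1_5x(x: f32) -> f32 {
    const MAGIC: f32 = ((1u32 << 23) + (1u32 << 22)) as f32;
    (x + MAGIC) - MAGIC
}

fn main() {
    // Negative input: only the larger constant yields an integer.
    assert_eq!(snap_2_23(-1.3), -1.5);
    assert_eq!(snap_1_5x(-1.3), -1.0);

    // Small positive inputs behave the same with both constants.
    assert_eq!(snap_2_23(1.3), 1.0);
    assert_eq!(snap_1_5x(1.3), 1.0);
    println!("ok");
}
```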

Floating-point numbers round the results of addition (and other operations) using round-ties-to-even. So all numbers of the form X.5 (where X is a positive or negative integer, or zero) will be rounded to the nearest even integer. There's no way around that. Even the "correct" magic number has that issue.

So I would say that this function is implemented as correctly as can be.
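
The ties-to-even behaviour discussed here can be observed directly. The sketch below (with the hypothetical name `snap_round`) prints the snapped result next to `f32::round`, which rounds ties away from zero, for a few X.5 inputs including the 254.5 alpha example.

```rust
/// Mantissa snapping for non-negative inputs below 2^23.
fn snap_round(x: f32) -> f32 {
    const MAGIC: f32 = (1u32 << 23) as f32;
    (x + MAGIC) - MAGIC
}

fn main() {
    // Exact ties: the snap goes to the even neighbour,
    // while `round` goes away from zero.
    for x in [0.5_f32, 1.5, 2.5, 3.5, 254.5] {
        println!("{x}: snap {} round {}", snap_round(x), x.round());
    }
}
```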

I meant something like this one

This should also break ties correctly almost everywhere this method can work (it can't work on the whole range), because by the definition of floats that should be enough to break the ties. However, I never performed an exhaustive check.

The part about the changed magic only concerns "negatives", because the mantissa is wrong; see here

RunDevelopment

Amazing stuff @art049!

Before I go into details about the code itself, I'd like to talk about codspeed. Codspeed integration should be a separate PR, so the maintainers of image can discuss it without that blocking your improvements to fast_blur. So please revert the changes to Cargo.toml, README.md, and .github/workflows/codspeed.yml.

I only have a few nits about the code. Please see my comments.

I also want to say that I love the way you documented all the non-trivial fixed-point math tricks. It's easy to understand and verify.

RunDevelopment · 2026-03-16T11:45:53Z

src/imageops/fast_blur.rs

-        let mut weight1: f32 = 0.;
-        let mut weight2: f32 = 0.;
-        let mut weight3: f32 = 0.;
+        let mut sums = [P::ZERO; CN];


P::ZERO is highly confusing. Took me about 5 minutes to realize that u8::ZERO == 0_u32. Not obvious that this is BlurAccum::ZERO.

Maybe we could rename it to EMPTY_ACCUMULATOR? Or maybe even restructure the trait like I suggested below?

RunDevelopment · 2026-03-16T12:15:16Z

src/traits.rs

+    /// Accumulator abstraction for box blur.
+    /// `u8` uses `u32` integer accumulators; other types use `f32`.
+    pub trait BlurAccum: Copy + Sized {
+        type Acc: Copy + Add<Output = Self::Acc> + AddAssign + Sub<Output = Self::Acc> + SubAssign;
+        type Weight: Copy;
+
+        const ZERO: Self::Acc;
+
+        fn to_acc(self) -> Self::Acc;
+        fn scale(acc: Self::Acc, count: usize) -> Self::Acc;
+        fn make_weight(kernel_size: usize) -> Self::Weight;
+        fn to_store(acc: Self::Acc, weight: Self::Weight) -> Self;
+    }


This might be a stupid suggestion, but I feel like this trait is the wrong way around.

This trait is implemented for primitives, but the primitives are not the blur accumulators. Self::Acc is. So I think it would be more natural to make the accumulator one trait (separate from Primitive) and give each primitive an associated type like this:

pub trait WithBlurAcc { // sealed supertrait of `Primitive`
    type BlurAcc: BlurAccumulator<Self>;
}

pub(crate) trait BlurAccumulator<T>:
    Copy + Sized + Add<Output = Self> + AddAssign + Sub<Output = Self> + SubAssign
{
    const ZERO: Self;
    fn from_primitive(value: T) -> Self;
    fn scale(self, count: usize) -> Self;

    type Weight: Copy;
    fn create_weight(self, kernel_size: usize) -> Self::Weight;
    fn to_store(self, weight: Self::Weight) -> T;
}

// general implementation
impl<T: Primitive> BlurAccumulator<T> for f32 { ... }
// optimized implementation for `u8`
impl BlurAccumulator<u8> for u32 { ... }

// pick the right implementation
impl WithBlurAcc for u8 { type BlurAcc = u32; }
impl WithBlurAcc for u16, u32, u64, usize, i8, i16, i32, i64, isize, f32, f64 { type BlurAcc = f32; }

This might be cleaner, since trait BlurAccumulator<T> and its implementations can live inside fast_blur.rs. This will keep all blurring logic local to one file.
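
To check that this shape compiles and dispatches as intended, here is a reduced, self-contained sketch. It is an assumption-laden simplification: `Sample` is a hypothetical stand-in for image's `Primitive`, `scale` is omitted, `create_weight` is an associated function rather than a method, the weights are simplistic, and `window_mean` is an invented caller.

```rust
use std::ops::{Add, AddAssign, Sub, SubAssign};

/// Hypothetical stand-in for `Primitive`.
trait Sample: Copy {
    fn to_f32(self) -> f32;
    fn from_f32(v: f32) -> Self;
}
impl Sample for u8 {
    fn to_f32(self) -> f32 { f32::from(self) }
    fn from_f32(v: f32) -> Self { v.round().clamp(0.0, 255.0) as u8 }
}
impl Sample for u16 {
    fn to_f32(self) -> f32 { f32::from(self) }
    fn from_f32(v: f32) -> Self { v.round().clamp(0.0, 65535.0) as u16 }
}

/// The accumulator is the implementor; `T` is the pixel primitive.
trait BlurAccumulator<T>:
    Copy + Add<Output = Self> + AddAssign + Sub<Output = Self> + SubAssign
{
    const ZERO: Self;
    type Weight: Copy;
    fn from_primitive(value: T) -> Self;
    fn create_weight(kernel_size: usize) -> Self::Weight;
    fn to_store(self, weight: Self::Weight) -> T;
}

/// Each primitive names the accumulator it wants.
trait WithBlurAcc: Sized {
    type BlurAcc: BlurAccumulator<Self>;
}

// General implementation: accumulate in f32.
impl<T: Sample> BlurAccumulator<T> for f32 {
    const ZERO: Self = 0.0;
    type Weight = f32;
    fn from_primitive(value: T) -> Self { value.to_f32() }
    fn create_weight(kernel_size: usize) -> f32 { 1.0 / kernel_size as f32 }
    fn to_store(self, weight: f32) -> T { T::from_f32(self * weight) }
}

// Optimized implementation for u8: accumulate in u32.
impl BlurAccumulator<u8> for u32 {
    const ZERO: Self = 0;
    type Weight = u32; // here simply the kernel size
    fn from_primitive(value: u8) -> Self { u32::from(value) }
    fn create_weight(kernel_size: usize) -> u32 { kernel_size as u32 }
    fn to_store(self, weight: u32) -> u8 { ((self + weight / 2) / weight).min(255) as u8 }
}

impl WithBlurAcc for u8 { type BlurAcc = u32; }
impl WithBlurAcc for u16 { type BlurAcc = f32; }

/// Mean of a non-empty window, using the accumulator chosen by `T`.
fn window_mean<T: WithBlurAcc + Copy>(window: &[T]) -> T {
    assert!(!window.is_empty());
    let mut acc = <T::BlurAcc as BlurAccumulator<T>>::ZERO;
    for &v in window {
        acc += <T::BlurAcc as BlurAccumulator<T>>::from_primitive(v);
    }
    let weight = <T::BlurAcc as BlurAccumulator<T>>::create_weight(window.len());
    acc.to_store(weight)
}

fn main() {
    println!("{}", window_mean(&[10u8, 20, 31])); // u32 path
    println!("{}", window_mean(&[100u16, 200, 301])); // f32 path
}
```

In this arrangement `<T::BlurAcc as BlurAccumulator<T>>::ZERO` names the accumulator explicitly, which addresses the confusion about `P::ZERO` raised earlier in the review.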

(@197g ping for thoughts)

RunDevelopment · 2026-03-16T12:23:43Z

src/imageops/fast_blur.rs

 #[inline]
 #[allow(clippy::manual_clamp)]


These attributes are left over from fn rounding_saturating_mul and unnecessary. Please remove them.

codspeed-hq bot and others added 7 commits March 12, 2026 22:56

Add CodSpeed continuous performance benchmarking

fa236f7

Replace integer division with reciprocal multiplication in u8 blur

4d250d7

Precompute ceil(2^32 / kernel_size) once per pass, then use a multiply-shift to normalize accumulators instead of integer division.

Use platform-aware fast rounding for FloatNearest::to_u8/to_u16

83569f4

Introduce fast_round_f32 that delegates to hardware rounding (roundss/frintn) on SSE 4.1 and aarch64, and falls back to the mantissa snapping trick ((x + 2^23) - 2^23) elsewhere to avoid the costly libm roundf call.

Rename fast_round_f32 to fast_round_positive_f32

853c715

Clarify that this function only works for positive (non-negative) inputs and must not be used for signed integer pixel types.

awxkee reviewed Mar 12, 2026

Fix rustfmt and clippy CI failures

52b0747

RunDevelopment reviewed Mar 16, 2026

RunDevelopment mentioned this pull request Mar 16, 2026

Add NearestFrom for faster fast_blur #2868

Open

art049 wants to merge 8 commits into image-rs:main from art049:perf/faster-blur
