Skip to content

perf: 5.5x faster fast_blur with u32 accumulators#2846

Open
art049 wants to merge 8 commits intoimage-rs:mainfrom
art049:perf/faster-blur
Open

perf: 5.5x faster fast_blur with u32 accumulators#2846
art049 wants to merge 8 commits intoimage-rs:mainfrom
art049:perf/faster-blur

Conversation

@art049
Copy link

@art049 art049 commented Mar 12, 2026

Following #2809
This was already reviewed a bit here

Changes

  • Use u32 integer accumulators for u8 fast blur — The box blur hot path used f32 accumulators for all pixel types. For u8 images (the dominant case), every pixel went through to_f32/from_f32 conversions and software roundf. Replaced with u32 integer arithmetic.
  • Replace roundf with fast truncation in FloatNearest::to_u8/to_u16 for non optimized architectures

Results

Walltime on x86 on a build without sse4.1
Simulation results from CodSpeed

Walltime

Benchmark Baseline Optimized Speedup
fast blur: sigma 3.0 41.8 ms 7.7 ms ×5.5
fast blur: sigma 7.0 42.1 ms 7.7 ms ×5.4
fast blur: sigma 50.0 42.5 ms 8.4 ms ×5.0
fast_blur 41.9 ms 8.5 ms ×4.9
gaussian blur: sigma 3.0 24.1 ms 20.6 ms +17%
gaussian blur: sigma 7.0 39.6 ms 37.5 ms +5% (noise)
gaussian blur: sigma 50.0 364.1 ms 379.1 ms -4% (noise)

Simulation

Benchmark Baseline Optimized Speedup
fast blur: sigma 3.0 254.3 ms 45.3 ms ×5.6
fast blur: sigma 7.0 254.8 ms 45.6 ms ×5.6
fast blur: sigma 50.0 260.6 ms 50.3 ms ×5.2
fast_blur 260.6 ms 50.3 ms ×5.2
gaussian blur: sigma 3.0 159.8 ms 137.1 ms +17%
gaussian blur: sigma 7.0 270.1 ms 247.4 ms +8%
gaussian blur: sigma 50.0 1.38 s 1.36 s unchanged

codspeed-hq bot and others added 7 commits March 12, 2026 22:56
Use (v + 0.5) as u32 instead of .round() (which calls libm roundf).
Under CodSpeed simulation, roundf costs hundreds of instructions per
call; the +0.5 truncation trick is equivalent for the non-negative
range produced by gaussian blur and avoids the libm call entirely.
The box blur hot path used f32 accumulators for all pixel types. For u8
images (the dominant case), this meant every pixel went through
to_f32/from_f32 conversions and software roundf. Replace with u32
integer arithmetic for u8 pixels, dispatched at compile time by pixel
size and channel count (1-4).

~1.64x wall-clock speedup on all fast blur benchmarks.

wip
Precompute ceil(2^32 / kernel_size) once per pass, then use a
multiply-shift to normalize accumulators instead of integer division.
Introduce fast_round_f32 that delegates to hardware rounding
(roundss/frintn) on SSE 4.1 and aarch64, and falls back to the mantissa
snapping trick ((x + 2^23) - 2^23) elsewhere to avoid the costly libm
roundf call.
Clarify that this function only works for positive (non-negative) inputs
and must not be used for signed integer pixel types.
- Move BlurAccum, U8Weight, and rounding_saturating_mul into the
  sealed module alongside PrimitiveSealed
- Add BlurAccum as a supertrait of Primitive, removing the need for
  explicit where bounds and #[expect(private_bounds)] on fast_blur
- Implement BlurAccum for all Primitive types (u8 with u32 accumulators,
  all others with f32)
- Rename acc_zero/acc_scale/acc_to_store to ZERO/scale/to_store
- Simplify rounding_saturating_mul bounds to just T: Primitive
#[inline(always)]
pub(crate) fn fast_round_positive_f32(x: f32) -> f32 {
const MAGIC: f32 = (1u32 << 23) as f32; // 8_388_608.0
(x + MAGIC) - MAGIC
Copy link
Contributor

@awxkee awxkee Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically want rounding to nearest everywhere, this won't round '0.5' to the nearest, this is relatively important if we're unlucky and if alpha channel is blurred to 254.5, then this will make opaque image transparent; this isn't good at all.

This behaviour can be broken by adding EPS to x.
In perfect world we could do (f32::from_bits(x.to_bits() + 1) but world is not perfect, so float number and integral uses different ports and this addition will cost almost twice more, accounting data transfer between units.

This doesn't round negatives properly because this MAGIC is wrong, it should be:

const MAGIC: f32 = ((1u32 << 23) + (1u32 << 22)) as f32;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Floating-point numbers round the results of addition (and other operations) using round ties even. So all integers of the form X.5 (X is a positive/negative integer or zero) will be rounded to the nearest even number. There's no way around that. Even the "correct" magic number has that issue.

So I would say that this function is implemented as correctly as can be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant something like this one

Copy link
Contributor

@awxkee awxkee Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As well this should correctly break ties almost everywhere (where this method can work, it can't work on the whole range ) because that's should be enough by float definition to break the ties. However, I never performed an exhaustive check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part about changed magic touches only "negatives" because mantissa is wrong, see here

Copy link
Member

@RunDevelopment RunDevelopment left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing stuff @art049!

Before I go into details about the code itself, I'd like to talk about codspeed. Codspeed integration should be a separate PR, so the maintainers of image can discuss it without that blocking your improvements to fast_blur. So please revert the changes to Cargo.toml, README.md, and .github/workflows/codspeed.yml.

I only have a few nits about the code. Please see my comments.

I also want to say that I love the way you documented all the non-trivial fixed-point math tricks. It's easy to understand and verify.

let mut weight1: f32 = 0.;
let mut weight2: f32 = 0.;
let mut weight3: f32 = 0.;
let mut sums = [P::ZERO; CN];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P::ZERO is highly confusing. Took me about 5 minutes to realize that u8::ZERO == 0_u32. Not obvious that this is BlurAccum::ZERO.

Maybe we could rename it to EMPTY_ACCUMULATOR? Or maybe even restructure the trait like I suggested below?

Comment on lines +109 to +121
/// Accumulator abstraction for box blur.
/// `u8` uses `u32` integer accumulators; other types use `f32`.
pub trait BlurAccum: Copy + Sized {
type Acc: Copy + Add<Output = Self::Acc> + AddAssign + Sub<Output = Self::Acc> + SubAssign;
type Weight: Copy;

const ZERO: Self::Acc;

fn to_acc(self) -> Self::Acc;
fn scale(acc: Self::Acc, count: usize) -> Self::Acc;
fn make_weight(kernel_size: usize) -> Self::Weight;
fn to_store(acc: Self::Acc, weight: Self::Weight) -> Self;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a stupid suggestion, but I feel like this trait is the wrong way around.

This trait is implemented for primitives, but the primitives are not the blur accumulators. Self::Acc is. So I think it would be more natural to make the accumulator one trait (separate from Primitive) and give each primitive an associated type like this:

pub trait WithBlurAcc { // sealed supertrait of `Primitive`
    type BlurAcc: BlurAccumulator<Self>;
}

pub(crate) trait BlurAccumulator<T>: Copy + Sized + Add<Output = Self> + AddAssign + Sub<Output = Self> + SubAssign {
    const ZERO: Self;
    fn from_primitive(value: T) -> Self;
    fn scale(self, count: usize) -> Self;

    type Weight: Copy;
    fn create_weight(self, kernel_size: usize) -> Self::Weight;
    fn to_store(self, weight: Self::Weight) -> T;
}

// general implementation
impl<T: Primtive> BlurAccumulator<T> for f32 { ... }
// optimized implementation for `u8`
impl BlurAccumulator<u8> for u32 { ... }

// pick the right implementation
impl WithBlurAcc for u8 { 
    type BlurAcc = u32; 
}
impl WithBlurAcc for u16, u32, u64, usize, i8, i16, i32, i64, isize, f32, f64 { 
    type BlurAcc = f32; 
}

This might be cleaner, since trait BlurAccumulator<T> and its implementations can live inside fast_blur.rs. This will keep all blurring logic local to one file.

(@197g ping for thoughts)

Comment on lines 138 to 139
#[inline]
#[allow(clippy::manual_clamp)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These attributes are left over from fn rounding_saturating_mul and unnecessary. Please remove them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants