perf: 5.5x faster fast_blur with u32 accumulators#2846
perf: 5.5x faster fast_blur with u32 accumulators#2846art049 wants to merge 8 commits intoimage-rs:mainfrom
fast_blur with u32 accumulators#2846Conversation
Use (v + 0.5) as u32 instead of .round() (which calls libm roundf). Under CodSpeed simulation, roundf costs hundreds of instructions per call; the +0.5 truncation trick is equivalent for the non-negative range produced by gaussian blur and avoids the libm call entirely.
The box blur hot path used f32 accumulators for all pixel types. For u8 images (the dominant case), this meant every pixel went through to_f32/from_f32 conversions and software roundf. Replace with u32 integer arithmetic for u8 pixels, dispatched at compile time by pixel size and channel count (1-4). ~1.64x wall-clock speedup on all fast blur benchmarks. wip
Precompute ceil(2^32 / kernel_size) once per pass, then use a multiply-shift to normalize accumulators instead of integer division.
Introduce fast_round_f32 that delegates to hardware rounding (roundss/frintn) on SSE 4.1 and aarch64, and falls back to the mantissa snapping trick ((x + 2^23) - 2^23) elsewhere to avoid the costly libm roundf call.
Clarify that this function only works for positive (non-negative) inputs and must not be used for signed integer pixel types.
- Move BlurAccum, U8Weight, and rounding_saturating_mul into the sealed module alongside PrimitiveSealed - Add BlurAccum as a supertrait of Primitive, removing the need for explicit where bounds and #[expect(private_bounds)] on fast_blur - Implement BlurAccum for all Primitive types (u8 with u32 accumulators, all others with f32) - Rename acc_zero/acc_scale/acc_to_store to ZERO/scale/to_store - Simplify rounding_saturating_mul bounds to just T: Primitive
| #[inline(always)] | ||
| pub(crate) fn fast_round_positive_f32(x: f32) -> f32 { | ||
| const MAGIC: f32 = (1u32 << 23) as f32; // 8_388_608.0 | ||
| (x + MAGIC) - MAGIC |
There was a problem hiding this comment.
We typically want rounding to nearest everywhere, this won't round '0.5' to the nearest, this is relatively important if we're unlucky and if alpha channel is blurred to 254.5, then this will make opaque image transparent; this isn't good at all.
This behaviour can be broken by adding EPS to x.
In perfect world we could do (f32::from_bits(x.to_bits() + 1) but world is not perfect, so float number and integral uses different ports and this addition will cost almost twice more, accounting data transfer between units.
This doesn't round negatives properly because this MAGIC is wrong, it should be:
const MAGIC: f32 = ((1u32 << 23) + (1u32 << 22)) as f32;
There was a problem hiding this comment.
Floating-point numbers round the results of addition (and other operations) using round ties even. So all integers of the form X.5 (X is a positive/negative integer or zero) will be rounded to the nearest even number. There's no way around that. Even the "correct" magic number has that issue.
So I would say that this function is implemented as correctly as can be.
There was a problem hiding this comment.
As well this should correctly break ties almost everywhere (where this method can work, it can't work on the whole range ) because that's should be enough by float definition to break the ties. However, I never performed an exhaustive check.
There was a problem hiding this comment.
Part about changed magic touches only "negatives" because mantissa is wrong, see here
RunDevelopment
left a comment
There was a problem hiding this comment.
Amazing stuff @art049!
Before I go into details about the code itself, I'd like to talk about codspeed. Codspeed integration should be a separate PR, so the maintainers of image can discuss it without that blocking your improvements to fast_blur. So please revert the changes to Cargo.toml, README.md, and .github/workflows/codspeed.yml.
I only have a few nits about the code. Please see my comments.
I also want to say that I love the way you documented all the non-trivial fixed-point math tricks. It's easy to understand and verify.
| let mut weight1: f32 = 0.; | ||
| let mut weight2: f32 = 0.; | ||
| let mut weight3: f32 = 0.; | ||
| let mut sums = [P::ZERO; CN]; |
There was a problem hiding this comment.
P::ZERO is highly confusing. Took me about 5 minutes to realize that u8::ZERO == 0_u32. Not obvious that this is BlurAccum::ZERO.
Maybe we could rename it to EMPTY_ACCUMULATOR? Or maybe even restructure the trait like I suggested below?
| /// Accumulator abstraction for box blur. | ||
| /// `u8` uses `u32` integer accumulators; other types use `f32`. | ||
| pub trait BlurAccum: Copy + Sized { | ||
| type Acc: Copy + Add<Output = Self::Acc> + AddAssign + Sub<Output = Self::Acc> + SubAssign; | ||
| type Weight: Copy; | ||
|
|
||
| const ZERO: Self::Acc; | ||
|
|
||
| fn to_acc(self) -> Self::Acc; | ||
| fn scale(acc: Self::Acc, count: usize) -> Self::Acc; | ||
| fn make_weight(kernel_size: usize) -> Self::Weight; | ||
| fn to_store(acc: Self::Acc, weight: Self::Weight) -> Self; | ||
| } |
There was a problem hiding this comment.
This might be a stupid suggestion, but I feel like this trait is the wrong way around.
This trait is implemented for primitives, but the primitives are not the blur accumulators. Self::Acc is. So I think it would be more natural to make the accumulator one trait (separate from Primitive) and give each primitive an associated type like this:
pub trait WithBlurAcc { // sealed supertrait of `Primitive`
type BlurAcc: BlurAccumulator<Self>;
}
pub(crate) trait BlurAccumulator<T>: Copy + Sized + Add<Output = Self> + AddAssign + Sub<Output = Self> + SubAssign {
const ZERO: Self;
fn from_primitive(value: T) -> Self;
fn scale(self, count: usize) -> Self;
type Weight: Copy;
fn create_weight(self, kernel_size: usize) -> Self::Weight;
fn to_store(self, weight: Self::Weight) -> T;
}
// general implementation
impl<T: Primtive> BlurAccumulator<T> for f32 { ... }
// optimized implementation for `u8`
impl BlurAccumulator<u8> for u32 { ... }
// pick the right implementation
impl WithBlurAcc for u8 {
type BlurAcc = u32;
}
impl WithBlurAcc for u16, u32, u64, usize, i8, i16, i32, i64, isize, f32, f64 {
type BlurAcc = f32;
}This might be cleaner, since trait BlurAccumulator<T> and its implementations can live inside fast_blur.rs. This will keep all blurring logic local to one file.
(@197g ping for thoughts)
| #[inline] | ||
| #[allow(clippy::manual_clamp)] |
There was a problem hiding this comment.
These attributes are left over from fn rounding_saturating_mul and unnecessary. Please remove them.
Following #2809
This was already reviewed a bit here
Changes
Results
Walltime on x86 on a build without sse4.1
Simulation results from CodSpeed
Walltime
Simulation