linalg/wasm: WASM SIMD kernel kit (4x1, 8x1, 16x1, 8x8 + per-M dispatcher)#2

Open
czoli1976 wants to merge 4 commits into main from add-wasm-f32-full-kernel-kit

czoli1976 commented Apr 28, 2026

Summary

Canonical fork-internal PR for the WASM SIMD kernel work in Vonage's DFN3 build chain. The upstream PR to sonos/tract will be opened from this same branch (add-wasm-f32-full-kernel-kit), per kali's preference for a single PR (sonos/tract#2161).

PR #1 (scoped 4x1 only) is closed as superseded.

What this contributes

Provides MM, MV, and a reasonable choice of tile sizes for linalg/src/wasm.rs:

  • MV (nr=1): wasm_f32_4x1, wasm_f32_8x1, wasm_f32_16x1
  • MM (nr>1): wasm_f32_8x8 (alongside existing wasm_f32_4x4)

Selector wiring in wasm::plug():

ops.mmm_f32 = Box::new(|_m, _k, _n| wasm_f32_8x8.mmm());
ops.mmv_f32 = Box::new(|m, _k| match m.unwrap_or(0) {
    0..=7   => wasm_f32_4x1.mmm(),
    8..=15  => wasm_f32_8x1.mmm(),
    _       => wasm_f32_16x1.mmm(),
});
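
The per-M dispatch above can be sketched in isolation. This is a minimal illustration only: `MvKernel` and `pick_mv_kernel` are hypothetical names standing in for the `wasm_f32_*.mmm()` handles, and the sketch assumes nothing about tract beyond the match arms shown in the closure.

```rust
/// Hypothetical stand-in for the kernel handles; the real selector returns
/// `wasm_f32_*.mmm()` objects rather than an enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MvKernel {
    F32x4x1,
    F32x8x1,
    F32x16x1,
}

/// Mirrors the match in the `mmv_f32` closure: an unknown M (`None`) is
/// treated as 0 and therefore falls into the smallest tile.
pub fn pick_mv_kernel(m: Option<usize>) -> MvKernel {
    match m.unwrap_or(0) {
        0..=7 => MvKernel::F32x4x1,
        8..=15 => MvKernel::F32x8x1,
        _ => MvKernel::F32x16x1,
    }
}
```

Note the boundary behaviour: M = 7 still selects the 4x1 tile, M = 8 switches to 8x1, and anything from 16 upward selects 16x1.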

Cumulative impact (DFN3 encoder + erb_dec + df_dec, M1 Pro, Node 20, 27.29s clip)

  Layer                                   RTF      Δ vs prior   Δ vs Mezon
  Mezon baseline (no +simd128)            0.1290
  +simd128 + existing wasm_f32_4x4 only   0.108    -16.5%       -16.5%
  + wasm_f32_4x1                          0.0902   -16%         -30%
  + wasm_f32_8x1                          0.0696   -23%         -46%
  + wasm_f32_16x1                         0.0559   -20%         -57%
  + wasm_f32_8x8                          0.0516   -7.7%        -60%

Audio output bit-identical to scalar reference (SHA256 match, 27.29s).

8x4 considered and dropped

A wasm_f32_8x4 variant was prototyped (commit 32b23be) and dropped (commit 9c45f8d) after a controlled A/B confirmed it's structurally dead code for DFN3 — every MM op has N ≥ 8, so the strategizer's max(nr*mr) always picks 8x8 over 8x4. Mean A/B delta was -1.22% across 5 alternating cycles (within noise).
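
To make the "largest tile wins" argument concrete, here is a minimal sketch of a max(nr*mr) tiebreak. It is not tract's `strategize()` implementation; `pick_largest_tile` and the `(mr, nr)` tuple representation are assumptions made purely for illustration.

```rust
/// Given candidate tiles as (mr, nr) pairs, return the one with the largest
/// mr * nr product. Illustrative only: the real strategizer weighs more
/// criteria than tile area.
pub fn pick_largest_tile(candidates: &[(usize, usize)]) -> Option<(usize, usize)> {
    candidates
        .iter()
        .copied()
        .max_by_key(|&(mr, nr)| mr * nr)
}
```

With the candidates (4, 4), (8, 4) and (8, 8), the products are 16, 32 and 64, so 8x8 is always chosen and an 8x4 kernel is never reached under this rule.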

Tests

MMMRustKernel! macro auto-generates 54 tests per kernel; full tract-linalg suite passes on wasm32-wasip1 + wasmtime + +simd128.

Companion

czoli1976/DeepFilterNet#1 — DFN3-side integration (tract 0.21 → 0.22.1 + [patch.crates-io] wiring at this branch).

Adds wasm_f32_4x1, a 4-row × 1-col WASM SIMD kernel that fills the gap
between WASM and other targets. Other archs (x86_64 fma, arm64simd,
arm64fp16, generic) all register nr=1 kernels and wire them through
mmv_f32; WASM was the lone outlier with only wasm_f32_4x4 registered,
so kernel_selection::strategize() was forced to pick the 4x4 kernel
for op.n.is_one() cases — wasting 75% of column-tile work per call.
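
The 75% figure follows from padding N up to a multiple of the tile width. A small sketch of that arithmetic, where `wasted_column_fraction` is a hypothetical helper and not part of tract:

```rust
/// Fraction of column-tile work spent on padding when an operand with `n`
/// columns is processed in tiles of width `nr`.
pub fn wasted_column_fraction(n: usize, nr: usize) -> f64 {
    assert!(n > 0 && nr > 0, "n and nr must be positive");
    // Round n up to the next multiple of nr.
    let padded = ((n + nr - 1) / nr) * nr;
    (padded - n) as f64 / padded as f64
}
```

For N = 1 and nr = 4, the padded width is 4 and three of the four columns are wasted, which is the 75% quoted above; with nr = 1 the waste is zero.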

The new kernel:
- Mirrors wasm_f32_4x4's FusedKerSpec match arms exactly
- Uses one f32x4 accumulator (4 rows × 1 col packed)
- Registered with ImplementationQuality::TargetOptimized so the
  strategizer's nr()==1 preference wins over wasm_f32_4x4 for N=1 ops
- mmv_f32 selector now points to wasm_f32_4x1 (mirrors the pattern
  set by other archs in tract-linalg/src/lib.rs::generic())
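
As a rough picture of what "one f32x4 accumulator" means, the following is a portable scalar emulation of a 4-row × 1-col inner loop. It is a sketch under stated assumptions, not the kernel in this PR: the name `gemv_4x1`, the use of a `[f32; 4]` array in place of a `v128`, and the packing layout (A stored as consecutive groups of 4 row values, one group per k) are all illustrative.

```rust
/// Computes y[r] = sum over k of a_panel[4 * k + r] * b[k] for r in 0..4.
/// `acc` plays the role of the single 4-lane accumulator.
pub fn gemv_4x1(a_panel: &[f32], b: &[f32]) -> [f32; 4] {
    assert_eq!(a_panel.len(), 4 * b.len(), "A panel must hold 4 values per k");
    let mut acc = [0.0f32; 4];
    for (col, &bk) in a_panel.chunks_exact(4).zip(b) {
        // One "vector" multiply-add: all 4 rows are scaled by the same b[k].
        for lane in 0..4 {
            acc[lane] += col[lane] * bk;
        }
    }
    acc
}
```

For example, with `a_panel = [1, 2, 3, 4, 5, 6, 7, 8]` and `b = [1, 2]`, each output row is the first group plus twice the second, giving `[11, 14, 17, 20]`.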

Bit-equivalent output to scalar generic_f32_4x1 (verified end-to-end
on a streaming RNN model: same SHA256 across 1.3M audio samples
between simd128-only and simd128+gemv builds).

Auto-generated tests via the existing MMMRustKernel! macro provide
54 tests covering every FusedKerSpec arm + frame-level matmul +
property-based random matrix configurations, all passing on
wasm32-wasip1 + wasmtime.

Measured impact on a real streaming-audio model (DFN3 encoder, M1
Pro, Node 20): 17% RTF reduction (RTF 0.108 → 0.090, per-frame
1.077 ms → 0.901 ms), with bit-identical audio output.

Adds four kernels on top of the nr=1 GEMV kernel (parent commit):

  - wasm_f32_8x1   taller GEMV (2 v128 accumulators)
  - wasm_f32_16x1  tallest GEMV (4 v128 accumulators)
  - wasm_f32_8x4   taller MM (8 v128 accumulators)
  - wasm_f32_8x8   squarer MM (16 v128 accumulators, fills WASM
                   logical 16-register limit)
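
The accumulator counts listed above are consistent with each v128 holding 4 f32 rows of one output column. A short sketch of that relationship, assuming row-wise lane packing (inferred from the "4 rows × 1 col packed" description of the 4x1 kernel); `v128_accumulators` is a hypothetical helper:

```rust
/// Number of v128 accumulators for an mr x nr f32 tile, assuming each v128
/// carries 4 consecutive rows of a single column.
pub fn v128_accumulators(mr: usize, nr: usize) -> usize {
    assert!(mr % 4 == 0, "mr must be a multiple of the 4 f32 lanes in a v128");
    (mr / 4) * nr
}
```

Under this assumption 4x1, 8x1, 16x1, 8x4 and 8x8 need 1, 2, 4, 8 and 16 accumulators respectively, matching the figures in the list.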

Wires Ops::mmm_f32 to wasm_f32_8x8 and Ops::mmv_f32 to a per-M
dispatcher (4x1 for M<=7, 8x1 for M<=15, 16x1 otherwise).

DFN3 cumulative impact (M1 Pro, Node 20, 27.29s clip):
  +simd128 + 4x4 only:   RTF 0.108
  + 4x1 (parent commit): RTF 0.0902
  + 8x1:                 RTF 0.0696
  + 8x4 + 16x1:          RTF 0.0559
  + 8x8:                 RTF 0.0516

Net vs Mezon production baseline (RTF 0.1290): -60%, 2.5x faster.
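
The headline numbers can be re-derived from the two RTF values. A minimal sketch of the arithmetic, with hypothetical helper names:

```rust
/// Relative RTF change in percent; negative means faster.
pub fn rtf_change_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Speedup factor: how many times faster `after` is than `before`.
pub fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

Going from RTF 0.1290 to 0.0516 gives a change of about -60% and a speedup of about 2.5x, in line with the figures quoted above.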

Audio output bit-identical to scalar reference (SHA256 match across
1.3M samples / 27.29s of denoised audio). All five new kernels
register at TargetOptimized quality; 1508/1508 tests pass on
wasm32-wasip1 + wasmtime + simd128 (54 tests per new kernel via
auto-generated MMMRustKernel! macro).

Internal-fork branch documenting the full kit. The parent 4x1
branch (add-wasm-f32-4x1-gemv-kernel) is kept separately as the
upstream-PR-ready scope (cf. sonos#2161).

A/B benched with-8x4 vs no-8x4 on DFN3 (5 alternating cycles, 3 runs each):

  Cycle  with-8x4  no-8x4    Δ%
    1     0.0532   0.0516    -3.01%
    2     0.0536   0.0512    -4.48%
    3     0.0513   0.0529    +3.12%
    4     0.0526   0.0518    -1.52%
    5     0.0522   0.0521    -0.19%
                             ───────
                             mean: -1.22%

8x4 is structurally dead code for DFN3 because every MM op has N >= 8,
so the strategizer's max(nr*mr) tiebreaker always picks 8x8 (=64) over
8x4 (=32). Mean -1.22% delta is within thermal/system noise.
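
The Δ% column and its mean can be reproduced from the RTF pairs in the table. The sketch below assumes Δ% is computed as (no-8x4 − with-8x4) / with-8x4, which matches the tabulated values; the function names are hypothetical.

```rust
/// Per-cycle delta in percent, relative to the with-8x4 measurement.
pub fn delta_pct(with_8x4: f64, no_8x4: f64) -> f64 {
    (no_8x4 - with_8x4) / with_8x4 * 100.0
}

/// Arithmetic mean of the per-cycle deltas; `cycles` holds
/// (with_8x4, no_8x4) RTF pairs and must be non-empty.
pub fn mean_delta_pct(cycles: &[(f64, f64)]) -> f64 {
    assert!(!cycles.is_empty(), "need at least one cycle");
    let sum: f64 = cycles.iter().map(|&(w, n)| delta_pct(w, n)).sum();
    sum / cycles.len() as f64
}
```

Feeding in the five rows of the table yields a mean of roughly -1.22%, with individual cycles ranging from about -4.5% to +3.1%, which is why the difference is treated as noise.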

Tightens scope to the four kernels that pull their weight:
  - GEMV: wasm_f32_4x1, wasm_f32_8x1, wasm_f32_16x1
  - MM:   wasm_f32_8x8 (alongside existing wasm_f32_4x4)

Removes 268 lines (kernel function + doc + MMMRustKernel! registration).

czoli1976 changed the title from "linalg/wasm: full WASM SIMD kernel kit (4x1, 8x1, 16x1, 8x4, 8x8 + per-M dispatcher)" to "linalg/wasm: WASM SIMD kernel kit (4x1, 8x1, 16x1, 8x8 + per-M dispatcher)" on Apr 28, 2026

- Wrap each kernel body in `unsafe { ... }` (matches the pattern
  upstream main uses for `kernel_f32_4x4`); `use std::arch::wasm32::*;`
  stays outside the inner block.
- `cargo fmt` over the full file.

czoli1976 force-pushed the add-wasm-f32-full-kernel-kit branch from b82d1f0 to d925624 on April 28, 2026 10:30