linalg/wasm: WASM SIMD kernel kit (4x1, 8x1, 16x1, 8x8 + per-M dispatcher) by czoli1976 · Pull Request #2 · czoli1976/tract

czoli1976 · 2026-04-28T06:21:45Z

Summary

Canonical fork-internal PR for the WASM SIMD kernel work in Vonage's DFN3 build chain. The upstream PR to sonos/tract will be opened from this same branch (add-wasm-f32-full-kernel-kit), per kali's preference for a single PR (sonos/tract#2161).

PR #1 (scoped 4x1 only) is closed as superseded.

What this contributes

Provides MM, MV, and a reasonable choice of tile sizes for linalg/src/wasm.rs:

MV (nr=1) — wasm_f32_4x1, wasm_f32_8x1, wasm_f32_16x1
MM (nr>1) — wasm_f32_8x8 (alongside existing wasm_f32_4x4)

Selector wiring in wasm::plug():

ops.mmm_f32 = Box::new(|_m, _k, _n| wasm_f32_8x8.mmm());
ops.mmv_f32 = Box::new(|m, _k| match m.unwrap_or(0) {
    0..=7   => wasm_f32_4x1.mmm(),
    8..=15  => wasm_f32_8x1.mmm(),
    _       => wasm_f32_16x1.mmm(),
});
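
The thresholds in the `mmv_f32` closure can be checked in isolation. The following is a minimal sketch, not code from the branch: the `MvKernel` enum and `pick_mv_kernel` function are hypothetical names that stand in for the kernel objects, and only the `match` arms are taken from the snippet above.

```rust
/// Hypothetical stand-in for the three GEMV kernels named in this PR.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum MvKernel {
    Wasm4x1,
    Wasm8x1,
    Wasm16x1,
}

/// Same arms as the `mmv_f32` closure: an unknown M (`None`) is treated
/// as 0 and therefore falls into the 4x1 arm.
pub fn pick_mv_kernel(m: Option<usize>) -> MvKernel {
    match m.unwrap_or(0) {
        0..=7 => MvKernel::Wasm4x1,
        8..=15 => MvKernel::Wasm8x1,
        _ => MvKernel::Wasm16x1,
    }
}
```

Note that `unwrap_or(0)` sends ops with an unknown M to the smallest kernel, which is the conservative choice if M turns out to be small.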

Cumulative impact (DFN3 encoder + erb_dec + df_dec, M1 Pro, Node 20, 27.29s clip)

| Layer | RTF | Δ vs prior | Δ vs Mezon |
|---|---|---|---|
| Mezon baseline (no `+simd128`) | 0.1290 | n/a | n/a |
| `+simd128` + existing `wasm_f32_4x4` only | 0.108 | -16.5% | -16.5% |
| + `wasm_f32_4x1` | 0.0902 | -16% | -30% |
| + `wasm_f32_8x1` | 0.0696 | -23% | -46% |
| + `wasm_f32_16x1` | 0.0559 | -20% | -57% |
| + `wasm_f32_8x8` | 0.0516 | -7.7% | -60% |

Audio output bit-identical to scalar reference (SHA256 match, 27.29s).

8x4 considered and dropped

A wasm_f32_8x4 variant was prototyped (commit 32b23be) and dropped (commit 9c45f8d) after a controlled A/B confirmed it is structurally dead code for DFN3: every MM op has N ≥ 8, so the strategizer's max(nr*mr) tiebreaker always picks 8x8 (64) over 8x4 (32). Mean A/B delta was -1.22% across 5 alternating cycles (within noise).
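
To illustrate why 8x4 never wins, here is a simplified model of an area-based tiebreaker. It is a sketch under assumptions, not tract's actual `strategize()` logic: the `Tile` type and `pick_by_area` function are hypothetical, and the rule that a tile is only eligible when its `nr` fits within N is an assumption made for this example.

```rust
/// Hypothetical tile shape: `mr` rows by `nr` columns.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct Tile {
    pub mr: usize,
    pub nr: usize,
}

/// Among tiles whose `nr` does not exceed `n` (assumed eligibility rule),
/// return the one with the largest `mr * nr` product.
pub fn pick_by_area(candidates: &[Tile], n: usize) -> Option<Tile> {
    candidates
        .iter()
        .copied()
        .filter(|t| t.nr <= n)
        .max_by_key(|t| t.mr * t.nr)
}
```

Under this model, 8x4 could only be selected for 4 ≤ N < 8, a range the description above says DFN3 never produces for MM ops.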

Tests

The MMMRustKernel! macro auto-generates 54 tests per kernel; the full tract-linalg suite passes on wasm32-wasip1 under wasmtime with +simd128 enabled.

Companion

czoli1976/DeepFilterNet#1 — DFN3-side integration (tract 0.21 → 0.22.1 + [patch.crates-io] wiring at this branch).

Adds wasm_f32_4x1, a 4-row × 1-col WASM SIMD kernel that fills the gap between WASM and other targets.

Other archs (x86_64 fma, arm64simd, arm64fp16, generic) all register nr=1 kernels and wire them through mmv_f32; WASM was the lone outlier with only wasm_f32_4x4 registered, so kernel_selection::strategize() was forced to pick the 4x4 kernel for op.n.is_one() cases, wasting 75% of column-tile work per call.

The new kernel:
- Mirrors wasm_f32_4x4's FusedKerSpec match arms exactly
- Uses one f32x4 accumulator (4 rows × 1 col packed)
- Registered with ImplementationQuality::TargetOptimized so the strategizer's nr()==1 preference wins over wasm_f32_4x4 for N=1 ops
- mmv_f32 selector now points to wasm_f32_4x1 (mirrors the pattern set by other archs in tract-linalg/src/lib.rs::generic())

Bit-equivalent output to scalar generic_f32_4x1 (verified end-to-end on a streaming RNN model: same SHA256 across 1.3M audio samples between simd128-only and simd128+gemv builds).

Auto-generated tests via the existing MMMRustKernel! macro provide 54 tests covering every FusedKerSpec arm + frame-level matmul + property-based random matrix configurations, all passing on wasm32-wasip1 + wasmtime.

Measured impact on a real streaming-audio model (DFN3 encoder, M1 Pro, Node 20): -17% RTF reduction (RTF 0.108 → 0.090, per-frame 1.077 ms → 0.901 ms), with bit-identical audio output.
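
As a rough illustration of the shape of the computation, here is a scalar model of a 4-row × 1-col micro-kernel and of the 75% figure. It is a sketch only: it uses plain arrays instead of `std::arch::wasm32` intrinsics so it runs on any target, the packed layout of `a` (four row values per k step) is an assumption, and both function names are hypothetical.

```rust
/// Scalar model of a 4x1 tile: `acc` stands in for one f32x4 accumulator.
/// `a` is assumed to be packed as 4 consecutive row values per k step;
/// `b` is the input vector of length k.
pub fn gemv_4x1(a: &[f32], b: &[f32]) -> [f32; 4] {
    assert_eq!(a.len(), 4 * b.len());
    let mut acc = [0.0f32; 4];
    for (col, &bk) in a.chunks_exact(4).zip(b) {
        // One multiply-add per lane, i.e. one SIMD multiply-add per k step.
        for (acc_i, &a_ik) in acc.iter_mut().zip(col) {
            *acc_i += a_ik * bk;
        }
    }
    acc
}

/// Fraction of computed columns that are discarded when an op with `n`
/// columns is covered by tiles that are `nr` columns wide.
pub fn wasted_column_fraction(nr: usize, n: usize) -> f64 {
    let computed = n.div_ceil(nr) * nr;
    (computed - n) as f64 / computed as f64
}
```

For N=1, a tile with nr=4 computes four columns and keeps one, which is where the 75% comes from; a tile with nr=1 wastes none.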

Adds four kernels on top of the nr=1 GEMV kernel (parent commit):
- wasm_f32_8x1   taller GEMV (2 v128 accumulators)
- wasm_f32_16x1  tallest GEMV (4 v128 accumulators)
- wasm_f32_8x4   taller MM (8 v128 accumulators)
- wasm_f32_8x8   squarer MM (16 v128 accumulators, fills WASM logical 16-register limit)

Wires Ops::mmm_f32 to wasm_f32_8x8 and Ops::mmv_f32 to a per-M dispatcher (4x1 for M<=7, 8x1 for M<=15, 16x1 otherwise).

DFN3 cumulative impact (M1 Pro, Node 20, 27.29s clip):
  +simd128 + 4x4 only:    RTF 0.108
  + 4x1 (parent commit):  RTF 0.0902
  + 8x1:                  RTF 0.0696
  + 8x4 + 16x1:           RTF 0.0559
  + 8x8:                  RTF 0.0516

Net vs Mezon production baseline (RTF 0.1290): -60%, 2.5x faster.

Audio output bit-identical to scalar reference (SHA256 match across 1.3M samples / 27.29s of denoised audio).

All five new kernels register at TargetOptimized quality; 1508/1508 tests pass on wasm32-wasip1 + wasmtime + simd128 (54 tests per new kernel via auto-generated MMMRustKernel! macro).

Internal-fork branch documenting the full kit. The parent 4x1 branch (add-wasm-f32-4x1-gemv-kernel) is kept separately as the upstream-PR-ready scope (cf. sonos#2161).
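
The accumulator counts quoted for each kernel follow from the tile shape, since a v128 holds four f32 lanes. A small sketch of that arithmetic, assuming each accumulator covers 4 rows of one column (as the 4x1 commit describes); the function name is hypothetical:

```rust
/// A 128-bit vector holds four 32-bit floats.
pub const LANES_F32: usize = 4;

/// Number of v128 accumulators needed for an `mr` x `nr` f32 tile,
/// assuming each accumulator holds 4 rows of a single column.
pub fn v128_accumulators(mr: usize, nr: usize) -> usize {
    assert!(mr % LANES_F32 == 0, "mr must be a multiple of 4");
    (mr / LANES_F32) * nr
}
```

With this formula, 8x8 is the largest of the listed shapes that stays within the 16-accumulator budget mentioned in the commit message.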

A/B benched with-8x4 vs no-8x4 on DFN3 (5 alternating cycles, 3 runs each):

  Cycle   with-8x4   no-8x4    Δ%
  1       0.0532     0.0516    -3.01%
  2       0.0536     0.0512    -4.48%
  3       0.0513     0.0529    +3.12%
  4       0.0526     0.0518    -1.52%
  5       0.0522     0.0521    -0.19%
                     ───────
                     mean:     -1.22%

8x4 is structurally dead code for DFN3 because every MM op has N >= 8, so the strategizer's max(nr*mr) tiebreaker always picks 8x8 (=64) over 8x4 (=32). Mean -1.22% delta is within thermal/system noise.

Tightens scope to the four kernels that pull their weight:
- GEMV: wasm_f32_4x1, wasm_f32_8x1, wasm_f32_16x1
- MM: wasm_f32_8x8 (alongside existing wasm_f32_4x4)

Removes 268 lines (kernel function + doc + MMMRustKernel! registration).
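
The per-cycle deltas and their mean can be reproduced from the RTF pairs. This is a small sketch; the sign convention (no-8x4 relative to with-8x4) is inferred from the published numbers rather than stated in the commit, and the function names are hypothetical.

```rust
/// Relative change of the no-8x4 RTF against the with-8x4 RTF, in percent.
/// Convention inferred from the table: negative means no-8x4 was faster.
pub fn delta_pct(with_8x4: f64, no_8x4: f64) -> f64 {
    (no_8x4 - with_8x4) / with_8x4 * 100.0
}

/// Mean of the per-cycle deltas for a list of (with-8x4, no-8x4) RTF pairs.
pub fn mean_delta_pct(cycles: &[(f64, f64)]) -> f64 {
    let sum: f64 = cycles.iter().map(|&(w, n)| delta_pct(w, n)).sum();
    sum / cycles.len() as f64
}
```

Feeding in the five pairs from the table gives a mean that rounds to the -1.22% reported, with individual cycles ranging from about -4.5% to +3.1%.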

- Wrap each kernel body in `unsafe { ... }` (matches the pattern upstream main uses for `kernel_f32_4x4`); `use std::arch::wasm32::*;` stays outside the inner block.
- `cargo fmt` over the full file.
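
The pattern described in the first item looks roughly like the sketch below. It is illustrative only: `sum4` is a hypothetical function, not one of the kernels, and it uses raw-pointer reads instead of `std::arch::wasm32` intrinsics so that it compiles on any target. One likely motivation for the pattern is the `unsafe_op_in_unsafe_fn` lint, which warns by default in the Rust 2024 edition, though the commit message only cites consistency with upstream.

```rust
/// Hypothetical stand-in for a kernel entry point.
///
/// # Safety
/// `ptr` must point to at least 4 readable, initialized `f32` values.
pub unsafe fn sum4(ptr: *const f32) -> f32 {
    // In the real kernels, `use std::arch::wasm32::*;` would sit here,
    // outside the inner block.
    unsafe {
        let mut acc = 0.0f32;
        for i in 0..4 {
            // Each raw-pointer read is an unsafe operation, so it lives
            // inside the explicit `unsafe` block.
            acc += *ptr.add(i);
        }
        acc
    }
}
```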

czoli1976 added 2 commits April 27, 2026 18:12

czoli1976 mentioned this pull request Apr 28, 2026

DFN3 WASM optimization: tract 0.21 -> 0.22.1 + WASM SIMD kernel kit czoli1976/DeepFilterNet#1

Open

czoli1976 mentioned this pull request Apr 28, 2026

linalg/wasm: add nr=1 (GEMV) f32x4 kernel #1

Closed

czoli1976 changed the title ~~linalg/wasm: full WASM SIMD kernel kit (4x1, 8x1, 16x1, 8x4, 8x8 + per-M dispatcher)~~ linalg/wasm: WASM SIMD kernel kit (4x1, 8x1, 16x1, 8x8 + per-M dispatcher) Apr 28, 2026

linalg/wasm: address review feedback — inner unsafe blocks + cargo fmt

d925624


czoli1976 force-pushed the add-wasm-f32-full-kernel-kit branch from b82d1f0 to d925624 on April 28, 2026 10:30

czoli1976 wants to merge 4 commits into main from add-wasm-f32-full-kernel-kit
