linalg/wasm: WASM SIMD kernel kit (4x1, 8x1, 16x1, 8x8 + per-M dispatcher)#2
Open
czoli1976 wants to merge 4 commits into
Adds wasm_f32_4x1, a 4-row × 1-col WASM SIMD kernel that closes the gap between WASM and the other targets. The other archs (x86_64 fma, arm64simd, arm64fp16, generic) all register nr=1 kernels and wire them through mmv_f32. WASM was the lone outlier, with only wasm_f32_4x4 registered, so kernel_selection::strategize() was forced to pick the 4x4 kernel for op.n.is_one() cases, wasting 75% of column-tile work per call.
The new kernel:
- Mirrors wasm_f32_4x4's FusedKerSpec match arms exactly
- Uses one f32x4 accumulator (4 rows × 1 col packed)
- Is registered with ImplementationQuality::TargetOptimized so the strategizer's nr()==1 preference wins over wasm_f32_4x4 for N=1 ops
- Is the new target of the mmv_f32 selector (mirrors the pattern set by other archs in tract-linalg/src/lib.rs::generic())
Output is bit-equivalent to scalar generic_f32_4x1 (verified end-to-end on a streaming RNN model: same SHA256 across 1.3M audio samples between simd128-only and simd128+gemv builds).
Tests auto-generated by the existing MMMRustKernel! macro provide 54 tests covering every FusedKerSpec arm, frame-level matmul, and property-based random matrix configurations, all passing on wasm32-wasip1 + wasmtime.
Measured impact on a real streaming-audio model (DFN3 encoder, M1 Pro, Node 20): RTF drops by 17% (0.108 → 0.090, per-frame 1.077 ms → 0.901 ms), with bit-identical audio output.
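To make the "one f32x4 accumulator" idea and the 75% figure concrete, here is a minimal portable sketch. It is not the kernel from this PR: `gemv_4x1`, `wasted_column_fraction`, and the panel-packed layout of `packed_a` are hypothetical names and assumptions chosen for illustration, and a plain `[f32; 4]` stands in for the SIMD register so the code runs on any target.

```rust
/// Portable model of a 4-row x 1-col GEMV micro-kernel.
/// Assumption: `packed_a` holds K panels of 4 row values each, and `b`
/// holds the K entries of the single right-hand column.
pub fn gemv_4x1(packed_a: &[f32], b: &[f32]) -> [f32; 4] {
    assert_eq!(packed_a.len(), 4 * b.len());
    // Stands in for a single f32x4 accumulator (4 rows x 1 col).
    let mut acc = [0.0f32; 4];
    for (panel, &bk) in packed_a.chunks_exact(4).zip(b) {
        // On wasm32 this inner loop would presumably become a splat of
        // `bk`, a 128-bit load of `panel`, a multiply and an add.
        for r in 0..4 {
            acc[r] += panel[r] * bk;
        }
    }
    acc
}

/// Fraction of computed output columns that are thrown away when an
/// `nr`-wide tile is used on a matrix with `n` columns.
pub fn wasted_column_fraction(nr: usize, n: usize) -> f64 {
    let computed = (n + nr - 1) / nr * nr; // columns rounded up to whole tiles
    (computed - n) as f64 / computed as f64
}
```

With `nr = 4` and `n = 1`, `wasted_column_fraction` gives 0.75, which matches the 75% of column-tile work cited above; with `nr = 1` nothing is wasted.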
Adds four kernels on top of the nr=1 GEMV kernel (parent commit):
- wasm_f32_8x1 taller GEMV (2 v128 accumulators)
- wasm_f32_16x1 tallest GEMV (4 v128 accumulators)
- wasm_f32_8x4 taller MM (8 v128 accumulators)
- wasm_f32_8x8 squarer MM (16 v128 accumulators, fills WASM's
  logical 16-register limit)
Wires Ops::mmm_f32 to wasm_f32_8x8 and Ops::mmv_f32 to a per-M
dispatcher (4x1 for M<=7, 8x1 for M<=15, 16x1 otherwise).
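The per-M dispatch rule can be sketched as a small pure function. This is an illustrative model only: the enum `GemvKernel` and the function `pick_gemv_kernel` are hypothetical names, and the real wiring in `Ops::mmv_f32` may be shaped differently.

```rust
/// Hypothetical labels for the three GEMV kernels named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GemvKernel {
    F32x4x1,
    F32x8x1,
    F32x16x1,
}

/// Thresholds taken from the commit message:
/// 4x1 for M <= 7, 8x1 for M <= 15, 16x1 otherwise.
pub fn pick_gemv_kernel(m: usize) -> GemvKernel {
    match m {
        0..=7 => GemvKernel::F32x4x1,
        8..=15 => GemvKernel::F32x8x1,
        _ => GemvKernel::F32x16x1,
    }
}
```

The boundaries sit just below multiples of the taller tile height, so a taller kernel is only chosen once at least one of its tiles can be filled completely.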
DFN3 cumulative impact (M1 Pro, Node 20, 27.29s clip):
+simd128 + 4x4 only: RTF 0.108
+ 4x1 (parent commit): RTF 0.0902
+ 8x1: RTF 0.0696
+ 8x4 + 16x1: RTF 0.0559
+ 8x8: RTF 0.0516
Net vs Mezon production baseline (RTF 0.1290): -60%, 2.5x faster.
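The "-60%, 2.5x" figures follow directly from the two RTF values. A short sketch of the arithmetic, with hypothetical helper names:

```rust
/// Percent change from `before` to `after` (negative means faster).
pub fn relative_change_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Speedup factor: how many times smaller `after` is than `before`.
pub fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

Calling these with the baseline RTF 0.1290 and the final RTF 0.0516 reproduces a change of about -60% and a speedup of about 2.5x.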
Audio output bit-identical to scalar reference (SHA256 match across
1.3M samples / 27.29s of denoised audio). All five new kernels
register at TargetOptimized quality; 1508/1508 tests pass on
wasm32-wasip1 + wasmtime + simd128 (54 tests per new kernel via
auto-generated MMMRustKernel! macro).
Internal-fork branch documenting the full kit. The parent 4x1
branch (add-wasm-f32-4x1-gemv-kernel) is kept separately as the
upstream-PR-ready scope (cf. sonos#2161).
A/B benched with-8x4 vs no-8x4 on DFN3 (5 alternating cycles, 3 runs each):
Cycle   with-8x4   no-8x4    Δ%
1       0.0532     0.0516    -3.01%
2       0.0536     0.0512    -4.48%
3       0.0513     0.0529    +3.12%
4       0.0526     0.0518    -1.52%
5       0.0522     0.0521    -0.19%
                             ───────
                     mean:   -1.22%
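The Δ% column appears to be the no-8x4 RTF relative to the with-8x4 RTF. Under that assumption, the table can be recomputed with the sketch below; the function names are hypothetical and the sample standard deviation is added only as a rough gauge of cycle-to-cycle spread.

```rust
/// (with-8x4, no-8x4) RTF pairs copied from the five cycles in the table.
pub const CYCLES: [(f64, f64); 5] = [
    (0.0532, 0.0516),
    (0.0536, 0.0512),
    (0.0513, 0.0529),
    (0.0526, 0.0518),
    (0.0522, 0.0521),
];

/// Percent change of the no-8x4 RTF relative to the with-8x4 RTF.
pub fn delta_pct(with_8x4: f64, no_8x4: f64) -> f64 {
    (no_8x4 - with_8x4) / with_8x4 * 100.0
}

/// Arithmetic mean of a non-empty slice.
pub fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample standard deviation (n - 1 denominator), needs at least 2 values.
pub fn sample_std_dev(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let var = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() - 1) as f64;
    var.sqrt()
}
```

Mapping `delta_pct` over `CYCLES` and taking `mean` lands close to the reported -1.22%, and the per-cycle spread from `sample_std_dev` is larger than the magnitude of that mean, which is consistent with reading the difference as noise.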
8x4 is structurally dead code for DFN3 because every MM op has N >= 8,
so the strategizer's max(nr*mr) tiebreaker always picks 8x8 (=64) over
8x4 (=32). Mean -1.22% delta is within thermal/system noise.
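The tiebreaker argument can be modelled in a few lines. This is a deliberately simplified sketch that only captures the "largest nr*mr wins" rule described above; `Tile` and `pick_mm_kernel` are hypothetical and do not reproduce the actual logic of `kernel_selection::strategize()`.

```rust
/// Hypothetical description of a matmul kernel tile.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Tile {
    pub name: &'static str,
    pub mr: usize,
    pub nr: usize,
}

/// Simplified tiebreaker: among otherwise equal candidates,
/// prefer the tile with the largest mr * nr footprint.
pub fn pick_mm_kernel(candidates: &[Tile]) -> Option<Tile> {
    candidates.iter().copied().max_by_key(|t| t.mr * t.nr)
}
```

Given 4x4 (16), 8x4 (32) and 8x8 (64) as candidates, this rule always returns 8x8, so under this model an 8x4 kernel would never be selected while 8x8 is registered and eligible.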
Tightens scope to the four kernels that pull their weight:
- GEMV: wasm_f32_4x1, wasm_f32_8x1, wasm_f32_16x1
- MM: wasm_f32_8x8 (alongside existing wasm_f32_4x4)
Removes 268 lines (kernel function + doc + MMMRustKernel! registration).
- Wrap each kernel body in `unsafe { ... }` (matches the pattern
upstream main uses for `kernel_f32_4x4`); `use std::arch::wasm32::*;`
stays outside the inner block.
- `cargo fmt` over the full file.
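The wrapping pattern from the first bullet looks roughly like the sketch below. `sum4` is a hypothetical stand-in for a kernel body, and `std::ptr::read_unaligned` replaces the `std::arch::wasm32` intrinsics so the example compiles on any target. One plausible motivation, stated here as an assumption, is the `unsafe_op_in_unsafe_fn` lint, which asks for an explicit `unsafe` block even inside an `unsafe fn`.

```rust
/// Hypothetical stand-in for a kernel entry point.
///
/// # Safety
/// `src` must point to at least 4 readable `f32` values.
pub unsafe fn sum4(src: *const f32) -> f32 {
    // The `use` stays outside the inner block, as in the bullet above.
    use std::ptr::read_unaligned;
    unsafe {
        let mut acc = 0.0f32;
        for i in 0..4 {
            // Every unsafe operation lives inside the explicit block.
            acc += read_unaligned(src.add(i));
        }
        acc
    }
}
```

Callers still need their own `unsafe` block, for example `unsafe { sum4(values.as_ptr()) }` on a slice of at least four elements.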
b82d1f0 to d925624
Summary
Canonical fork-internal PR for the WASM SIMD kernel work in Vonage's DFN3 build chain. The upstream PR to `sonos/tract` will be opened from this same branch (`add-wasm-f32-full-kernel-kit`), per kali's preference for a single PR (sonos/tract#2161).
PR #1 (scoped 4x1 only) is closed as superseded.
What this contributes
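Provides MM, MV, and a reasonable choice of tile sizes for `linalg/src/wasm.rs`:
- `wasm_f32_4x1`, `wasm_f32_8x1`, `wasm_f32_16x1`
- `wasm_f32_8x8` (alongside existing `wasm_f32_4x4`)
Selector wiring in `wasm::plug()`: per the commit message, `Ops::mmm_f32` goes to `wasm_f32_8x8` and `Ops::mmv_f32` to a per-M dispatcher (4x1 for M<=7, 8x1 for M<=15, 16x1 otherwise).
Cumulative impact (DFN3 encoder + erb_dec + df_dec, M1 Pro, Node 20, 27.29s clip)
Rows of the comparison table:
- … `+simd128`)
- `+simd128` + existing `wasm_f32_4x4` only
- `wasm_f32_4x1`
- `wasm_f32_8x1`
- `wasm_f32_16x1`
- `wasm_f32_8x8`
Audio output bit-identical to scalar reference (SHA256 match, 27.29s).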
8x4 considered and dropped
A `wasm_f32_8x4` variant was prototyped (commit 32b23be) and dropped (commit 9c45f8d) after a controlled A/B confirmed it's structurally dead code for DFN3: every MM op has N ≥ 8, so the strategizer's `max(nr*mr)` always picks `8x8` over `8x4`. Mean A/B delta was -1.22% across 5 alternating cycles (within noise).
Tests
`MMMRustKernel!` macro auto-generates 54 tests per kernel; full tract-linalg suite passes on `wasm32-wasip1` + wasmtime + `+simd128`.
Companion
czoli1976/DeepFilterNet#1: DFN3-side integration (tract 0.21 → 0.22.1 + `[patch.crates-io]` wiring at this branch).