linalg/arm64: NEON-fp16 hardswish_f16, silu_f16, gelu_f16 kernels by czoli1976 · Pull Request #12 · czoli1976/tract

czoli1976 · 2026-05-28T18:32:06Z

Summary

Fills the three element-wise f16 slots in arm64fp16::plug that were previously falling through to the generic scalar fallback. Brings the arm64fp16 element-wise coverage to parity with arm64simd's f32 set (sigmoid, tanh, hardswish, silu, gelu, leaky_relu).

This is the ARM half of the asymmetry the AVX-512 PR set (sonos#2303-sonos#2307, sonos#2310, sonos#2311) flagged in the maintainer review pack: x86 already covers all 6 f16 activations via sonos#2310 / #8; on ARM, only 3 of 6 were f16-accelerated. After this PR all 6 fire on arm64fp16.

Independent of the RmsNorm work in sonos#2311 / #9 / #11. Targets base/sonos-main directly.

What

Op	Implementation	Notes
`hardswish_f16`	Full native NEON-fp16 inline asm. 16 f16 / iter via `ldp q4, q5` + `fadd/fmin/fmax/fmul .8h`.	Pure-fp16 hot path; no transcendental, no f32 round-trip. Mirrors `arm64simd_hardswish_f32_8n` with `.8h` lanes.
`silu_f16`	Compose-via-f32-scratch (CHUNK = 128). Each chunk f16→f32 (`fcvtl` via auto-vec `.to_f32()`), `arm64simd_sigmoid_f32_4n` in place, multiply by saved original, f32→f16 back.	Mirrors the AVX-512 x86 #8 / sonos sonos#2310 pattern. Routes through f32 for precision because the sigmoid+multiply chain in f16 would risk SuperApproximate tolerance failures.
`gelu_f16`	Same compose-via-f32-scratch pattern. Cube `xxx` and pre-scale by `sqrt(2/pi)` in f32, `arm64simd_tanh_f32_4n` in place, finish `0.5 * x * (1 + tanh)` in f32, f32→f16 back.	f32 internal because f16's 11-bit mantissa loses precision on the cube for the `

Plugged in arm64::plug() next to the existing tanh_f16 / sigmoid_f16 / leaky_relu_f16 lines, behind the same has_fp16() gate. Hosts without fp16 (older ARMv8.0) keep falling through to the generic scalar kernels.

Risk

fp16 is gated via has_fp16() so non-fp16 hosts skip the plug step entirely.
Pure addition: no existing kernel modified, no Ops field semantics changed.
The compose patterns for silu_f16 and gelu_f16 reuse arm64simd_sigmoid_f32_4n and arm64simd_tanh_f32_4n which are already validated upstream.

Test plan

cargo check --target aarch64-unknown-linux-gnu -p tract-linalg — clean
cargo check --target aarch64-unknown-linux-gnu -p tract-linalg --tests — clean
cargo check -p tract-linalg (host x86_64) — unchanged
cargo fmt --all -- --check clean
cargo clippy clean on the new files
Runtime tests on real aarch64 hardware — frame tests (hardswish_frame_tests! / silu_frame_tests! / gelu_frame_tests! against the scalar f16 reference at SuperApproximate tolerance) compile cross-target; please run on Apple Silicon / Cortex-A before merging. The kernels mirror established algorithms but no native aarch64 runner was available locally during authoring.
End-to-end bench on aarch64 — arm64fp16_silu_f16_8n and arm64fp16_gelu_f16_8n should be substantially faster than the generic scalar f16 baseline (which goes per-element through f16.to_f32() then scalar transcendentals). Expected shape similar to the AVX-512 linalg/x86_64: add AVX-512 f16 element-wise activations (stacked on #2304) sonos/tract#2310 numbers (4.6× silu, 6.7× gelu over generic) — actual ratios depend on Apple Silicon / Cortex-A fp16 throughput.

Notes for review

The compose-via-f32-scratch decision for silu/gelu (instead of full native fp16 over arm64fp16_sigmoid_f16_8n / arm64fp16_tanh_f16_8n) is deliberate but reversible. A future revisit could write fully-native-fp16 versions that stay in .8h lanes throughout — that would skip the conversion at the IO boundary and might be 1.5-2× faster on Apple Silicon (which has 8-wide native f16 SIMD versus 4-wide f32). Trade-off is precision on the cube term; SuperApproximate tolerance may not hold without f32 internal accumulation. Easy follow-up if the perf delta motivates it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Generated by Claude Code

`TypedModelPatch::shunt_outside` leaves the shunted node in the graph, but the NNEF `patch` transform also implicitly removed model inputs whose name appeared on the LHS. That hidden side-effect made `patch` do two things at once: substitute a wire, and trim the interface. Drop the trimming. Add a sibling `select_inputs(inputs: [...])` transform shaped like `select_outputs`. The pulse pipeline now reads: -t 'patch(body: "length = tract_core_shape_of(input_signal)[1];")' \ -t 'select_inputs(inputs: ["input_signal"])' \ -t 'select_outputs(outputs: ["processed_signal"])' \ -t 'pulse(symbol: ..., pulse: ...)' Discarded Sources stay in the graph until declutter prunes them. Wire-up: `Graph::select_inputs_by_name` (mirror of `select_outputs_by_name`) + `with_inputs_by_name` + transform registration. Updated harness/nemotron + nemo-nemotron-asr + nemo-nemotron-streaming-asr to add the explicit `select_inputs` step.

The 'without-default-features' job in full.yml (cargo check -p tract-cli --no-default-features) regressed after the cuda-12XXX split: cudarc and tract-cuda were still pulled in unconditionally on linux/windows targets, so stripping the cuda-13000 default left cudarc with no API-version feature and its build script panicked. Make both deps optional in tract-cli and tract-libcli, and have each cuda-XXXXX feature pull them in (dep:cudarc + dep:tract-cuda + tract-cuda/cuda-XXXXX + tract-libcli/cuda). Adds a marker 'cuda' feature so cudarc-touching code in bench.rs / dump.rs / libcli/lib.rs can gate cleanly. test-cuda explicitly opts into cuda-13000 (workspace dep has default-features=false now), so 'cargo test -p tract-cuda -p test-cuda' keeps building.

Unify the four overlapping names for 'bind a symbol to a value across the model graph' under one verb: - core: `TypedModel::substitute_symbols` → `set_symbols` - core: `TypedOp::substitute_symbols` trait method → `set_symbols` - transform name: `concretize_symbols(values: …)` → `set_symbols(values: …)` - Rust API: `ConcretizeSymbols` → `SetSymbols` - Python API: `tract.ConcretizeSymbols` → `tract.SetSymbols` The CLI `--set B=1` flag was already aligned and is unchanged. No deprecation aliases — hard rename across cli, harness scripts, examples and Python bindings. The Rust API builder gains a `SetSymbols::expr(name, str)` companion to `value(name, i64)` so callers can pass TDim expressions (e.g. `'2*S'`) the way the CLI `--set` and the transform already do. `TDim::substitute` / `TDim::substitute_all` are unchanged: they operate on a single TDim expression, not on the model, and "substitute" is the accurate verb for that level.

The top-level `--set` flag was already TDim-aware via `parse_set_subs` in params.rs; the `run` subcommand had a parallel `--set` flag that only accepted plain i64. Parse RHS as a TDim against the model's symbol scope and reduce to i64 with the symbols set so far on the command line, so `run --set FOO=2 --set T=2*FOO` resolves cleanly. Order is CLI-significant: a symbol referenced on the RHS must be set to its left. Errors out with the unresolved name in the message.

The optimized Scan body runs the same plan with the same shapes every timestep, so resolve its symbols once, reset between iters without discarding them (reset_turn_keep_symbols), and reuse one drained input buffer -- instead of a full model_state.run() cycle (set_inputs -> resolve_symbols -> exec -> outputs -> reset_turn) per timestep. Bit-identical to the old path across GRU/LSTM/RNN + df_dec. No measurable wall-clock impact on fixed main (within +/-1% noise on gru/lstm/rnn 128/50 & 256/100 and df_dec, single-thread); kept as a cleanup of the per-iter re-entry path, not as a perf change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prefill-only GroupQueryAttention lowered onto tract Sdpa: reshapes Q/K/V to 4D, applies an explicit lower-triangular causal mask, and returns present_key/present_value (the reshaped K/V). Sdpa handles the grouped-query head sharing (kv_num_heads < num_heads). Decode-step KV cache, internal rotary (do_rotary), local-window attention and softcap are rejected with clear errors. Validated against onnxruntime across head_size 8/16/64, several num_heads/kv_num_heads ratios (incl. multi-query kv=1) and batch>1: attention output matches to <=3.6e-7 and present_key/present_value are bit-exact. ORT's GroupQueryAttention prefill is standard causal grouped-query attention; the seqlens_k input is the 0-indexed position of the last token (total_sequence_length - 1), not the token count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pin the resolved dependency graph so debug builds, release artifacts, SBOMs and security audits all see the same versions. CI runs against this lockfile; `cargo update` is the explicit knob for bumping deps.

For each tract-cli release artifact (per target triple), generate both CycloneDX and SPDX SBOMs from the workspace (Cargo.lock-driven) via anchore/sbom-action and upload them alongside the .tgz. Also pass --locked to the release build so the SBOM matches the resolved deps exactly. The sbom-action ref is currently the v0.18.0 tag — dependabot github-actions runs weekly and will SHA-pin on its next pass.

Each tract-cli release tarball now gets two GitHub attestations (CycloneDX and SPDX SBOM, via actions/attest-sbom). Anyone can verify after download with: gh attestation verify tract-<triple>-<version>.tgz --owner sonos Requires `id-token: write` + `attestations: write` on the job. sbom-action's `upload-release-assets: false` keeps the SBOM files out of its own upload path so the explicit softprops step is the single source of release artifacts.

- cargo auditable wraps the release build so the resolved dependency graph lands inside the binary itself. Consumers can recover the SBOM with `cargo audit bin tract` without needing the published .cdx.json / .spdx.json files. - actions/attest-build-provenance@v2 signs the .tgz with provenance metadata (workflow ref, commit SHA, runner). Combined with the existing SBOM attestations this lands at SLSA Build Level 3.

Pinned commits (latest stable as of writing): - anchore/sbom-action @ e22c389 (v0.24.0) - actions/attest-sbom @ c604332 (v4.1.0) - actions/attest-build-provenance @ a2bbfa2 (v4.1.0) Matches the existing SHA + comment convention used for actions/checkout and softprops/action-gh-release; dependabot's github-actions group will keep them current.

Two-part change so consumers can audit the deps that landed in the tract Python wheel without needing to re-clone the Rust workspace: 1. `api/py/pyproject.toml` (Linux + macOS cibuildwheel before-build): install cargo-auditable and write a one-line bash shim that prefixes `auditable` to every cargo invocation. setuptools_rust honours $CARGO (build.py:97), so pointing CARGO at the shim makes the Rust .so inside the wheel carry its dep graph in the `.dep-v0` ELF/Mach-O section. Windows wheels stay as-is for now (TODO comment). 2. `.github/workflows/wheels.yml` + `.github/scripts/inject_wheel_sboms.py`: after cibuildwheel emits each .whl, install syft (via anchore/sbom-action/download-syft, SHA-pinned), unpack the wheel, scan its contents (syft's rust-audit-binary cataloger reads the embedded cargo-auditable section), drop sbom.cdx.json + sbom.spdx.json into `<dist-info>/sboms/` per PEP 770, and re-pack via `wheel pack` (which regenerates RECORD with hashes). Smoke-tested locally on a sample wheel: SBOMs end up at the right path and RECORD has correct sha256 entries.

atty (0.2.x) is unmaintained and triggers RUSTSEC-2021-0145 on SBOM audits. It's only used in two places — both `is stderr a TTY` checks in `tract hwbench` — and std::io::IsTerminal (stable since 1.70, well below tract's MSRV) is a drop-in. `cargo tree -i atty` after the change reports the crate is no longer in the workspace dep graph.

runtime_for_name("gpu") → first GPU backend whose `check()` passes (metal, then cuda); error if none are available. runtime_for_name("gpu-or-cpu") → same lookup, but falls through to the `default` CPU runtime instead of erroring. No new mechanism — both names walk the existing inventory and use each backend's existing `check()` to decide availability. Backend-specific names (`cuda`, `metal`) still work as before.

…or_name The CPU runtime now reports its own name as `cpu` (which is what it is), so `list-runtimes` shows `cpu`, `cuda`, `metal` … instead of the misleading `default`. Back-compat for callers passing `default` is handled by a one-line alias in `runtime_for_name` rather than by registering two runtimes or by polluting the trait — the alias only affects name lookup, not the inventory.

The `tensorflow` 0.21.0 crate (Rust binding for libtensorflow) was only pulled in behind the dead `conform` cargo feature — which gated `tract compare --tf` (compare tract output against running on libtensorflow on the same model). The feature isn't enabled in any GitHub workflow; only a stranded `.travis/tf.sh` ever ran it. The upstream `tensorflow` crate hasn't shipped since 2023-08-15 and pins to rust-protobuf 2.27.x, which trips RUSTSEC-2024-0437. Drop the feature and all its plumbing. Tract's own `.pb` parsing (used by `-t transformers_detect_all` and the `tf` cargo feature in tract-cli) goes through prost and is unaffected — the `tract-tensorflow` crate stays, just without the libtensorflow runtime. Cargo.lock shrinks by ~350 lines as a side-effect.

The LayerNorm op's `wire` expansion casts `normalized` back to fact.datum_type *before* applying scale/bias, then multiplies that result with `cast_scale` (which is still in self.datum_type, F32). With F16 inputs this becomes F16 × F32, whose output is downgraded to F32 by `mul()`. The inference rule then asserts `outputs[0].datum_type == inputs[0].datum_type` (F16) against the actual F32 output, failing `into_typed()` with: Output mismatch after rewiring expansion for output #0: expected 1,256,384,F16 got 1,256,384,F32 Fix: defer the cast back to fact.datum_type until after all scale/bias operations. Now the expansion stays entirely in self.datum_type (F32) through normalized × scale + bias, and casts only the final result. Behavior is unchanged for F32 inputs (the final cast is a no-op when fact.datum_type == self.datum_type). Reproduced with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 exported via `optimum.exporters.onnx.main_export(..., dtype="fp16")` and loaded with `into_optimized().into_runnable()`.

The single-thread MMM tile walk used a naive nested loop, re-streaming the full inner operand (all of A in col-outer / B in row-outer) per panel at large k, which is memory/L1-bound. The multithread path already 2D-blocks the panel grid (chunk_grid); this brings the same blocking to the single-thread path, with the block edge cache-derived (detected L2/3, conservative 256 KiB fallback) so it stays L2-resident across hardware and never over-blocks a cache it cannot see. Bit-identical: it only reorders independent tiles (each computes its full-k reduction into a disjoint C region). The block-edge floor of 1 degrades exactly to the naive loop; the cap of 16 matches the multithread chunk_grid blocking already shipped on all platforms. Frame-level, so all kernels benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV / multithreaded shapes are unchanged. Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against the naive reference (the existing frame proptests only reach 3 panels). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant (it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt --check` failed in CI. Pure formatting; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Passing 'v0.24.0' to post-release.sh writes 'version = "v0.24.0"' into every Cargo.toml — invalid semver, breaks the workspace, easy to do by muscle memory because the git tag does carry the 'v' prefix. Bail out early in both release.sh and post-release.sh when the argument doesn't match an unprefixed semver.

Hand-apply the suggestions clippy --fix can't (then_some, clamp, sort_by_key, drop of non-Drop views, an iterator rewrite, is_empty alongside len, merged identical cast branches) and local-allow the rest per policy (too_many_arguments / type_complexity on internal items; one needless_range_loop where the range is the pool rank, not the indexed slice's length).

P0 of the all-Rust bench toolkit: split the bench and llm-bench handlers into a run() that measures and returns structured metrics (BenchResult { metrics, iters }) and a thin handle() that just prints them. No behavior/output change — the subcommands are byte-identical — but the bench suite can now call run() and consume data instead of parsing stdout. Readings (time_to/rsz/active) come from the build probe and land in P1.

The bench-suite orchestrator spawns one fresh tract per bench and reads its metrics back over stdout. In --emit-jsonl mode the bench/llm-bench handlers print one {"metric":..,"value":..} line per metric and nothing else; logs already go to stderr, so stdout stays pure JSONL.

In --emit-jsonl mode the child parses its own readings.out and appends time_to/rsz/active for the model_ready and before_optimize checkpoints, replacing the shell's grep/sed/cut scraping with a typed parser. Values match the old scraping byte-for-byte on the same readings file.

tract bench-suite reads a TOML manifest and runs each bench in a fresh child process via --emit-jsonl, so memory readings get the cold process they need. It parses each child's JSONL stdout (any non-metric line is a hard failure for that run), prefixes the metric names, fetches models over HTTPS, pins the governor, and writes the metrics file -- replacing bundle-entrypoint.sh's fetch/scrape/retry shell. benches.toml ports the full bundle bench list. Feature-gated behind bench-suite (pulls toml/tar).

Add a bench_common module (series noise, red threshold, median, higher- better) and a bench-expectations subcommand that emits the per-metric 'metric expected threshold' file from bench-data history -- output is byte-identical to bench-expectations.py. bench-suite now takes that file and re-runs a bench, fresh child each time, while it stays out of threshold (same bar as a PR red), keeping the per-metric best up to --retry-max times; without it the suite is single-shot.

bench-suite can now build retry expectations itself from a bench-data checkout (--bench-data/--thresholds/--triple/--device) instead of requiring a pre-generated --expectations file. The Python script needed a separate job because it could not run on the constrained runners; the Rust path has no such limit, so the bench host can do it in one command. The standalone bench-expectations subcommand stays, sharing compute().

The bench step now runs 'tract bench-suite' instead of bundle-entrypoint.sh: it fetches models over HTTPS, builds this device's retry expectations inline from a read-only bench-data checkout, and retries. That removes the separate expectations job (its only reason was running bench-expectations.py off the Jetson, which lacks tomllib) and the AWS OIDC creds step (HTTPS fetch needs no auth once the model/ prefix is public). The cli is built --features bench-suite (cross.sh adds it for the Jetson). Append and report stay Python for now. Requires the tract-ci-builds model/ prefix to be public-read; until then the HTTPS fetch 403s.

tract bench-report replaces bench-report.py: it reads the per-device results + bench-data reference, computes movers with the shared bench_common red logic (now single-sourced with retry), and renders the PR comment + job summary from editable minijinja templates in .travis/. PR-comment output is byte-identical to the Python; the job summary matches but for one trailing blank line. read_metrics now keeps file order so equal-delta ties sort as before.

tract bench-append replaces bench-append.py: 4-sig-fig half-to-even rounding, null-padded columnar arrays, indent=0 one-value-per-line JSON -- byte-identical to the Python on fresh and append (so git diffs stay small). The bench workflow's nightly append and PR report steps now call the tract binary (the report job downloads the x86_64 artifact), so no job runs Python. Deletes bench_common.py, bench-expectations.py, bench-report.py, bench-append.py: the bench pipeline is Python-free.

zizmor flagged template-injection: ${{ needs.prepare.outputs.ref }} was expanded straight into the report step's run block. Route it through a PR_SHA env var like the rest of the workflow.

Carry the still-relevant fix/bench-retry-bugs fixes into the Rust suite (that PR edits the shell/python files this branch deletes): - single-source the comparison baseline as the median of recent non-null (bench_common::reference_value), used by both bench-expectations and the report, so retry and the PR red judge the same number (retry==red); - report writes no comment when there are no comparable metrics, and the workflow gates the report on !cancelled() + skips posting a missing file, so a superseded run can't overwrite a real comment with 'no regressions'; - pin the GPU graphics clock for the run, reset on drop (cuda free-boost variance), alongside the CPU governor pin. Also build/clippy/test the bench-suite feature in the crates workflow (it is not a default feature, so nothing else compiled it), and add unit tests for the threshold math.

The bench step hardcoded --cache-dir, overriding each self-hosted runner's CACHEDIR env (which points at its curated model cache) and forcing a full re-fetch into a cold dir. Drop the flag; the orchestrator already falls back --cache-dir -> $CACHEDIR -> ~/.cache, so this honors CACHEDIR like the old bundle did.

A bench's args can now be a plain string ("-i 264,40 --pulse 24") in addition to the TOML array, so a working argument line pastes in as-is. Custom deserializer; arrays keep working.

Convert every bench's args from a TOML array to the whitespace-string form now supported, so entries read like a command line. No behavior change.

The Bench report job runs on pull_request, which gets a read-only token on fork PRs and so can't comment (the comparison only landed in the job summary). Split posting out: the report job uploads the rendered comment as an artifact, and a new Bench-comment workflow runs on workflow_run (base context, write token) and posts it -- working for fork and same-repo PRs alike. The target PR is resolved from the run's head SHA, never from artifact contents, so a fork can't redirect the comment at another issue.

Resolving the PR from the run's head SHA returns nothing for a fork PR (the head commit lives in the fork, not this repo), so the trampoline posted no comment -- the exact case it exists for. Pass the PR number in the artifact instead, and validate it: fetch that PR and require its head SHA to equal the run's, so a fork can name only its own PR, never redirect at another issue. Mismatch (or a stale re-pushed run) skips quietly.

…e bits) Training-free affine quantize<->dequantize for the KV cache: keep every token but at fewer bits (configurable, 1..16). Keys per-CHANNEL (outlier channels get their own scale), Values per-TOKEN (KIVI, Liu et al. 2024). Gentler than evicting; works for any model. (CommVQ's RoPE-commutative codebook is a fancier follow-on.) Validated: round-trip error <= scale/2 and shrinks with bits; per-channel >> per-token on outlier channels; 8-bit near-lossless for attention output. Real GPT-2 (harness/ kv_quant_real.py): int8 ~0.5% attention deviation (near-lossless, 2x mem), graceful to int2; int4 per-channel-K beats per-token-K 1.75-1.9x on early layers. Memory = bits/16 of the f16 cache (int8 2x, int4 4x, int2 8x). 3 tests, fmt+clippy clean. Follow-on: packed-int storage + a quantized KV-cache op (dequant-on-attend), composing with the in-place (sonos#2321) / sliding-window (sonos#2327) caches; CommVQ codebook variant. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ansform Completes the KIVI-style KV-cache quantization integration: 1. QuantKeyCache: per-channel u8 storage for Keys. D channels each have a running scale; new tokens quantized under the current channel scale. Memory: T*D + D*8 bytes. 2. QuantValueCache: per-token u8 storage for Values. Each token D bytes + 2 f32 params. Memory: T*D + T*8 bytes (~4x vs f32 at large D). 3. QuantizedKvSdpa: stateful fused op (Op/EvalOp/TypedOp + OpState + freeze) that stores K/V in packed u8, dequantizes per-head on each decode step, attends via FlashSdpaOp (GQA handled). Real u8 bytes, not just float round-trip quality test. 4. QuantizedKvSdpaTransform: auto-wires {cache(K), cache(V), Sdpa} -> QuantizedKvSdpa. 6 tests: quant quality (3 existing) + packed_u8_saves_memory_vs_f32 (>3x saving) + quantized_kv_sdpa_runs_in_model (engine correctness: near-lossless vs f32 reference) + transform_fuses_cache_sdpa_to_quantized (structural auto-wiring). fmt+clippy clean, transformers 18/0 no regression. Configurable via the bits parameter (1..=16); int8 = near-lossless 4x vs f32 / 2x vs f16. CommVQ codebook variant is the follow-on. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tract_transformers_quantized_kv_sdpa primitive: axis + optional scale. Round-trip test: axis and scale survive write_to_tar -> model_for_read. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

VLA SVE2 implementation of the row-wise RmsNorm primitive added by the parent stack (sonos#2311 linalg slot + core/nn fast path; sonos#2314 NEON kernel). Plugs into Ops::rms_norm_f32 in sve::plug() when FEAT_SVE2 is present on Linux aarch64, overriding the NEON 4-lane kernel with wider lanes (vl-dependent) and a predicated tail (no scalar epilogue). Structure mirrors the NEON + AVX-512 kernels: Pass 1 — sum of squares via 4 svfloat32_t accumulator chains, 4*svcntw() lanes per iteration. Tail handled by a predicated svwhilelt_b32 loop over the residue — no scalar epilogue. Pass 2 — broadcast inv_std into inv_v, fmul/st1 each 4-vec chunk; same predicated tail. Width-agnostic by construction — identical correct output at any FEAT_SVE streaming vector length (128 → 2048 bits). Wider VL = wider lanes, fewer loop iterations, real perf scaling. Validation (QEMU-only — no SVE hardware locally): - 100 cases pass at SVL=128 (4 lanes), SVL=256 (8 lanes), SVL=512 (16 lanes) via qemu-aarch64 -cpu max,sve{128,256,512}=on. Coverage: every size 1..33, hidden ∈ {768..8192} × 9 tail residues, huge rows up to 32768, all-zero pathological. Bit-equivalent vs scalar within sqrt(n)-scaled tolerance. - Local M1 macOS build clean (tract_sve cfg gated out; new code is purely additive — Linux aarch64 + FEAT_SVE2 only). Expected gain over the NEON kernel scales with SVL: - 128-bit SVE (rare Neoverse-N1): ~0× (same width as NEON) - 256-bit SVE (Graviton G3/G4): ~1.3–1.8× - 512-bit SVE (Neoverse-V2 wide): ~2.5–4× (mirroring AVX-512 vs SSE) Perf number unmeasured pending SVE hardware (AWS Graviton free tier). Same validation shape as PR sonos#2268 (correctness via QEMU + bit-equivalent vs the NEON fallback). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

all_zero and matches_neon_bit_close called the SVE2 kernel without the has_sve2() check the other tests go through, so the cortex-a53 qemu CI job died with SIGILL (signal 4) on the first ungated test. Route all_zero through check() and add the explicit skip to the NEON cross-check. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ScaledMaskedSoftmax::eval hard-coded SoftmaxExp::Libc, so the fused attention softmax always ran scalar libm expf and never reached the linalg SIMD softmax kernels — a real perf gap. The naive fix (switch to SoftmaxExp::FastCompact) trades correctness for speed and fails the proptests two ways: 1. FastCompact's Schraudolph exp is ~0.5% off true softmax — outside the suite's Approximate tolerance (f32 rtol 5e-4), 30%+ outliers. 2. On a fully-masked row (all -inf) the FastCompact kernel pads the SIMD tail with f32::MIN and computes exp(f32::MIN - f32::MIN) ≈ 1, so the row sums to a nonzero value and yields a finite 0 where the scalar libc path (and the numpy reference) yield NaN (0 * 1/0). Instead, add an accurate vectorizable exp and route the fused softmax through it (mirrors ggml/llama.cpp, which kept an accurate vectorized expf for softmax rather than a coarse approximation): * linalg: `accurate_exp_f32`, a Cephes-style range-reduced exp (Cody-Waite ln2 split + degree-6 poly + 2^n by exponent construction). Measured max rel error ~1.9e-6 vs libc over the softmax domain [0, -60]. exp(0)==1 and exp(-inf)==0 exactly; deep underflow flushes to 0; NaN propagates. * linalg: `SSoftMaxL2Accurate` / `HSoftMaxL2Accurate` map-reduce kernels, exposed as `softmax2_accurate_{f32,f16}`. They pad the SIMD tail with -inf (not f32::MIN), so masked/padding lanes contribute exactly 0 and a fully-masked row sums to 0 -> NaN, matching libc and the reference. * core: new `SoftmaxExp::Accurate` variant + dispatch. * nnef: `exp = "accurate"` de/serialization round-trip. * transformers: ScaledMaskedSoftmax::eval uses SoftmaxExp::Accurate. New linalg tests validate the accurate exp against libc (not against itself, unlike the existing FastCompact frame test) and cover the fully-masked degenerate row. scaled_masked_softmax + sdpa proptests (f16/f32, raw/decluttered/optimized) pass on native and wasm32-wasip1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Use a bitmask flush instead of a terminal branch so LLVM can keep the exp dataflow straight-line and auto-vectorize softmax map-reduce loops, while preserving exact underflow-to-zero and NaN propagation behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use mul_add for range reduction and polynomial evaluation in accurate_exp_f32. This keeps the same numerical behavior while reducing scalar instruction count and improving vector-friendly straight-line code generation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kali and others added 23 commits May 28, 2026 11:00

build: track Cargo.lock

a5eeab5

Pin the resolved dependency graph so debug builds, release artifacts, SBOMs and security audits all see the same versions. CI runs against this lockfile; `cargo update` is the explicit knob for bumping deps.

changelog update

ff8abf2

fmt

5ca8dab

setup deny for cli

2614eea

fmt

1d6d1aa

czoli1976 force-pushed the feat/arm64-neon-fp16-activations branch 3 times, most recently from 457c825 to 9a5fa12 Compare May 30, 2026 06:36

mathieupoumeyrolsonos and others added 4 commits June 1, 2026 09:15

release 0.23.0-dev.6

9dcf880

post-release v0.23.0-pre

3ed1478

post-release 0.23.0-pre

4ffe58b

mathieupoumeyrolsonos and others added 27 commits June 17, 2026 13:28

clippy on mac

680af17

fmt

c720378

Update CHANGELOG.md

0fe31bc

bench: pass PR sha to bench-report via env, not run-block expansion

e0049ae

zizmor flagged template-injection: ${{ needs.prepare.outputs.ref }} was expanded straight into the report step's run block. Route it through a PR_SHA env var like the rest of the workflow.

bench: accept manifest args as a whitespace-split string too

f772c89

A bench's args can now be a plain string ("-i 264,40 --pulse 24") in addition to the TOML array, so a working argument line pastes in as-is. Custom deserializer; arrays keep working.

bench: write manifest args as paste-friendly strings

0e119b2

Convert every bench's args from a TOML array to the whitespace-string form now supported, so entries read like a command line. No behavior change.

Update CHANGELOG.md

da2d512

transformers: NNEF ser/de for QuantizedKvSdpa (registered)

885b53d

tract_transformers_quantized_kv_sdpa primitive: axis + optional scale. Round-trip test: axis and scale survive write_to_tar -> model_for_read. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kali force-pushed the feat/arm64-neon-fp16-activations branch from 476ef48 to a18d911 Compare June 18, 2026 14:01

czoli1976 and others added 2 commits June 19, 2026 13:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

linalg/arm64: NEON-fp16 hardswish_f16, silu_f16, gelu_f16 kernels#12

linalg/arm64: NEON-fp16 hardswish_f16, silu_f16, gelu_f16 kernels#12
czoli1976 wants to merge 155 commits into
base/sonos-mainfrom
feat/arm64-neon-fp16-activations

czoli1976 commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

czoli1976 commented May 28, 2026

Summary

What

Risk

Test plan

Notes for review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants