Skip to content

linalg/arm64: NEON-fp16 hardswish_f16, silu_f16, gelu_f16 kernels#12

Open
czoli1976 wants to merge 155 commits into
base/sonos-mainfrom
feat/arm64-neon-fp16-activations
Open

linalg/arm64: NEON-fp16 hardswish_f16, silu_f16, gelu_f16 kernels#12
czoli1976 wants to merge 155 commits into
base/sonos-mainfrom
feat/arm64-neon-fp16-activations

Conversation

@czoli1976

Copy link
Copy Markdown
Owner

Summary

Fills the three element-wise f16 slots in arm64fp16::plug that were previously falling through to the generic scalar fallback. Brings the arm64fp16 element-wise coverage to parity with arm64simd's f32 set (sigmoid, tanh, hardswish, silu, gelu, leaky_relu).

This is the ARM half of the asymmetry the AVX-512 PR set (sonos#2303-sonos#2307, sonos#2310, sonos#2311) flagged in the maintainer review pack: x86 already covers all 6 f16 activations via sonos#2310 / #8; on ARM, only 3 of 6 were f16-accelerated. After this PR all 6 fire on arm64fp16.

Independent of the RmsNorm work in sonos#2311 / #9 / #11. Targets base/sonos-main directly.

What

Op Implementation Notes
hardswish_f16 Full native NEON-fp16 inline asm. 16 f16 / iter via ldp q4, q5 + fadd/fmin/fmax/fmul .8h. Pure-fp16 hot path; no transcendental, no f32 round-trip. Mirrors arm64simd_hardswish_f32_8n with .8h lanes.
silu_f16 Compose-via-f32-scratch (CHUNK = 128). Each chunk f16→f32 (fcvtl via auto-vec .to_f32()), arm64simd_sigmoid_f32_4n in place, multiply by saved original, f32→f16 back. Mirrors the AVX-512 x86 #8 / sonos sonos#2310 pattern. Routes through f32 for precision because the sigmoid+multiply chain in f16 would risk SuperApproximate tolerance failures.
gelu_f16 Same compose-via-f32-scratch pattern. Cube x*x*x and pre-scale by sqrt(2/pi) in f32, arm64simd_tanh_f32_4n in place, finish 0.5 * x * (1 + tanh) in f32, f32→f16 back. f32 internal because f16's 11-bit mantissa loses precision on the cube for the `

Plugged in arm64::plug() next to the existing tanh_f16 / sigmoid_f16 / leaky_relu_f16 lines, behind the same has_fp16() gate. Hosts without fp16 (older ARMv8.0) keep falling through to the generic scalar kernels.

Risk

  • fp16 is gated via has_fp16() so non-fp16 hosts skip the plug step entirely.
  • Pure addition: no existing kernel modified, no Ops field semantics changed.
  • The compose patterns for silu_f16 and gelu_f16 reuse arm64simd_sigmoid_f32_4n and arm64simd_tanh_f32_4n which are already validated upstream.

Test plan

  • cargo check --target aarch64-unknown-linux-gnu -p tract-linalg — clean
  • cargo check --target aarch64-unknown-linux-gnu -p tract-linalg --tests — clean
  • cargo check -p tract-linalg (host x86_64) — unchanged
  • cargo fmt --all -- --check clean
  • cargo clippy clean on the new files
  • Runtime tests on real aarch64 hardware — frame tests (hardswish_frame_tests! / silu_frame_tests! / gelu_frame_tests! against the scalar f16 reference at SuperApproximate tolerance) compile cross-target; please run on Apple Silicon / Cortex-A before merging. The kernels mirror established algorithms but no native aarch64 runner was available locally during authoring.
  • End-to-end bench on aarch64arm64fp16_silu_f16_8n and arm64fp16_gelu_f16_8n should be substantially faster than the generic scalar f16 baseline (which goes per-element through f16.to_f32() then scalar transcendentals). Expected shape similar to the AVX-512 linalg/x86_64: add AVX-512 f16 element-wise activations (stacked on #2304) sonos/tract#2310 numbers (4.6× silu, 6.7× gelu over generic) — actual ratios depend on Apple Silicon / Cortex-A fp16 throughput.

Notes for review

The compose-via-f32-scratch decision for silu/gelu (instead of full native fp16 over arm64fp16_sigmoid_f16_8n / arm64fp16_tanh_f16_8n) is deliberate but reversible. A future revisit could write fully-native-fp16 versions that stay in .8h lanes throughout — that would skip the conversion at the IO boundary and might be 1.5-2× faster on Apple Silicon (which has 8-wide native f16 SIMD versus 4-wide f32). Trade-off is precision on the cube term; SuperApproximate tolerance may not hold without f32 internal accumulation. Easy follow-up if the perf delta motivates it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Generated by Claude Code

kali and others added 23 commits May 28, 2026 11:00
`TypedModelPatch::shunt_outside` leaves the shunted node in the graph,
but the NNEF `patch` transform also implicitly removed model inputs
whose name appeared on the LHS.  That hidden side-effect made
`patch` do two things at once: substitute a wire, and trim the
interface.  Drop the trimming.

Add a sibling `select_inputs(inputs: [...])` transform shaped like
`select_outputs`.  The pulse pipeline now reads:

  -t 'patch(body: "length = tract_core_shape_of(input_signal)[1];")' \
  -t 'select_inputs(inputs: ["input_signal"])'              \
  -t 'select_outputs(outputs: ["processed_signal"])'        \
  -t 'pulse(symbol: ..., pulse: ...)'

Discarded Sources stay in the graph until declutter prunes them.

Wire-up: `Graph::select_inputs_by_name` (mirror of
`select_outputs_by_name`) + `with_inputs_by_name` + transform
registration.  Updated harness/nemotron + nemo-nemotron-asr +
nemo-nemotron-streaming-asr to add the explicit `select_inputs` step.
The 'without-default-features' job in full.yml (cargo check -p tract-cli
--no-default-features) regressed after the cuda-12XXX split: cudarc and
tract-cuda were still pulled in unconditionally on linux/windows targets,
so stripping the cuda-13000 default left cudarc with no API-version
feature and its build script panicked.

Make both deps optional in tract-cli and tract-libcli, and have each
cuda-XXXXX feature pull them in (dep:cudarc + dep:tract-cuda +
tract-cuda/cuda-XXXXX + tract-libcli/cuda).  Adds a marker 'cuda'
feature so cudarc-touching code in bench.rs / dump.rs / libcli/lib.rs
can gate cleanly.

test-cuda explicitly opts into cuda-13000 (workspace dep has
default-features=false now), so 'cargo test -p tract-cuda -p test-cuda'
keeps building.
Unify the four overlapping names for 'bind a symbol to a value across
the model graph' under one verb:

  - core: `TypedModel::substitute_symbols` → `set_symbols`
  - core: `TypedOp::substitute_symbols` trait method → `set_symbols`
  - transform name: `concretize_symbols(values: …)` → `set_symbols(values: …)`
  - Rust API: `ConcretizeSymbols` → `SetSymbols`
  - Python API: `tract.ConcretizeSymbols` → `tract.SetSymbols`

The CLI `--set B=1` flag was already aligned and is unchanged.  No
deprecation aliases — hard rename across cli, harness scripts, examples
and Python bindings.

The Rust API builder gains a `SetSymbols::expr(name, str)` companion
to `value(name, i64)` so callers can pass TDim expressions (e.g.
`'2*S'`) the way the CLI `--set` and the transform already do.

`TDim::substitute` / `TDim::substitute_all` are unchanged: they
operate on a single TDim expression, not on the model, and "substitute"
is the accurate verb for that level.
The top-level `--set` flag was already TDim-aware via `parse_set_subs`
in params.rs; the `run` subcommand had a parallel `--set` flag that
only accepted plain i64.  Parse RHS as a TDim against the model's
symbol scope and reduce to i64 with the symbols set so far on the
command line, so `run --set FOO=2 --set T=2*FOO` resolves cleanly.

Order is CLI-significant: a symbol referenced on the RHS must be set
to its left.  Errors out with the unresolved name in the message.
The optimized Scan body runs the same plan with the same shapes every
timestep, so resolve its symbols once, reset between iters without
discarding them (reset_turn_keep_symbols), and reuse one drained input
buffer -- instead of a full model_state.run() cycle (set_inputs ->
resolve_symbols -> exec -> outputs -> reset_turn) per timestep.

Bit-identical to the old path across GRU/LSTM/RNN + df_dec. No measurable
wall-clock impact on fixed main (within +/-1% noise on gru/lstm/rnn 128/50
& 256/100 and df_dec, single-thread); kept as a cleanup of the per-iter
re-entry path, not as a perf change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prefill-only GroupQueryAttention lowered onto tract Sdpa: reshapes Q/K/V to
4D, applies an explicit lower-triangular causal mask, and returns
present_key/present_value (the reshaped K/V). Sdpa handles the grouped-query
head sharing (kv_num_heads < num_heads). Decode-step KV cache, internal
rotary (do_rotary), local-window attention and softcap are rejected with
clear errors.

Validated against onnxruntime across head_size 8/16/64, several
num_heads/kv_num_heads ratios (incl. multi-query kv=1) and batch>1: attention
output matches to <=3.6e-7 and present_key/present_value are bit-exact.

ORT's GroupQueryAttention prefill is standard causal grouped-query attention;
the seqlens_k input is the 0-indexed position of the last token
(total_sequence_length - 1), not the token count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pin the resolved dependency graph so debug builds, release artifacts,
SBOMs and security audits all see the same versions.  CI runs against
this lockfile; `cargo update` is the explicit knob for bumping deps.
For each tract-cli release artifact (per target triple), generate both
CycloneDX and SPDX SBOMs from the workspace (Cargo.lock-driven) via
anchore/sbom-action and upload them alongside the .tgz.  Also pass
--locked to the release build so the SBOM matches the resolved deps
exactly.

The sbom-action ref is currently the v0.18.0 tag — dependabot
github-actions runs weekly and will SHA-pin on its next pass.
Each tract-cli release tarball now gets two GitHub attestations
(CycloneDX and SPDX SBOM, via actions/attest-sbom).  Anyone can
verify after download with:

  gh attestation verify tract-<triple>-<version>.tgz --owner sonos

Requires `id-token: write` + `attestations: write` on the job.
sbom-action's `upload-release-assets: false` keeps the SBOM files
out of its own upload path so the explicit softprops step is the
single source of release artifacts.
- cargo auditable wraps the release build so the resolved dependency
  graph lands inside the binary itself.  Consumers can recover the
  SBOM with `cargo audit bin tract` without needing the published
  .cdx.json / .spdx.json files.
- actions/attest-build-provenance@v2 signs the .tgz with provenance
  metadata (workflow ref, commit SHA, runner).  Combined with the
  existing SBOM attestations this lands at SLSA Build Level 3.
Pinned commits (latest stable as of writing):
- anchore/sbom-action @ e22c389 (v0.24.0)
- actions/attest-sbom @ c604332 (v4.1.0)
- actions/attest-build-provenance @ a2bbfa2 (v4.1.0)

Matches the existing SHA + comment convention used for
actions/checkout and softprops/action-gh-release; dependabot's
github-actions group will keep them current.
Two-part change so consumers can audit the deps that landed in the
tract Python wheel without needing to re-clone the Rust workspace:

1. `api/py/pyproject.toml` (Linux + macOS cibuildwheel before-build):
   install cargo-auditable and write a one-line bash shim that
   prefixes `auditable` to every cargo invocation.  setuptools_rust
   honours $CARGO (build.py:97), so pointing CARGO at the shim makes
   the Rust .so inside the wheel carry its dep graph in the
   `.dep-v0` ELF/Mach-O section.  Windows wheels stay as-is for now
   (TODO comment).

2. `.github/workflows/wheels.yml` + `.github/scripts/inject_wheel_sboms.py`:
   after cibuildwheel emits each .whl, install syft (via
   anchore/sbom-action/download-syft, SHA-pinned), unpack the wheel,
   scan its contents (syft's rust-audit-binary cataloger reads the
   embedded cargo-auditable section), drop sbom.cdx.json +
   sbom.spdx.json into `<dist-info>/sboms/` per PEP 770, and
   re-pack via `wheel pack` (which regenerates RECORD with hashes).

Smoke-tested locally on a sample wheel: SBOMs end up at the right
path and RECORD has correct sha256 entries.
atty (0.2.x) is unmaintained and triggers RUSTSEC-2021-0145 on SBOM
audits.  It's only used in two places — both `is stderr a TTY`
checks in `tract hwbench` — and std::io::IsTerminal (stable since
1.70, well below tract's MSRV) is a drop-in.

`cargo tree -i atty` after the change reports the crate is no longer
in the workspace dep graph.
runtime_for_name("gpu")        → first GPU backend whose `check()`
                                 passes (metal, then cuda); error if
                                 none are available.
runtime_for_name("gpu-or-cpu") → same lookup, but falls through to
                                 the `default` CPU runtime instead
                                 of erroring.

No new mechanism — both names walk the existing inventory and use each
backend's existing `check()` to decide availability.  Backend-specific
names (`cuda`, `metal`) still work as before.
…or_name

The CPU runtime now reports its own name as `cpu` (which is what it
is), so `list-runtimes` shows `cpu`, `cuda`, `metal` … instead of
the misleading `default`.

Back-compat for callers passing `default` is handled by a one-line
alias in `runtime_for_name` rather than by registering two runtimes
or by polluting the trait — the alias only affects name lookup, not
the inventory.
The `tensorflow` 0.21.0 crate (Rust binding for libtensorflow) was
only pulled in behind the dead `conform` cargo feature — which gated
`tract compare --tf` (compare tract output against running on
libtensorflow on the same model).  The feature isn't enabled in any
GitHub workflow; only a stranded `.travis/tf.sh` ever ran it.

The upstream `tensorflow` crate hasn't shipped since 2023-08-15 and
pins to rust-protobuf 2.27.x, which trips RUSTSEC-2024-0437.  Drop
the feature and all its plumbing.

Tract's own `.pb` parsing (used by `-t transformers_detect_all` and
the `tf` cargo feature in tract-cli) goes through prost and is
unaffected — the `tract-tensorflow` crate stays, just without the
libtensorflow runtime.  Cargo.lock shrinks by ~350 lines as a
side-effect.
The LayerNorm op's `wire` expansion casts `normalized` back to
fact.datum_type *before* applying scale/bias, then multiplies that
result with `cast_scale` (which is still in self.datum_type, F32).

With F16 inputs this becomes F16 × F32, whose output is downgraded to
F32 by `mul()`. The inference rule then asserts
`outputs[0].datum_type == inputs[0].datum_type` (F16) against the
actual F32 output, failing `into_typed()` with:

    Output mismatch after rewiring expansion for output #0:
    expected 1,256,384,F16 got 1,256,384,F32

Fix: defer the cast back to fact.datum_type until after all scale/bias
operations. Now the expansion stays entirely in self.datum_type (F32)
through normalized × scale + bias, and casts only the final result.

Behavior is unchanged for F32 inputs (the final cast is a no-op when
fact.datum_type == self.datum_type).

Reproduced with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
exported via `optimum.exporters.onnx.main_export(..., dtype="fp16")`
and loaded with `into_optimized().into_runnable()`.
The single-thread MMM tile walk used a naive nested loop, re-streaming the
full inner operand (all of A in col-outer / B in row-outer) per panel at
large k, which is memory/L1-bound. The multithread path already 2D-blocks the
panel grid (chunk_grid); this brings the same blocking to the single-thread
path, with the block edge cache-derived (detected L2/3, conservative 256 KiB
fallback) so it stays L2-resident across hardware and never over-blocks a
cache it cannot see.

Bit-identical: it only reorders independent tiles (each computes its full-k
reduction into a disjoint C region). The block-edge floor of 1 degrades
exactly to the naive loop; the cap of 16 matches the multithread chunk_grid
blocking already shipped on all platforms. Frame-level, so all kernels
benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV /
multithreaded shapes are unchanged.

Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against
the naive reference (the existing frame proptests only reach 3 panels).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant
(it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt
--check` failed in CI. Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976 czoli1976 force-pushed the feat/arm64-neon-fp16-activations branch 3 times, most recently from 457c825 to 9a5fa12 Compare May 30, 2026 06:36
mathieupoumeyrolsonos and others added 4 commits June 1, 2026 09:15
Passing 'v0.24.0' to post-release.sh writes 'version = "v0.24.0"' into
every Cargo.toml — invalid semver, breaks the workspace, easy to do by
muscle memory because the git tag does carry the 'v' prefix. Bail out
early in both release.sh and post-release.sh when the argument doesn't
match an unprefixed semver.
mathieupoumeyrolsonos and others added 27 commits June 17, 2026 13:28
Hand-apply the suggestions clippy --fix can't (then_some, clamp, sort_by_key, drop
of non-Drop views, an iterator rewrite, is_empty alongside len, merged identical
cast branches) and local-allow the rest per policy (too_many_arguments /
type_complexity on internal items; one needless_range_loop where the range is the
pool rank, not the indexed slice's length).
P0 of the all-Rust bench toolkit: split the bench and llm-bench handlers into a
run() that measures and returns structured metrics (BenchResult { metrics, iters })
and a thin handle() that just prints them. No behavior/output change — the
subcommands are byte-identical — but the bench suite can now call run() and consume
data instead of parsing stdout. Readings (time_to/rsz/active) come from the build
probe and land in P1.
The bench-suite orchestrator spawns one fresh tract per bench and reads
its metrics back over stdout. In --emit-jsonl mode the bench/llm-bench
handlers print one {"metric":..,"value":..} line per metric and nothing
else; logs already go to stderr, so stdout stays pure JSONL.
In --emit-jsonl mode the child parses its own readings.out and appends
time_to/rsz/active for the model_ready and before_optimize checkpoints,
replacing the shell's grep/sed/cut scraping with a typed parser. Values
match the old scraping byte-for-byte on the same readings file.
tract bench-suite reads a TOML manifest and runs each bench in a fresh
child process via --emit-jsonl, so memory readings get the cold process
they need. It parses each child's JSONL stdout (any non-metric line is a
hard failure for that run), prefixes the metric names, fetches models
over HTTPS, pins the governor, and writes the metrics file -- replacing
bundle-entrypoint.sh's fetch/scrape/retry shell. benches.toml ports the
full bundle bench list. Feature-gated behind bench-suite (pulls toml/tar).
Add a bench_common module (series noise, red threshold, median, higher-
better) and a bench-expectations subcommand that emits the per-metric
'metric expected threshold' file from bench-data history -- output is
byte-identical to bench-expectations.py. bench-suite now takes that file
and re-runs a bench, fresh child each time, while it stays out of
threshold (same bar as a PR red), keeping the per-metric best up to
--retry-max times; without it the suite is single-shot.
bench-suite can now build retry expectations itself from a bench-data
checkout (--bench-data/--thresholds/--triple/--device) instead of
requiring a pre-generated --expectations file. The Python script needed a
separate job because it could not run on the constrained runners; the
Rust path has no such limit, so the bench host can do it in one command.
The standalone bench-expectations subcommand stays, sharing compute().
The bench step now runs 'tract bench-suite' instead of bundle-entrypoint.sh:
it fetches models over HTTPS, builds this device's retry expectations inline
from a read-only bench-data checkout, and retries. That removes the separate
expectations job (its only reason was running bench-expectations.py off the
Jetson, which lacks tomllib) and the AWS OIDC creds step (HTTPS fetch needs no
auth once the model/ prefix is public). The cli is built --features bench-suite
(cross.sh adds it for the Jetson). Append and report stay Python for now.

Requires the tract-ci-builds model/ prefix to be public-read; until then the
HTTPS fetch 403s.
tract bench-report replaces bench-report.py: it reads the per-device
results + bench-data reference, computes movers with the shared
bench_common red logic (now single-sourced with retry), and renders the
PR comment + job summary from editable minijinja templates in .travis/.
PR-comment output is byte-identical to the Python; the job summary
matches but for one trailing blank line. read_metrics now keeps file
order so equal-delta ties sort as before.
tract bench-append replaces bench-append.py: 4-sig-fig half-to-even
rounding, null-padded columnar arrays, indent=0 one-value-per-line JSON
-- byte-identical to the Python on fresh and append (so git diffs stay
small). The bench workflow's nightly append and PR report steps now call
the tract binary (the report job downloads the x86_64 artifact), so no
job runs Python. Deletes bench_common.py, bench-expectations.py,
bench-report.py, bench-append.py: the bench pipeline is Python-free.
zizmor flagged template-injection: ${{ needs.prepare.outputs.ref }} was
expanded straight into the report step's run block. Route it through a
PR_SHA env var like the rest of the workflow.
Carry the still-relevant fix/bench-retry-bugs fixes into the Rust suite
(that PR edits the shell/python files this branch deletes):
- single-source the comparison baseline as the median of recent non-null
  (bench_common::reference_value), used by both bench-expectations and the
  report, so retry and the PR red judge the same number (retry==red);
- report writes no comment when there are no comparable metrics, and the
  workflow gates the report on !cancelled() + skips posting a missing file,
  so a superseded run can't overwrite a real comment with 'no regressions';
- pin the GPU graphics clock for the run, reset on drop (cuda free-boost
  variance), alongside the CPU governor pin.

Also build/clippy/test the bench-suite feature in the crates workflow (it
is not a default feature, so nothing else compiled it), and add unit tests
for the threshold math.
The bench step hardcoded --cache-dir, overriding each self-hosted runner's
CACHEDIR env (which points at its curated model cache) and forcing a full
re-fetch into a cold dir. Drop the flag; the orchestrator already falls back
--cache-dir -> $CACHEDIR -> ~/.cache, so this honors CACHEDIR like the old
bundle did.
A bench's args can now be a plain string ("-i 264,40 --pulse 24") in
addition to the TOML array, so a working argument line pastes in as-is.
Custom deserializer; arrays keep working.
Convert every bench's args from a TOML array to the whitespace-string form
now supported, so entries read like a command line. No behavior change.
The Bench report job runs on pull_request, which gets a read-only token on
fork PRs and so can't comment (the comparison only landed in the job summary).
Split posting out: the report job uploads the rendered comment as an artifact,
and a new Bench-comment workflow runs on workflow_run (base context, write
token) and posts it -- working for fork and same-repo PRs alike. The target PR
is resolved from the run's head SHA, never from artifact contents, so a fork
can't redirect the comment at another issue.
Resolving the PR from the run's head SHA returns nothing for a fork PR (the
head commit lives in the fork, not this repo), so the trampoline posted no
comment -- the exact case it exists for. Pass the PR number in the artifact
instead, and validate it: fetch that PR and require its head SHA to equal the
run's, so a fork can name only its own PR, never redirect at another issue.
Mismatch (or a stale re-pushed run) skips quietly.
…e bits)

Training-free affine quantize<->dequantize for the KV cache: keep every token but at
fewer bits (configurable, 1..16). Keys per-CHANNEL (outlier channels get their own
scale), Values per-TOKEN (KIVI, Liu et al. 2024). Gentler than evicting; works for any
model. (CommVQ's RoPE-commutative codebook is a fancier follow-on.)

Validated: round-trip error <= scale/2 and shrinks with bits; per-channel >> per-token on
outlier channels; 8-bit near-lossless for attention output. Real GPT-2 (harness/
kv_quant_real.py): int8 ~0.5% attention deviation (near-lossless, 2x mem), graceful to
int2; int4 per-channel-K beats per-token-K 1.75-1.9x on early layers. Memory = bits/16 of
the f16 cache (int8 2x, int4 4x, int2 8x). 3 tests, fmt+clippy clean.

Follow-on: packed-int storage + a quantized KV-cache op (dequant-on-attend), composing
with the in-place (sonos#2321) / sliding-window (sonos#2327) caches; CommVQ codebook variant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ansform

Completes the KIVI-style KV-cache quantization integration:

1. QuantKeyCache: per-channel u8 storage for Keys. D channels each have a running
   scale; new tokens quantized under the current channel scale. Memory: T*D + D*8 bytes.
2. QuantValueCache: per-token u8 storage for Values. Each token D bytes + 2 f32 params.
   Memory: T*D + T*8 bytes (~4x vs f32 at large D).
3. QuantizedKvSdpa: stateful fused op (Op/EvalOp/TypedOp + OpState + freeze) that
   stores K/V in packed u8, dequantizes per-head on each decode step, attends via
   FlashSdpaOp (GQA handled). Real u8 bytes, not just float round-trip quality test.
4. QuantizedKvSdpaTransform: auto-wires {cache(K), cache(V), Sdpa} -> QuantizedKvSdpa.

6 tests: quant quality (3 existing) + packed_u8_saves_memory_vs_f32 (>3x saving) +
quantized_kv_sdpa_runs_in_model (engine correctness: near-lossless vs f32 reference) +
transform_fuses_cache_sdpa_to_quantized (structural auto-wiring). fmt+clippy clean,
transformers 18/0 no regression.

Configurable via the bits parameter (1..=16); int8 = near-lossless 4x vs f32 / 2x vs
f16. CommVQ codebook variant is the follow-on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tract_transformers_quantized_kv_sdpa primitive: axis + optional scale.
Round-trip test: axis and scale survive write_to_tar -> model_for_read.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
VLA SVE2 implementation of the row-wise RmsNorm primitive added by the
parent stack (sonos#2311 linalg slot + core/nn fast path; sonos#2314 NEON kernel).
Plugs into Ops::rms_norm_f32 in sve::plug() when FEAT_SVE2 is present on
Linux aarch64, overriding the NEON 4-lane kernel with wider lanes
(vl-dependent) and a predicated tail (no scalar epilogue).

Structure mirrors the NEON + AVX-512 kernels:

  Pass 1 — sum of squares via 4 svfloat32_t accumulator chains, 4*svcntw()
           lanes per iteration. Tail handled by a predicated svwhilelt_b32
           loop over the residue — no scalar epilogue.
  Pass 2 — broadcast inv_std into inv_v, fmul/st1 each 4-vec chunk;
           same predicated tail.

Width-agnostic by construction — identical correct output at any FEAT_SVE
streaming vector length (128 → 2048 bits). Wider VL = wider lanes, fewer
loop iterations, real perf scaling.

Validation (QEMU-only — no SVE hardware locally):
- 100 cases pass at SVL=128 (4 lanes), SVL=256 (8 lanes), SVL=512 (16
  lanes) via qemu-aarch64 -cpu max,sve{128,256,512}=on. Coverage: every
  size 1..33, hidden ∈ {768..8192} × 9 tail residues, huge rows up to
  32768, all-zero pathological. Bit-equivalent vs scalar within
  sqrt(n)-scaled tolerance.
- Local M1 macOS build clean (tract_sve cfg gated out; new code is purely
  additive — Linux aarch64 + FEAT_SVE2 only).

Expected gain over the NEON kernel scales with SVL:
- 128-bit SVE (rare Neoverse-N1):  ~0× (same width as NEON)
- 256-bit SVE (Graviton G3/G4):    ~1.3–1.8×
- 512-bit SVE (Neoverse-V2 wide):  ~2.5–4× (mirroring AVX-512 vs SSE)

Perf number unmeasured pending SVE hardware (AWS Graviton free tier).
Same validation shape as PR sonos#2268 (correctness via QEMU + bit-equivalent
vs the NEON fallback).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
all_zero and matches_neon_bit_close called the SVE2 kernel without the
has_sve2() check the other tests go through, so the cortex-a53 qemu CI
job died with SIGILL (signal 4) on the first ungated test. Route
all_zero through check() and add the explicit skip to the NEON
cross-check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ScaledMaskedSoftmax::eval hard-coded SoftmaxExp::Libc, so the fused
attention softmax always ran scalar libm expf and never reached the
linalg SIMD softmax kernels — a real perf gap. The naive fix (switch to
SoftmaxExp::FastCompact) trades correctness for speed and fails the
proptests two ways:

  1. FastCompact's Schraudolph exp is ~0.5% off true softmax — outside
     the suite's Approximate tolerance (f32 rtol 5e-4), 30%+ outliers.
  2. On a fully-masked row (all -inf) the FastCompact kernel pads the
     SIMD tail with f32::MIN and computes exp(f32::MIN - f32::MIN) ≈ 1,
     so the row sums to a nonzero value and yields a finite 0 where the
     scalar libc path (and the numpy reference) yield NaN (0 * 1/0).

Instead, add an accurate vectorizable exp and route the fused softmax
through it (mirrors ggml/llama.cpp, which kept an accurate vectorized
expf for softmax rather than a coarse approximation):

  * linalg: `accurate_exp_f32`, a Cephes-style range-reduced exp
    (Cody-Waite ln2 split + degree-6 poly + 2^n by exponent
    construction). Measured max rel error ~1.9e-6 vs libc over the
    softmax domain [0, -60]. exp(0)==1 and exp(-inf)==0 exactly; deep
    underflow flushes to 0; NaN propagates.
  * linalg: `SSoftMaxL2Accurate` / `HSoftMaxL2Accurate` map-reduce
    kernels, exposed as `softmax2_accurate_{f32,f16}`. They pad the
    SIMD tail with -inf (not f32::MIN), so masked/padding lanes
    contribute exactly 0 and a fully-masked row sums to 0 -> NaN,
    matching libc and the reference.
  * core: new `SoftmaxExp::Accurate` variant + dispatch.
  * nnef: `exp = "accurate"` de/serialization round-trip.
  * transformers: ScaledMaskedSoftmax::eval uses SoftmaxExp::Accurate.

New linalg tests validate the accurate exp against libc (not against
itself, unlike the existing FastCompact frame test) and cover the
fully-masked degenerate row. scaled_masked_softmax + sdpa proptests
(f16/f32, raw/decluttered/optimized) pass on native and wasm32-wasip1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kali kali force-pushed the feat/arm64-neon-fp16-activations branch from 476ef48 to a18d911 Compare June 18, 2026 14:01
czoli1976 and others added 2 commits June 19, 2026 13:00
Use a bitmask flush instead of a terminal branch so LLVM can keep the
exp dataflow straight-line and auto-vectorize softmax map-reduce loops,
while preserving exact underflow-to-zero and NaN propagation behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use mul_add for range reduction and polynomial evaluation in accurate_exp_f32.
This keeps the same numerical behavior while reducing scalar instruction
count and improving vector-friendly straight-line code generation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants