arm64 SDOT int8 matmul kernel (FEAT_DotProd) — CI mirror of sonos#2278#13

Open
czoli1976 wants to merge 37 commits into base/sonos-main from feat/int8-sdot-kernel

@czoli1976

Owner

Fork-internal PR to run the Embedded-targets / Rust-crates CI for the SDOT kernel + the new build.rs sdot assembler probe (stretch fallback). Tracks sonos#2278. Not for review here.

kali and others added 30 commits May 28, 2026 11:00
`TypedModelPatch::shunt_outside` leaves the shunted node in the graph,
but the NNEF `patch` transform also implicitly removed model inputs
whose name appeared on the LHS.  That hidden side-effect made
`patch` do two things at once: substitute a wire, and trim the
interface.  Drop the trimming.

Add a sibling `select_inputs(inputs: [...])` transform shaped like
`select_outputs`.  The pulse pipeline now reads:

  -t 'patch(body: "length = tract_core_shape_of(input_signal)[1];")' \
  -t 'select_inputs(inputs: ["input_signal"])'              \
  -t 'select_outputs(outputs: ["processed_signal"])'        \
  -t 'pulse(symbol: ..., pulse: ...)'

Discarded Sources stay in the graph until declutter prunes them.

Wire-up: `Graph::select_inputs_by_name` (mirror of
`select_outputs_by_name`) + `with_inputs_by_name` + transform
registration.  Updated harness/nemotron + nemo-nemotron-asr +
nemo-nemotron-streaming-asr to add the explicit `select_inputs` step.
The 'without-default-features' job in full.yml (cargo check -p tract-cli
--no-default-features) regressed after the cuda-12XXX split: cudarc and
tract-cuda were still pulled in unconditionally on linux/windows targets,
so stripping the cuda-13000 default left cudarc with no API-version
feature and its build script panicked.

Make both deps optional in tract-cli and tract-libcli, and have each
cuda-XXXXX feature pull them in (dep:cudarc + dep:tract-cuda +
tract-cuda/cuda-XXXXX + tract-libcli/cuda).  Adds a marker 'cuda'
feature so cudarc-touching code in bench.rs / dump.rs / libcli/lib.rs
can gate cleanly.

test-cuda explicitly opts into cuda-13000 (workspace dep has
default-features=false now), so 'cargo test -p tract-cuda -p test-cuda'
keeps building.
Unify the four overlapping names for 'bind a symbol to a value across
the model graph' under one verb:

  - core: `TypedModel::substitute_symbols` → `set_symbols`
  - core: `TypedOp::substitute_symbols` trait method → `set_symbols`
  - transform name: `concretize_symbols(values: …)` → `set_symbols(values: …)`
  - Rust API: `ConcretizeSymbols` → `SetSymbols`
  - Python API: `tract.ConcretizeSymbols` → `tract.SetSymbols`

The CLI `--set B=1` flag was already aligned and is unchanged.  No
deprecation aliases — hard rename across cli, harness scripts, examples
and Python bindings.

The Rust API builder gains a `SetSymbols::expr(name, str)` companion
to `value(name, i64)` so callers can pass TDim expressions (e.g.
`'2*S'`) the way the CLI `--set` and the transform already do.

`TDim::substitute` / `TDim::substitute_all` are unchanged: they
operate on a single TDim expression, not on the model, and "substitute"
is the accurate verb for that level.
The top-level `--set` flag was already TDim-aware via `parse_set_subs`
in params.rs; the `run` subcommand had a parallel `--set` flag that
only accepted plain i64.  Parse RHS as a TDim against the model's
symbol scope and reduce to i64 with the symbols set so far on the
command line, so `run --set FOO=2 --set T=2*FOO` resolves cleanly.

Order on the command line is significant: a symbol referenced on the RHS
must already have been set by an earlier `--set`.  Otherwise the command
errors out with the unresolved name in the message.
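
A minimal sketch of the left-to-right resolution described above. It is not tract's `parse_set_subs`: the `resolve_sets` helper is hypothetical, and its grammar (products of integer literals and symbol names) is a deliberately tiny stand-in for real TDim parsing.

```rust
use std::collections::HashMap;

/// Resolve `NAME=EXPR` assignments left to right.
/// EXPR is limited here to `*`-separated integer literals and symbol names
/// (an assumption for brevity; real TDim expressions are richer).
pub fn resolve_sets(args: &[&str]) -> Result<HashMap<String, i64>, String> {
    let mut values: HashMap<String, i64> = HashMap::new();
    for arg in args {
        let (name, rhs) = arg
            .split_once('=')
            .ok_or_else(|| format!("expected NAME=EXPR, got `{}`", arg))?;
        let mut product = 1i64;
        for factor in rhs.split('*') {
            let factor = factor.trim();
            let value = match factor.parse::<i64>() {
                Ok(v) => v,
                // Only symbols set earlier on the command line are visible.
                Err(_) => match values.get(factor) {
                    Some(v) => *v,
                    None => {
                        return Err(format!("unresolved symbol `{}` in `{}`", factor, arg))
                    }
                },
            };
            product *= value;
        }
        values.insert(name.trim().to_string(), product);
    }
    Ok(values)
}
```

With this sketch, `["FOO=2", "T=2*FOO"]` resolves `T` to 4, while the reversed order fails and names `FOO` in the error.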
The optimized Scan body runs the same plan with the same shapes every
timestep, so resolve its symbols once, reset between iters without
discarding them (reset_turn_keep_symbols), and reuse one drained input
buffer -- instead of a full model_state.run() cycle (set_inputs ->
resolve_symbols -> exec -> outputs -> reset_turn) per timestep.

Bit-identical to the old path across GRU/LSTM/RNN + df_dec. No measurable
wall-clock impact on fixed main (within +/-1% noise on gru/lstm/rnn 128/50
& 256/100 and df_dec, single-thread); kept as a cleanup of the per-iter
re-entry path, not as a perf change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prefill-only GroupQueryAttention lowered onto tract Sdpa: reshapes Q/K/V to
4D, applies an explicit lower-triangular causal mask, and returns
present_key/present_value (the reshaped K/V). Sdpa handles the grouped-query
head sharing (kv_num_heads < num_heads). Decode-step KV cache, internal
rotary (do_rotary), local-window attention and softcap are rejected with
clear errors.

Validated against onnxruntime across head_size 8/16/64, several
num_heads/kv_num_heads ratios (incl. multi-query kv=1) and batch>1: attention
output matches to <=3.6e-7 and present_key/present_value are bit-exact.

ORT's GroupQueryAttention prefill is standard causal grouped-query attention;
the seqlens_k input is the 0-indexed position of the last token
(total_sequence_length - 1), not the token count.
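
The mask and the `seqlens_k` convention can be sketched as follows. This is an illustration only, not the importer's code: the function names are hypothetical, and the contiguous query-to-KV head grouping in `kv_head_for` is the usual grouped-query convention, assumed here rather than taken from the commit.

```rust
/// Lower-triangular causal mask: query position `q` may attend to key
/// position `k` only when `k <= q`.
pub fn causal_mask(seq_len: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|q| (0..seq_len).map(|k| k <= q).collect())
        .collect()
}

/// `seqlens_k` as described above: the 0-indexed position of the last token,
/// not the token count. Assumes `total_sequence_length >= 1`.
pub fn seqlens_k(total_sequence_length: usize) -> usize {
    total_sequence_length - 1
}

/// KV head shared by a given query head, assuming contiguous groups and
/// `num_heads % kv_num_heads == 0`.
pub fn kv_head_for(query_head: usize, num_heads: usize, kv_num_heads: usize) -> usize {
    query_head / (num_heads / kv_num_heads)
}
```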

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pin the resolved dependency graph so debug builds, release artifacts,
SBOMs and security audits all see the same versions.  CI runs against
this lockfile; `cargo update` is the explicit knob for bumping deps.
For each tract-cli release artifact (per target triple), generate both
CycloneDX and SPDX SBOMs from the workspace (Cargo.lock-driven) via
anchore/sbom-action and upload them alongside the .tgz.  Also pass
--locked to the release build so the SBOM matches the resolved deps
exactly.

The sbom-action ref is currently the v0.18.0 tag — dependabot
github-actions runs weekly and will SHA-pin on its next pass.
Each tract-cli release tarball now gets two GitHub attestations
(CycloneDX and SPDX SBOM, via actions/attest-sbom).  Anyone can
verify after download with:

  gh attestation verify tract-<triple>-<version>.tgz --owner sonos

Requires `id-token: write` + `attestations: write` on the job.
sbom-action's `upload-release-assets: false` keeps the SBOM files
out of its own upload path so the explicit softprops step is the
single source of release artifacts.
- cargo auditable wraps the release build so the resolved dependency
  graph lands inside the binary itself.  Consumers can recover the
  SBOM with `cargo audit bin tract` without needing the published
  .cdx.json / .spdx.json files.
- actions/attest-build-provenance@v2 signs the .tgz with provenance
  metadata (workflow ref, commit SHA, runner).  Combined with the
  existing SBOM attestations this lands at SLSA Build Level 3.
Pinned commits (latest stable as of writing):
- anchore/sbom-action @ e22c389 (v0.24.0)
- actions/attest-sbom @ c604332 (v4.1.0)
- actions/attest-build-provenance @ a2bbfa2 (v4.1.0)

Matches the existing SHA + comment convention used for
actions/checkout and softprops/action-gh-release; dependabot's
github-actions group will keep them current.
Two-part change so consumers can audit the deps that landed in the
tract Python wheel without needing to re-clone the Rust workspace:

1. `api/py/pyproject.toml` (Linux + macOS cibuildwheel before-build):
   install cargo-auditable and write a one-line bash shim that
   prefixes `auditable` to every cargo invocation.  setuptools_rust
   honours $CARGO (build.py:97), so pointing CARGO at the shim makes
   the Rust .so inside the wheel carry its dep graph in the
   `.dep-v0` ELF/Mach-O section.  Windows wheels stay as-is for now
   (TODO comment).

2. `.github/workflows/wheels.yml` + `.github/scripts/inject_wheel_sboms.py`:
   after cibuildwheel emits each .whl, install syft (via
   anchore/sbom-action/download-syft, SHA-pinned), unpack the wheel,
   scan its contents (syft's rust-audit-binary cataloger reads the
   embedded cargo-auditable section), drop sbom.cdx.json +
   sbom.spdx.json into `<dist-info>/sboms/` per PEP 770, and
   re-pack via `wheel pack` (which regenerates RECORD with hashes).

Smoke-tested locally on a sample wheel: SBOMs end up at the right
path and RECORD has correct sha256 entries.
atty (0.2.x) is unmaintained and triggers RUSTSEC-2021-0145 on SBOM
audits.  It's only used in two places — both `is stderr a TTY`
checks in `tract hwbench` — and std::io::IsTerminal (stable since
1.70, well below tract's MSRV) is a drop-in.

`cargo tree -i atty` after the change reports the crate is no longer
in the workspace dep graph.
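
For reference, the standard-library replacement looks roughly like this. The wrapper name `stderr_is_tty` is hypothetical; only `std::io::IsTerminal` and `is_terminal()` come from the standard library.

```rust
use std::io::IsTerminal;

/// Equivalent of the old `atty::is(atty::Stream::Stderr)` check.
pub fn stderr_is_tty() -> bool {
    std::io::stderr().is_terminal()
}
```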
runtime_for_name("gpu")        → first GPU backend whose `check()`
                                 passes (metal, then cuda); error if
                                 none are available.
runtime_for_name("gpu-or-cpu") → same lookup, but falls through to
                                 the `default` CPU runtime instead
                                 of erroring.

No new mechanism — both names walk the existing inventory and use each
backend's existing `check()` to decide availability.  Backend-specific
names (`cuda`, `metal`) still work as before.
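
A rough model of the two lookups, under stated assumptions: `Backend` is a hypothetical stand-in for a registered runtime (not tract's trait), `available` stands in for the result of `check()`, and the slice order plays the role of the inventory order (metal, then cuda).

```rust
/// Hypothetical stand-in for a registered backend.
pub struct Backend {
    pub name: &'static str,
    pub is_gpu: bool,
    /// Stands in for the result of the backend's `check()`.
    pub available: bool,
}

fn first_available_gpu(inventory: &[Backend]) -> Option<&Backend> {
    inventory.iter().find(|b| b.is_gpu && b.available)
}

fn by_name<'a>(inventory: &'a [Backend], name: &str) -> Option<&'a Backend> {
    inventory.iter().find(|b| b.name == name && b.available)
}

pub fn runtime_for_name<'a>(inventory: &'a [Backend], name: &str) -> Result<&'a Backend, String> {
    match name {
        // First GPU backend whose check passes, error if none.
        "gpu" => first_available_gpu(inventory)
            .ok_or_else(|| "no GPU backend available".to_string()),
        // Same lookup, falling through to the `default` CPU runtime.
        "gpu-or-cpu" => first_available_gpu(inventory)
            .or_else(|| by_name(inventory, "default"))
            .ok_or_else(|| "no runtime available".to_string()),
        // Backend-specific names keep working as before.
        other => by_name(inventory, other)
            .ok_or_else(|| format!("unknown or unavailable runtime `{}`", other)),
    }
}
```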
…or_name

The CPU runtime now reports its own name as `cpu` (which is what it
is), so `list-runtimes` shows `cpu`, `cuda`, `metal` … instead of
the misleading `default`.

Back-compat for callers passing `default` is handled by a one-line
alias in `runtime_for_name` rather than by registering two runtimes
or by polluting the trait — the alias only affects name lookup, not
the inventory.
The `tensorflow` 0.21.0 crate (Rust binding for libtensorflow) was
only pulled in behind the dead `conform` cargo feature — which gated
`tract compare --tf` (compare tract output against running on
libtensorflow on the same model).  The feature isn't enabled in any
GitHub workflow; only a stranded `.travis/tf.sh` ever ran it.

The upstream `tensorflow` crate hasn't shipped since 2023-08-15 and
pins to rust-protobuf 2.27.x, which trips RUSTSEC-2024-0437.  Drop
the feature and all its plumbing.

Tract's own `.pb` parsing (used by `-t transformers_detect_all` and
the `tf` cargo feature in tract-cli) goes through prost and is
unaffected — the `tract-tensorflow` crate stays, just without the
libtensorflow runtime.  Cargo.lock shrinks by ~350 lines as a
side-effect.
The LayerNorm op's `wire` expansion casts `normalized` back to
fact.datum_type *before* applying scale/bias, then multiplies that
result with `cast_scale` (which is still in self.datum_type, F32).

With F16 inputs this becomes F16 × F32, which `mul()` promotes to an
F32 output. The inference rule then asserts
`outputs[0].datum_type == inputs[0].datum_type` (F16) against the
actual F32 output, failing `into_typed()` with:

    Output mismatch after rewiring expansion for output #0:
    expected 1,256,384,F16 got 1,256,384,F32

Fix: defer the cast back to fact.datum_type until after all scale/bias
operations. Now the expansion stays entirely in self.datum_type (F32)
through normalized × scale + bias, and casts only the final result.

Behavior is unchanged for F32 inputs (the final cast is a no-op when
fact.datum_type == self.datum_type).

Reproduced with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
exported via `optimum.exporters.onnx.main_export(..., dtype="fp16")`
and loaded with `into_optimized().into_runnable()`.
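
The type flow before and after the fix can be modelled with a toy two-variant type. This is only a sketch of the ordering argument: the names below are hypothetical, and `binary_type` encodes the mixed-type promotion to F32 reported in the error above rather than tract's actual wiring code.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DatumType {
    F16,
    F32,
}

/// Result type of a binary op: mixed F16/F32 operands yield F32.
fn binary_type(a: DatumType, b: DatumType) -> DatumType {
    if a == b { a } else { DatumType::F32 }
}

/// Old order: cast `normalized` to the input type first, then apply
/// scale and bias, which are still in the op's own type.
pub fn output_type_before_fix(input: DatumType, op: DatumType) -> DatumType {
    let normalized = input;
    let scaled = binary_type(normalized, op);
    binary_type(scaled, op)
}

/// New order: stay in the op's type through scale and bias, then cast
/// only the final result back to the input type.
pub fn output_type_after_fix(input: DatumType, op: DatumType) -> DatumType {
    let scaled = binary_type(op, op);
    let _biased = binary_type(scaled, op);
    input
}
```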
The single-thread MMM tile walk used a naive nested loop, re-streaming the
full inner operand (all of A in col-outer / B in row-outer) per panel at
large k, which is memory/L1-bound. The multithread path already 2D-blocks the
panel grid (chunk_grid); this brings the same blocking to the single-thread
path, with the block edge cache-derived (detected L2/3, conservative 256 KiB
fallback) so it stays L2-resident across hardware and never over-blocks a
cache it cannot see.

Bit-identical: it only reorders independent tiles (each computes its full-k
reduction into a disjoint C region). The block-edge floor of 1 degrades
exactly to the naive loop; the cap of 16 matches the multithread chunk_grid
blocking already shipped on all platforms. Frame-level, so all kernels
benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV /
multithreaded shapes are unchanged.

Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against
the naive reference (the existing frame proptests only reach 3 panels).
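
The reordering can be sketched as two visit orders over the same panel grid. This is an illustration, not the frame code: `naive_walk` and `blocked_walk` are hypothetical names, the row-outer order is chosen arbitrarily, and the cache-derived choice of `edge` is left out (only the floor of 1 and cap of 16 are modelled).

```rust
/// Naive row-outer walk over an `m_panels x n_panels` tile grid.
pub fn naive_walk(m_panels: usize, n_panels: usize) -> Vec<(usize, usize)> {
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for m in 0..m_panels {
        for n in 0..n_panels {
            order.push((m, n));
        }
    }
    order
}

/// 2D-blocked walk: visit the grid block by block, `edge` panels per side.
/// Every tile is visited exactly once; only the order changes.
pub fn blocked_walk(m_panels: usize, n_panels: usize, edge: usize) -> Vec<(usize, usize)> {
    let edge = edge.max(1).min(16); // floor of 1, cap of 16
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for m0 in (0..m_panels).step_by(edge) {
        for n0 in (0..n_panels).step_by(edge) {
            for m in m0..(m0 + edge).min(m_panels) {
                for n in n0..(n0 + edge).min(n_panels) {
                    order.push((m, n));
                }
            }
        }
    }
    order
}
```

Because each tile writes a disjoint region of C, any permutation of the visit order yields the same result, which is why the change can be bit-identical.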

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant
(it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt
--check` failed in CI. Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Passing 'v0.24.0' to post-release.sh writes 'version = "v0.24.0"' into
every Cargo.toml — invalid semver, breaks the workspace, easy to do by
muscle memory because the git tag does carry the 'v' prefix. Bail out
early in both release.sh and post-release.sh when the argument doesn't
match an unprefixed semver.
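
The guard itself lives in the shell scripts; the same check is sketched here in Rust for illustration. The function name is hypothetical, and rejecting pre-release or build suffixes is a simplifying assumption that the actual scripts may not share.

```rust
/// True when `version` is a plain MAJOR.MINOR.PATCH with numeric components
/// and no leading `v`.
pub fn is_unprefixed_semver(version: &str) -> bool {
    let parts: Vec<&str> = version.split('.').collect();
    parts.len() == 3
        && parts
            .iter()
            .all(|p| !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit()))
}
```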
actions/checkout runs with persist-credentials: false, so the bare
'git push origin gh-pages' had no auth and failed with 'could not
read Username'. Use the workflow's GITHUB_TOKEN in the remote URL
instead — keeps zizmor happy while letting the deploy step push.
mathieupoumeyrolsonos and others added 2 commits June 1, 2026 13:24
arm64simd_mmm_i32_8x8_dot: an int8->i32 8x8 matmul kernel using SDOT
(FEAT_DotProd, ARMv8.2), ~4x the SMLAL 8x8 at the matmul level. Same v16..v31
tile layout as the SMLAL 8x8, so it reuses the existing i32 fuse/store/q_scale
machinery, and consumes the K=4-inner PackedI8K4 packing now upstream (sonos#2281).

- Gated on has_dotprod() (Apple M1+/A11+; Linux HWCAP_ASIMDDP). TRACT_DOTPROD_DISABLE=1
  forces the SMLAL 8x8 fallback so callers can A/B on one binary.
- Wired into qmmm_i32: int8 matmul/conv pick SDOT when FEAT_DotProd is present,
  SMLAL 8x8 otherwise. Relies on the merged dispatch fix (sonos#2277) to route 2D int8
  matmuls to a matrix kernel instead of the 64x1 GEMV.
- Adds linalg/benches/qmmm_i8.rs (SDOT vs SMLAL microbench).

The kernel is compiled in a separate cc::Build step gated on a build.rs assembler
probe (assembler_supports_dotprod). Old assemblers such as binutils 2.28 on Debian
stretch cannot encode `sdot` and fail the probe; the `tract_arm64_dotprod` cfg is
then not set, the kernel is omitted, and dispatch falls back to the SMLAL 8x8 i32
kernel. Follows the same pattern as the existing assembler_supports_sme probe.

Bit-exact vs the SMLAL kernel: linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale),
core int8 matmul 25/25. Apple M4 e2e (kernel unchanged from the original PR):
MiniLM 44.4->24.8 ms (1.79x), InceptionV1 51.6->28.4 ms (1.82x).
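
As a scalar reference for what the kernel computes: each SDOT lane accumulates a 4-element int8 dot product into an i32. The sketch below is not the assembly kernel, and the panel layout it assumes (per group of four k-values, 8 rows or columns of 4 consecutive bytes) is an illustrative guess at a K=4-inner packing, not a description of `PackedI8K4`.

```rust
/// Scalar model of one SDOT lane: acc += dot(a, b), with i8 widened to i32.
pub fn sdot_lane(acc: i32, a: [i8; 4], b: [i8; 4]) -> i32 {
    a.iter()
        .zip(b.iter())
        .fold(acc, |s, (&x, &y)| s + x as i32 * y as i32)
}

fn quad(panel: &[i8], offset: usize) -> [i8; 4] {
    [panel[offset], panel[offset + 1], panel[offset + 2], panel[offset + 3]]
}

/// 8x8 i32 tile from two K=4-inner packed panels (assumed layout, see above).
pub fn tile_8x8(a_packed: &[i8], b_packed: &[i8], k: usize) -> [[i32; 8]; 8] {
    assert!(k % 4 == 0 && a_packed.len() == 8 * k && b_packed.len() == 8 * k);
    let mut c = [[0i32; 8]; 8];
    for group in 0..k / 4 {
        for row in 0..8 {
            let a = quad(a_packed, group * 32 + row * 4);
            for col in 0..8 {
                let b = quad(b_packed, group * 32 + col * 4);
                c[row][col] = sdot_lane(c[row][col], a, b);
            }
        }
    }
    c
}
```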

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
czoli1976 force-pushed the feat/int8-sdot-kernel branch from 5da6329 to 176ada9 on June 1, 2026 12:03
czoli1976 and others added 5 commits June 1, 2026 16:27
…8x8)

Route qmmm_i32 through VPDPBUSD when AVX-512 VNNI is available, replacing the
AVX2 per-K widening-multiply inner loop. Consumes the existing K=4-inner
PackedI8K4 layout; A is offset by +128 for VPDPBUSD's u8*s8 form and the
128*sum_k(B) bias is removed per output column, so the i32 accumulators stay
bit-identical to the AVX2 path and the whole quantization epilogue is reused.
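
The offset trick rests on the identity sum_k((a_k + 128) * b_k) = sum_k(a_k * b_k) + 128 * sum_k(b_k). A scalar sketch for a single group of four, with hypothetical function names and no claim about how the kernel schedules the correction:

```rust
/// Scalar model of one VPDPBUSD lane: acc += sum(u8 * s8) over 4 bytes.
pub fn vpdpbusd_lane(acc: i32, a_u8: [u8; 4], b_s8: [i8; 4]) -> i32 {
    a_u8.iter()
        .zip(b_s8.iter())
        .fold(acc, |s, (&x, &y)| s + x as i32 * y as i32)
}

/// Signed i8 x i8 dot product computed through the unsigned-by-signed form:
/// offset A by +128, then remove the 128 * sum(B) bias.
pub fn signed_dot_via_offset(a: [i8; 4], b: [i8; 4]) -> i32 {
    let a_offset = [
        (a[0] as i32 + 128) as u8,
        (a[1] as i32 + 128) as u8,
        (a[2] as i32 + 128) as u8,
        (a[3] as i32 + 128) as u8,
    ];
    let sum_b: i32 = b.iter().map(|&y| y as i32).sum();
    vpdpbusd_lane(0, a_offset, b) - 128 * sum_b
}

/// Direct signed reference.
pub fn signed_dot_reference(a: [i8; 4], b: [i8; 4]) -> i32 {
    a.iter().zip(b.iter()).map(|(&x, &y)| x as i32 * y as i32).sum()
}
```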

Runtime-gated via where(AVX512VNNI); non-VNNI x86 keeps the AVX2 fallback.
Includes a vnni_i32 microbench (VNNI vs AVX2 int8).

The kernel lives in its own x86_64/avx512vnni/ subdirectory and is compiled
in a separate cc::Build step gated on a build.rs assembler probe
(assembler_supports_avx512vnni). Old assemblers such as binutils 2.28 on
Debian stretch cannot encode `vpdpbusd ymm` and will fail the probe; the
`tract_avx512vnni` cfg is then not set and the kernel is omitted entirely,
with dispatch falling back to the AVX2 i32 path. Follows the same pattern as
the existing SME (assembler_supports_sme) and SVE (compiler_supports_sve)
probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an ONNX LSTM/GRU/RNN exposes its full recurrent state both as input
and output (initial_h + Y_h, plus initial_c + Y_c for LSTM), the caller
manages state across calls. Set Scan::external_state in that case so the
existing declutter_single_loop pass can inline a single-iteration Scan
(seq_len == 1) — the streaming / autoregressive-decoder regime where the
one-iteration Scan is pure orchestration overhead.

Previously external_state was only reachable via the manual
force_scan_external_state transform, so streaming RNNs carried a dead Scan
on every call. Inlining is sound here because the body's State input is fed
from the outer (caller-supplied) input each call (see issue sonos#2157).

Measured on DTLN (an LSTM-heavy streaming denoiser): -8% end-to-end, output
unchanged at 110.47 dB / Pearson 1.00000 vs the native reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t flag

The previous importer heuristic set external_state whenever a GRU/LSTM node
carried initial_h/Y_h (and initial_c/Y_c). That mis-fires: DFN3's GRU nodes
also carry initial_h/Y_h, but their state is carried internally by tract under
pulse, not by the caller — so inlining the single-iteration Scan would break
it (the 0.23 regression kali flagged).

Move the decision into declutter_single_loop, which has the whole graph:
inline a single-iteration Scan only when every recurrent state has a
last-value output that reaches a model output, i.e. the caller can observe the
updated state and thread it back. Adds outlet_reaches_model_output.

DTLN (state feeds a model output) still inlines, output unchanged at
110.47 dB / Pearson 1.00000. DFN3 df_dec (Y_h reaches no model output, only
coefs) is not inlined.
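
The inlining condition amounts to a reachability test. A minimal sketch, assuming a hypothetical adjacency-list graph over node indices (tract works on outlets, and `outlet_reaches_model_output` itself is not reproduced here):

```rust
use std::collections::HashSet;

/// `successors[n]` lists the nodes that consume node `n`'s output.
pub fn reaches_model_output(
    successors: &[Vec<usize>],
    model_outputs: &[usize],
    start: usize,
) -> bool {
    let mut seen = HashSet::new();
    let mut stack = vec![start];
    while let Some(node) = stack.pop() {
        if model_outputs.contains(&node) {
            return true;
        }
        // Visit each node once so cycles and shared consumers terminate.
        if seen.insert(node) {
            stack.extend(successors[node].iter().cloned());
        }
    }
    false
}

/// Inline a single-iteration Scan only when every recurrent state's
/// last-value output reaches some model output.
pub fn can_inline(
    successors: &[Vec<usize>],
    model_outputs: &[usize],
    state_last_values: &[usize],
) -> bool {
    state_last_values
        .iter()
        .all(|&n| reaches_model_output(successors, model_outputs, n))
}
```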

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
De-orphan and fix the latent zmm sigmoid/tanh kernels (tail-loop stride bugs
causing OOB stores for lengths not a multiple of 64), and add AVX-512
hardswish, leaky_relu, plus silu/gelu as compositions over the AVX-512
sigmoid/tanh. Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA/generic
path.

Measured on Cascade Lake (single-thread): sigmoid 1.24x and tanh 1.29x over
the existing FMA paths; hardswish/leaky_relu/silu/gelu 5-21x over the generic
scalar paths.
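
Scalar references for the activations, plus a body-and-tail loop shape that handles lengths that are not a multiple of the vector width. These are sketches, not the zmm kernels: the tanh-based GELU approximation is assumed from "compositions over sigmoid/tanh", and `map_in_place` is a hypothetical helper.

```rust
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

pub fn silu(x: f32) -> f32 {
    x * sigmoid(x)
}

/// GELU via the common tanh approximation (assumed variant).
pub fn gelu_tanh(x: f32) -> f32 {
    let sqrt_2_over_pi = 0.797_884_56_f32;
    0.5 * x * (1.0 + (sqrt_2_over_pi * (x + 0.044_715 * x * x * x)).tanh())
}

pub fn hardswish(x: f32) -> f32 {
    x * (x + 3.0).max(0.0).min(6.0) / 6.0
}

pub fn leaky_relu(x: f32, alpha: f32) -> f32 {
    if x >= 0.0 { x } else { alpha * x }
}

/// Apply `f` in full chunks of `lanes` elements, then finish the remainder
/// element by element so nothing past the end of `data` is touched.
pub fn map_in_place(data: &mut [f32], lanes: usize, f: impl Fn(f32) -> f32) {
    let mut chunks = data.chunks_exact_mut(lanes);
    for chunk in &mut chunks {
        for v in chunk.iter_mut() {
            *v = f(*v);
        }
    }
    for v in chunks.into_remainder() {
        *v = f(*v);
    }
}
```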

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 participants