linalg/x86_64: add AVX-512 element-wise activations #4

Open
czoli1976 wants to merge 15 commits into base/sonos-main from feat/avx512-activations

czoli1976 commented May 28, 2026

Summary

  • De-orphan and fix latent zmm sigmoid_f32 / tanh_f32 kernels (tail-loop stride bugs that caused OOB stores for lengths not a multiple of 64).
  • Add AVX-512 versions of hardswish and leaky_relu.
  • Add silu and gelu as compositions over the AVX-512 sigmoid / tanh.
  • Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA / generic path.
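
For reference, the activations named above can be written as scalar functions. This is a minimal sketch, not the PR's kernels: the function names are illustrative, and the tanh-based GELU approximation is an assumption made because the summary says gelu is composed over tanh.

```rust
/// Logistic sigmoid: 1 / (1 + e^-x).
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// SiLU as a composition over sigmoid: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x * sigmoid(x)
}

/// GELU, tanh approximation (assumed variant, see the note above):
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
pub fn gelu_tanh(x: f32) -> f32 {
    let c = (2.0f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// Hard swish: x * clamp(x + 3, 0, 6) / 6.
pub fn hardswish(x: f32) -> f32 {
    x * (x + 3.0).clamp(0.0, 6.0) / 6.0
}

/// Leaky ReLU with slope `alpha` on the negative side.
pub fn leaky_relu(x: f32, alpha: f32) -> f32 {
    if x >= 0.0 { x } else { alpha * x }
}
```

Scalar definitions like these are mainly useful as an oracle when comparing a vectorized kernel's output within a tolerance.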

Bench (local, single-thread, Cascade Lake, throughput Gelem/s):

  • sigmoid_f32: FMA 2.74 → AVX-512 3.41 (1.24×)
  • tanh_f32: FMA 2.78 → AVX-512 3.59 (1.29×)
  • hardswish_f32: generic 1.01 → AVX-512 15.4 (15.4×, vs scalar baseline)
  • leaky_relu_f32: generic 1.79 → AVX-512 26.2 (14.6×, vs scalar baseline)
  • silu_f32: generic 0.066 → AVX-512 1.39 (~21×, vs scalar baseline)
  • gelu_f32: generic 0.084 → AVX-512 0.50 (5.9×, vs scalar baseline)

An AVX-512 mul_by_scalar variant was prototyped but consistently regressed ~28% vs the existing FMA path on Cascade Lake (the op is too light to amortize the zmm frequency-license clock drop) and is therefore not included.

Test plan

  • cargo test --release -p tract-linalg — 2687 passed, 0 failed (includes non-multiple-of-64 lengths exercising the fixed tail loop)
  • Microbench: AVX-512 vs FMA / generic per activation
  • Non-AVX512 x86 hosts unchanged (fallback exercised via avx512f gating)
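
The tail-loop concern in the test plan can be illustrated in safe Rust. The sketch below assumes a block size of 64 elements only because the description talks about lengths that are not a multiple of 64; the real kernel's unroll factor and the names `apply_blocked` and `matches_scalar` are hypothetical.

```rust
/// Elements handled per unrolled iteration (assumed, see the note above).
const BLOCK: usize = 64;

/// Apply `f` in place: full blocks first, then the remaining tail elements.
pub fn apply_blocked(xs: &mut [f32], f: impl Fn(f32) -> f32) {
    let mut blocks = xs.chunks_exact_mut(BLOCK);
    for block in &mut blocks {
        // Stand-in for the vectorized body working on one full block.
        for x in block.iter_mut() {
            *x = f(*x);
        }
    }
    // The remainder holds len % BLOCK elements, so nothing past the end
    // of the slice is ever read or written.
    for x in blocks.into_remainder() {
        *x = f(*x);
    }
}

/// Compare the blocked path against a plain scalar loop for one length.
pub fn matches_scalar(len: usize) -> bool {
    let input: Vec<f32> = (0..len).map(|i| i as f32 * 0.25 - 4.0).collect();
    let expected: Vec<f32> = input.iter().map(|&x| x.max(0.0)).collect();
    let mut actual = input;
    apply_blocked(&mut actual, |x| x.max(0.0));
    actual == expected
}
```

Checking lengths just below, at, and just above each block boundary (for example 63, 64, 65) is the cheapest way to catch a wrong tail stride.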

czoli1976 force-pushed the feat/avx512-activations branch 3 times, most recently from 3045e23 to d2c76b8 on May 28, 2026 08:55
czoli1976 changed the title from "add AVX-512 element-wise activations (x86_64)" to "linalg/x86_64: add AVX-512 element-wise activations" on May 28, 2026
kali added 2 commits May 28, 2026 11:00

`TypedModelPatch::shunt_outside` leaves the shunted node in the graph,
but the NNEF `patch` transform also implicitly removed model inputs
whose name appeared on the LHS.  That hidden side-effect made
`patch` do two things at once: substitute a wire, and trim the
interface.  Drop the trimming.

Add a sibling `select_inputs(inputs: [...])` transform shaped like
`select_outputs`.  The pulse pipeline now reads:

  -t 'patch(body: "length = tract_core_shape_of(input_signal)[1];")' \
  -t 'select_inputs(inputs: ["input_signal"])'              \
  -t 'select_outputs(outputs: ["processed_signal"])'        \
  -t 'pulse(symbol: ..., pulse: ...)'

Discarded Sources stay in the graph until declutter prunes them.

Wire-up: `Graph::select_inputs_by_name` (mirror of
`select_outputs_by_name`) + `with_inputs_by_name` + transform
registration.  Updated harness/nemotron + nemo-nemotron-asr +
nemo-nemotron-streaming-asr to add the explicit `select_inputs` step.
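
The split between substituting a wire and trimming the interface can be pictured with a toy graph. This is only a sketch: `ToyGraph` is not tract's `Graph`, and the signature of `select_inputs_by_name` here is an assumption that borrows the name from the commit message.

```rust
/// Toy stand-in for a model graph: names only, no tensors.
#[derive(Debug, Default)]
pub struct ToyGraph {
    /// Every node in the graph, including source nodes.
    pub nodes: Vec<String>,
    /// The model interface: names of the nodes exposed as inputs.
    pub inputs: Vec<String>,
}

impl ToyGraph {
    /// Keep only the named inputs, in the requested order.
    /// Discarded sources stay in `nodes`; a later cleanup pass would prune them.
    pub fn select_inputs_by_name(&mut self, names: &[&str]) -> Result<(), String> {
        for name in names {
            if !self.nodes.iter().any(|n| n.as_str() == *name) {
                return Err(format!("no node named {name}"));
            }
        }
        self.inputs = names.iter().map(|n| n.to_string()).collect();
        Ok(())
    }
}
```

In this toy, selecting `["input_signal"]` changes only `inputs`; the node list is untouched, which mirrors the note that discarded sources remain until declutter removes them.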

The 'without-default-features' job in full.yml (cargo check -p tract-cli
--no-default-features) regressed after the cuda-12XXX split: cudarc and
tract-cuda were still pulled in unconditionally on linux/windows targets,
so stripping the cuda-13000 default left cudarc with no API-version
feature and its build script panicked.

Make both deps optional in tract-cli and tract-libcli, and have each
cuda-XXXXX feature pull them in (dep:cudarc + dep:tract-cuda +
tract-cuda/cuda-XXXXX + tract-libcli/cuda).  Adds a marker 'cuda'
feature so cudarc-touching code in bench.rs / dump.rs / libcli/lib.rs
can gate cleanly.
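
On the Rust side, gating on a marker feature usually looks like the sketch below. The feature name `cuda` comes from the commit message; the function `cuda_status` is hypothetical and stands in for the cudarc-touching code mentioned above.

```rust
/// Compiled only when the `cuda` feature is enabled.
/// Code that touches cudarc would live behind this gate.
#[cfg(feature = "cuda")]
pub fn cuda_status() -> &'static str {
    "cuda support compiled in"
}

/// Fallback used when the crate is built without the `cuda` feature,
/// for example with `--no-default-features`.
#[cfg(not(feature = "cuda"))]
pub fn cuda_status() -> &'static str {
    "cuda support not compiled in"
}
```

Because exactly one of the two definitions exists in any build, callers do not need their own `cfg` attributes.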

test-cuda explicitly opts into cuda-13000 (workspace dep has
default-features=false now), so 'cargo test -p tract-cuda -p test-cuda'
keeps building.

kali and others added 10 commits May 28, 2026 13:42

Unify the four overlapping names for 'bind a symbol to a value across
the model graph' under one verb:

  - core: `TypedModel::substitute_symbols` → `set_symbols`
  - core: `TypedOp::substitute_symbols` trait method → `set_symbols`
  - transform name: `concretize_symbols(values: …)` → `set_symbols(values: …)`
  - Rust API: `ConcretizeSymbols` → `SetSymbols`
  - Python API: `tract.ConcretizeSymbols` → `tract.SetSymbols`

The CLI `--set B=1` flag was already aligned and is unchanged.  No
deprecation aliases — hard rename across cli, harness scripts, examples
and Python bindings.

The Rust API builder gains a `SetSymbols::expr(name, str)` companion
to `value(name, i64)` so callers can pass TDim expressions (e.g.
`'2*S'`) the way the CLI `--set` and the transform already do.

`TDim::substitute` / `TDim::substitute_all` are unchanged: they
operate on a single TDim expression, not on the model, and "substitute"
is the accurate verb for that level.

The top-level `--set` flag was already TDim-aware via `parse_set_subs`
in params.rs; the `run` subcommand had a parallel `--set` flag that
only accepted plain i64.  Parse RHS as a TDim against the model's
symbol scope and reduce to i64 with the symbols set so far on the
command line, so `run --set FOO=2 --set T=2*FOO` resolves cleanly.

Order on the command line is significant: a symbol referenced on the RHS
must be set by an earlier `--set`.  Otherwise the command errors out with
the unresolved name in the message.
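
The left-to-right resolution can be sketched as follows. The expression grammar here is a tiny subset invented for illustration (factors joined by `*`, each an integer literal or an already-set symbol); real TDim parsing is richer, and `resolve_sets` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Resolve `NAME=EXPR` assignments in command-line order.
pub fn resolve_sets(args: &[&str]) -> Result<HashMap<String, i64>, String> {
    let mut values: HashMap<String, i64> = HashMap::new();
    for arg in args {
        let (name, expr) = arg
            .split_once('=')
            .ok_or_else(|| format!("expected NAME=EXPR, got {arg}"))?;
        let mut product = 1i64;
        for factor in expr.split('*') {
            let factor = factor.trim();
            let value = match factor.parse::<i64>() {
                Ok(literal) => literal,
                // Only symbols set earlier on the command line are visible.
                Err(_) => *values
                    .get(factor)
                    .ok_or_else(|| format!("unresolved symbol {factor} in --set {arg}"))?,
            };
            product *= value;
        }
        values.insert(name.trim().to_string(), product);
    }
    Ok(values)
}
```

With this sketch, `["FOO=2", "T=2*FOO"]` resolves, while the reversed order fails and names `FOO` in the error.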

The optimized Scan body runs the same plan with the same shapes every
timestep, so resolve its symbols once, reset between iters without
discarding them (reset_turn_keep_symbols), and reuse one drained input
buffer -- instead of a full model_state.run() cycle (set_inputs ->
resolve_symbols -> exec -> outputs -> reset_turn) per timestep.

Bit-identical to the old path across GRU/LSTM/RNN + df_dec. No measurable
wall-clock impact on fixed main (within +/-1% noise on gru/lstm/rnn 128/50
& 256/100 and df_dec, single-thread); kept as a cleanup of the per-iter
re-entry path, not as a perf change.
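
The shape of the per-timestep loop described above can be sketched with a toy state. Everything here is illustrative: `ToyState` is not tract's state type, and the method names only echo the commit message.

```rust
/// Toy per-run state that counts how often symbols get resolved.
#[derive(Default)]
pub struct ToyState {
    /// Number of symbol-resolution passes performed so far.
    pub symbol_resolutions: usize,
    symbols_ready: bool,
    /// Input buffer reused across timesteps.
    scratch: Vec<f32>,
}

impl ToyState {
    fn resolve_symbols(&mut self) {
        self.symbol_resolutions += 1;
        self.symbols_ready = true;
    }

    /// Clear per-iteration data but keep the resolved symbols.
    fn reset_turn_keep_symbols(&mut self) {
        self.scratch.clear(); // keeps the allocation for the next step
    }

    /// Run `body` once per timestep, resolving symbols only before the first one.
    pub fn run_steps(&mut self, steps: &[Vec<f32>], body: impl Fn(&[f32]) -> f32) -> Vec<f32> {
        let mut outputs = Vec::with_capacity(steps.len());
        for step in steps {
            if !self.symbols_ready {
                self.resolve_symbols();
            }
            self.scratch.extend_from_slice(step);
            outputs.push(body(&self.scratch));
            self.reset_turn_keep_symbols();
        }
        outputs
    }
}
```

The counter makes the intent checkable: three timesteps should leave `symbol_resolutions` at 1.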

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prefill-only GroupQueryAttention lowered onto tract Sdpa: reshapes Q/K/V to
4D, applies an explicit lower-triangular causal mask, and returns
present_key/present_value (the reshaped K/V). Sdpa handles the grouped-query
head sharing (kv_num_heads < num_heads). Decode-step KV cache, internal
rotary (do_rotary), local-window attention and softcap are rejected with
clear errors.

Validated against onnxruntime across head_size 8/16/64, several
num_heads/kv_num_heads ratios (incl. multi-query kv=1) and batch>1: attention
output matches to <=3.6e-7 and present_key/present_value are bit-exact.

ORT's GroupQueryAttention prefill is standard causal grouped-query attention;
the seqlens_k input is the 0-indexed position of the last token
(total_sequence_length - 1), not the token count.
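
The three conventions mentioned in this commit message can be written down compactly. This is a sketch with hypothetical helper names; the head mapping assumes consecutive query heads share one KV head, which is the usual grouped-query layout but is not stated in the message.

```rust
/// Lower-triangular causal mask: query position `q` may attend to key
/// position `k` if and only if k <= q.
pub fn causal_mask(seq_len: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|q| (0..seq_len).map(|k| k <= q).collect())
        .collect()
}

/// KV head used by a query head (assumed layout: consecutive query heads
/// share a KV head, and kv_num_heads divides num_heads).
pub fn kv_head_for(query_head: usize, num_heads: usize, kv_num_heads: usize) -> usize {
    assert!(kv_num_heads > 0 && num_heads % kv_num_heads == 0);
    assert!(query_head < num_heads);
    query_head / (num_heads / kv_num_heads)
}

/// seqlens_k as described above: the 0-indexed position of the last token,
/// not the token count.
pub fn seqlens_k(total_sequence_length: usize) -> usize {
    assert!(total_sequence_length > 0);
    total_sequence_length - 1
}
```

For a 10-token prefill, `seqlens_k` is 9; passing 10 would be the off-by-one the message warns about.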

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pin the resolved dependency graph so debug builds, release artifacts,
SBOMs and security audits all see the same versions.  CI runs against
this lockfile; `cargo update` is the explicit knob for bumping deps.

For each tract-cli release artifact (per target triple), generate both
CycloneDX and SPDX SBOMs from the workspace (Cargo.lock-driven) via
anchore/sbom-action and upload them alongside the .tgz.  Also pass
--locked to the release build so the SBOM matches the resolved deps
exactly.

The sbom-action ref is currently the v0.18.0 tag — dependabot
github-actions runs weekly and will SHA-pin on its next pass.

Each tract-cli release tarball now gets two GitHub attestations
(CycloneDX and SPDX SBOM, via actions/attest-sbom).  Anyone can
verify after download with:

  gh attestation verify tract-<triple>-<version>.tgz --owner sonos

Requires `id-token: write` + `attestations: write` on the job.
sbom-action's `upload-release-assets: false` keeps the SBOM files
out of its own upload path so the explicit softprops step is the
single source of release artifacts.

- cargo auditable wraps the release build so the resolved dependency
  graph lands inside the binary itself.  Consumers can recover the
  SBOM with `cargo audit bin tract` without needing the published
  .cdx.json / .spdx.json files.
- actions/attest-build-provenance@v2 signs the .tgz with provenance
  metadata (workflow ref, commit SHA, runner).  Combined with the
  existing SBOM attestations this lands at SLSA Build Level 3.

Pinned commits (latest stable as of writing):
- anchore/sbom-action @ e22c389 (v0.24.0)
- actions/attest-sbom @ c604332 (v4.1.0)
- actions/attest-build-provenance @ a2bbfa2 (v4.1.0)

Matches the existing SHA + comment convention used for
actions/checkout and softprops/action-gh-release; dependabot's
github-actions group will keep them current.

Two-part change so consumers can audit the deps that landed in the
tract Python wheel without needing to re-clone the Rust workspace:

1. `api/py/pyproject.toml` (Linux + macOS cibuildwheel before-build):
   install cargo-auditable and write a one-line bash shim that
   prefixes `auditable` to every cargo invocation.  setuptools_rust
   honours $CARGO (build.py:97), so pointing CARGO at the shim makes
   the Rust .so inside the wheel carry its dep graph in the
   `.dep-v0` ELF/Mach-O section.  Windows wheels stay as-is for now
   (TODO comment).

2. `.github/workflows/wheels.yml` + `.github/scripts/inject_wheel_sboms.py`:
   after cibuildwheel emits each .whl, install syft (via
   anchore/sbom-action/download-syft, SHA-pinned), unpack the wheel,
   scan its contents (syft's rust-audit-binary cataloger reads the
   embedded cargo-auditable section), drop sbom.cdx.json +
   sbom.spdx.json into `<dist-info>/sboms/` per PEP 770, and
   re-pack via `wheel pack` (which regenerates RECORD with hashes).

Smoke-tested locally on a sample wheel: SBOMs end up at the right
path and RECORD has correct sha256 entries.

kali and others added 3 commits May 28, 2026 17:39

atty (0.2.x) is unmaintained and triggers RUSTSEC-2021-0145 on SBOM
audits.  It's only used in two places — both `is stderr a TTY`
checks in `tract hwbench` — and std::io::IsTerminal (stable since
1.70, well below tract's MSRV) is a drop-in.
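
A minimal sketch of the standard-library replacement follows. The wrapper name `stderr_is_tty` is hypothetical; `std::io::IsTerminal` and `is_terminal()` are the real std API, standing in for a call such as `atty::is(atty::Stream::Stderr)`.

```rust
use std::io::IsTerminal;

/// True when stderr is attached to a terminal, for example to decide
/// whether interactive progress output makes sense.
pub fn stderr_is_tty() -> bool {
    std::io::stderr().is_terminal()
}
```

The result depends on how the process was launched: it is typically false when stderr is redirected to a file or a pipe.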

`cargo tree -i atty` after the change reports the crate is no longer
in the workspace dep graph.

De-orphan and fix the latent zmm sigmoid/tanh kernels (tail-loop stride bugs
causing OOB stores for lengths not a multiple of 64), and add AVX-512
hardswish, leaky_relu, plus silu/gelu as compositions over the AVX-512
sigmoid/tanh. Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA/generic
path.
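
Runtime gating of this kind is commonly built on `is_x86_feature_detected!`. The sketch below shows the pattern only; the `Kernel` enum and `pick_kernel` are hypothetical, and how tract-linalg actually registers its kernels is not shown in this PR text.

```rust
/// Which implementation the dispatcher picked (labels for this sketch only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kernel {
    Avx512,
    Fma,
    Generic,
}

/// Pick a kernel from CPU features detected at run time.
pub fn pick_kernel() -> Kernel {
    // The detection macro exists only on x86 targets, so the whole check
    // is compiled out elsewhere and the generic path is returned.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            return Kernel::Avx512;
        }
        if is_x86_feature_detected!("fma") {
            return Kernel::Fma;
        }
    }
    Kernel::Generic
}
```

Checking `avx512f` before `fma` keeps hosts without AVX-512 on their existing path, which matches the fallback behaviour described above.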

Measured on Cascade Lake (single-thread): sigmoid 1.24x and tanh 1.29x over
the existing FMA paths; hardswish/leaky_relu/silu/gelu 5-21x over the generic
scalar paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
czoli1976 force-pushed the feat/avx512-activations branch from 55c21d0 to 7cb4bd7 on May 29, 2026 08:13