linalg/x86_64: add AVX-512 element-wise activations by czoli1976 · Pull Request #4 · czoli1976/tract

czoli1976 · 2026-05-28T06:05:40Z

Summary

De-orphan and fix latent zmm sigmoid_f32 / tanh_f32 kernels (tail-loop stride bugs that caused OOB stores for lengths not a multiple of 64).
Add AVX-512 versions of hardswish and leaky_relu.
Add silu and gelu as compositions over the AVX-512 sigmoid / tanh.
Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA / generic path.
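As a rough illustration of the runtime gating described above, here is a minimal sketch using the standard `is_x86_feature_detected!` macro. The `KernelPath` enum and `select_path` function are hypothetical names for this example only; the actual dispatch code in tract-linalg is not shown in this PR description and may be organised differently.

```rust
/// Hypothetical tag for the kernel family chosen at runtime.
#[derive(Debug, PartialEq, Clone, Copy)]
enum KernelPath {
    Avx512,
    Fallback, // stands in for the existing FMA / generic path
}

/// Pick a kernel family from the CPU features detected at runtime.
fn select_path() -> KernelPath {
    // The detection macro only exists on x86 targets, so gate it at compile time too.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512f") {
            return KernelPath::Avx512;
        }
    }
    KernelPath::Fallback
}
```

On a host without `avx512f` (or on a non-x86 target) `select_path()` returns `KernelPath::Fallback`, which mirrors the "non-AVX512 x86 keeps the FMA / generic path" behaviour.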

Bench (local, single-thread, Cascade Lake, throughput Gelem/s):

sigmoid_f32: FMA 2.74 → AVX-512 3.41 (1.24×)
tanh_f32: FMA 2.78 → AVX-512 3.59 (1.29×)
hardswish_f32: generic 1.01 → AVX-512 15.4 (15.4×, vs scalar baseline)
leaky_relu_f32: generic 1.79 → AVX-512 26.2 (14.6×, vs scalar baseline)
silu_f32: generic 0.066 → AVX-512 1.39 (~21×, vs scalar baseline)
gelu_f32: generic 0.084 → AVX-512 0.50 (5.9×, vs scalar baseline)
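For reference, the activations benchmarked above can be written as scalar functions. This is a sketch of the usual textbook definitions, not the AVX-512 kernels themselves. It assumes that gelu uses the common tanh approximation (the PR only says it is composed over tanh) and that leaky_relu takes its slope as a parameter named `alpha` here.

```rust
/// Logistic sigmoid.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// hardswish(x) = x * relu6(x + 3) / 6
fn hardswish(x: f32) -> f32 {
    x * (x + 3.0).clamp(0.0, 6.0) / 6.0
}

/// leaky_relu(x) = x for x >= 0, alpha * x otherwise.
fn leaky_relu(x: f32, alpha: f32) -> f32 {
    if x >= 0.0 { x } else { alpha * x }
}

/// silu(x) = x * sigmoid(x): a composition over sigmoid.
fn silu(x: f32) -> f32 {
    x * sigmoid(x)
}

/// gelu, tanh approximation (assumed here): a composition over tanh.
fn gelu_tanh(x: f32) -> f32 {
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044_715 * x * x * x)).tanh())
}
```

Because silu and gelu each add a multiply (and, for gelu, a cubic polynomial) on top of a transcendental kernel, it is plausible that their throughput sits below the plain sigmoid / tanh numbers.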

An AVX-512 mul_by_scalar variant was prototyped but consistently regressed ~28% vs the existing FMA path on Cascade Lake (the op is too light to amortize the zmm frequency-license clock drop) and is therefore not included.

Test plan

cargo test --release -p tract-linalg — 2687 passed, 0 failed (includes non-multiple-of-64 lengths exercising the fixed tail loop)
Microbench: AVX-512 vs FMA / generic per activation
Non-AVX512 x86 hosts unchanged (fallback exercised via avx512f gating)
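The tail-loop concern exercised by the non-multiple-of-64 tests can be illustrated in safe Rust. This sketch assumes the main loop consumes 64 `f32` per step (16 lanes per zmm register times an assumed unroll factor of 4); the real kernels are assembly and their exact structure is not shown in this PR description. `apply_in_place` is a hypothetical helper.

```rust
const LANES: usize = 16; // f32 lanes in one 512-bit zmm register
const UNROLL: usize = 4; // assumed unroll factor: 64 elements per main-loop step

/// Apply `f` to every element, processing full 64-element blocks first and
/// then a tail that is strictly shorter than one block.
/// Returns (number of full blocks, tail length) so callers can check the split.
fn apply_in_place(buf: &mut [f32], f: impl Fn(f32) -> f32) -> (usize, usize) {
    let block = LANES * UNROLL;
    let mut blocks = buf.chunks_exact_mut(block);
    let mut n_blocks = 0;
    for chunk in &mut blocks {
        for v in chunk.iter_mut() {
            *v = f(*v);
        }
        n_blocks += 1;
    }
    // The remainder never extends past the end of `buf`, so no store can go
    // out of bounds regardless of the length.
    let tail = blocks.into_remainder();
    let tail_len = tail.len();
    for v in tail.iter_mut() {
        *v = f(*v);
    }
    (n_blocks, tail_len)
}
```

A length of 130 splits into two full blocks plus a tail of 2, which is the kind of case a stride bug in the tail loop would mishandle.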

`TypedModelPatch::shunt_outside` leaves the shunted node in the graph, but the NNEF `patch` transform also implicitly removed model inputs whose name appeared on the LHS. That hidden side-effect made `patch` do two things at once: substitute a wire, and trim the interface.

Drop the trimming. Add a sibling `select_inputs(inputs: [...])` transform shaped like `select_outputs`. The pulse pipeline now reads:

    -t 'patch(body: "length = tract_core_shape_of(input_signal)[1];")' \
    -t 'select_inputs(inputs: ["input_signal"])' \
    -t 'select_outputs(outputs: ["processed_signal"])' \
    -t 'pulse(symbol: ..., pulse: ...)'

Discarded Sources stay in the graph until declutter prunes them.

Wire-up: `Graph::select_inputs_by_name` (mirror of `select_outputs_by_name`) + `with_inputs_by_name` + transform registration. Updated harness/nemotron + nemo-nemotron-asr + nemo-nemotron-streaming-asr to add the explicit `select_inputs` step.

The 'without-default-features' job in full.yml (cargo check -p tract-cli --no-default-features) regressed after the cuda-12XXX split: cudarc and tract-cuda were still pulled in unconditionally on linux/windows targets, so stripping the cuda-13000 default left cudarc with no API-version feature and its build script panicked. Make both deps optional in tract-cli and tract-libcli, and have each cuda-XXXXX feature pull them in (dep:cudarc + dep:tract-cuda + tract-cuda/cuda-XXXXX + tract-libcli/cuda). Adds a marker 'cuda' feature so cudarc-touching code in bench.rs / dump.rs / libcli/lib.rs can gate cleanly. test-cuda explicitly opts into cuda-13000 (workspace dep has default-features=false now), so 'cargo test -p tract-cuda -p test-cuda' keeps building.

Unify the four overlapping names for 'bind a symbol to a value across the model graph' under one verb:

- core: `TypedModel::substitute_symbols` → `set_symbols`
- core: `TypedOp::substitute_symbols` trait method → `set_symbols`
- transform name: `concretize_symbols(values: …)` → `set_symbols(values: …)`
- Rust API: `ConcretizeSymbols` → `SetSymbols`
- Python API: `tract.ConcretizeSymbols` → `tract.SetSymbols`

The CLI `--set B=1` flag was already aligned and is unchanged. No deprecation aliases: this is a hard rename across cli, harness scripts, examples and Python bindings.

The Rust API builder gains a `SetSymbols::expr(name, str)` companion to `value(name, i64)` so callers can pass TDim expressions (e.g. `'2*S'`) the way the CLI `--set` and the transform already do.

`TDim::substitute` / `TDim::substitute_all` are unchanged: they operate on a single TDim expression, not on the model, and "substitute" is the accurate verb for that level.

The top-level `--set` flag was already TDim-aware via `parse_set_subs` in params.rs; the `run` subcommand had a parallel `--set` flag that only accepted plain i64. Parse RHS as a TDim against the model's symbol scope and reduce to i64 with the symbols set so far on the command line, so `run --set FOO=2 --set T=2*FOO` resolves cleanly. Order is CLI-significant: a symbol referenced on the RHS must be set to its left. Errors out with the unresolved name in the message.
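The left-to-right resolution rule can be shown with a toy model. This sketch does not use tract's `TDim` or `parse_set_subs`; `eval` and `resolve_sets` are hypothetical helpers that only understand integer literals and symbol names joined by `*`, which is enough to show why `T=2*FOO` needs `FOO` set to its left.

```rust
use std::collections::HashMap;

/// Evaluate a toy expression: integer literals and symbol names joined by `*`.
/// Symbols must already be present in `env`.
fn eval(expr: &str, env: &HashMap<String, i64>) -> Result<i64, String> {
    let mut acc: i64 = 1;
    for factor in expr.split('*') {
        let factor = factor.trim();
        let value = match factor.parse::<i64>() {
            Ok(n) => n,
            Err(_) => *env
                .get(factor)
                .ok_or_else(|| format!("unresolved symbol: {}", factor))?,
        };
        acc *= value;
    }
    Ok(acc)
}

/// Process `NAME=EXPR` arguments in command-line order, reducing each
/// right-hand side to an i64 using only the symbols set so far.
fn resolve_sets(args: &[&str]) -> Result<HashMap<String, i64>, String> {
    let mut env = HashMap::new();
    for arg in args {
        let (name, rhs) = arg
            .split_once('=')
            .ok_or_else(|| format!("expected NAME=EXPR, got {}", arg))?;
        let value = eval(rhs, &env)?;
        env.insert(name.trim().to_string(), value);
    }
    Ok(env)
}
```

With this model, `resolve_sets(&["FOO=2", "T=2*FOO"])` binds `T` to 4, while swapping the two arguments fails with an error that names `FOO`.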

The optimized Scan body runs the same plan with the same shapes every timestep, so resolve its symbols once, reset between iters without discarding them (reset_turn_keep_symbols), and reuse one drained input buffer, instead of running a full model_state.run() cycle (set_inputs -> resolve_symbols -> exec -> outputs -> reset_turn) per timestep.

Bit-identical to the old path across GRU/LSTM/RNN + df_dec. No measurable wall-clock impact on fixed main (within +/-1% noise on gru/lstm/rnn 128/50 & 256/100 and df_dec, single-thread); kept as a cleanup of the per-iter re-entry path, not as a perf change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prefill-only GroupQueryAttention lowered onto tract Sdpa: reshapes Q/K/V to 4D, applies an explicit lower-triangular causal mask, and returns present_key/present_value (the reshaped K/V). Sdpa handles the grouped-query head sharing (kv_num_heads < num_heads). Decode-step KV cache, internal rotary (do_rotary), local-window attention and softcap are rejected with clear errors.

Validated against onnxruntime across head_size 8/16/64, several num_heads/kv_num_heads ratios (incl. multi-query kv=1) and batch>1: attention output matches to <=3.6e-7 and present_key/present_value are bit-exact.

ORT's GroupQueryAttention prefill is standard causal grouped-query attention; the seqlens_k input is the 0-indexed position of the last token (total_sequence_length - 1), not the token count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
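The three conventions named in this commit message (lower-triangular causal mask, grouped-query head sharing, and the `seqlens_k` off-by-one) can be sketched in a few lines. The function names are hypothetical, the head mapping assumes the usual contiguous grouping with `num_heads` a multiple of `kv_num_heads`, and none of this is the tract or onnxruntime implementation.

```rust
/// Lower-triangular causal mask: query position i may attend to key
/// position j only when j <= i.
fn causal_mask(seq_len: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| (0..seq_len).map(|j| j <= i).collect())
        .collect()
}

/// KV head shared by a given query head, assuming contiguous groups of
/// num_heads / kv_num_heads query heads per KV head.
fn kv_head_for(q_head: usize, num_heads: usize, kv_num_heads: usize) -> usize {
    assert!(kv_num_heads > 0 && num_heads % kv_num_heads == 0);
    assert!(q_head < num_heads);
    q_head / (num_heads / kv_num_heads)
}

/// seqlens_k as described above: the 0-indexed position of the last token,
/// not the token count.
fn seqlens_k(total_sequence_length: usize) -> usize {
    assert!(total_sequence_length > 0);
    total_sequence_length - 1
}
```

For example, with 8 query heads and 2 KV heads, query heads 0..=3 map to KV head 0 and 4..=7 to KV head 1; the multi-query case (`kv_num_heads = 1`) maps every query head to KV head 0.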

Pin the resolved dependency graph so debug builds, release artifacts, SBOMs and security audits all see the same versions. CI runs against this lockfile; `cargo update` is the explicit knob for bumping deps.

For each tract-cli release artifact (per target triple), generate both CycloneDX and SPDX SBOMs from the workspace (Cargo.lock-driven) via anchore/sbom-action and upload them alongside the .tgz. Also pass --locked to the release build so the SBOM matches the resolved deps exactly. The sbom-action ref is currently the v0.18.0 tag — dependabot github-actions runs weekly and will SHA-pin on its next pass.

Each tract-cli release tarball now gets two GitHub attestations (CycloneDX and SPDX SBOM, via actions/attest-sbom). Anyone can verify after download with:

    gh attestation verify tract-<triple>-<version>.tgz --owner sonos

Requires `id-token: write` + `attestations: write` on the job. sbom-action's `upload-release-assets: false` keeps the SBOM files out of its own upload path so the explicit softprops step is the single source of release artifacts.

- cargo auditable wraps the release build so the resolved dependency graph lands inside the binary itself. Consumers can recover the SBOM with `cargo audit bin tract` without needing the published .cdx.json / .spdx.json files.
- actions/attest-build-provenance@v2 signs the .tgz with provenance metadata (workflow ref, commit SHA, runner). Combined with the existing SBOM attestations this lands at SLSA Build Level 3.

Pinned commits (latest stable as of writing):

- anchore/sbom-action @ e22c389 (v0.24.0)
- actions/attest-sbom @ c604332 (v4.1.0)
- actions/attest-build-provenance @ a2bbfa2 (v4.1.0)

Matches the existing SHA + comment convention used for actions/checkout and softprops/action-gh-release; dependabot's github-actions group will keep them current.

Two-part change so consumers can audit the deps that landed in the tract Python wheel without needing to re-clone the Rust workspace:

1. `api/py/pyproject.toml` (Linux + macOS cibuildwheel before-build): install cargo-auditable and write a one-line bash shim that prefixes `auditable` to every cargo invocation. setuptools_rust honours $CARGO (build.py:97), so pointing CARGO at the shim makes the Rust .so inside the wheel carry its dep graph in the `.dep-v0` ELF/Mach-O section. Windows wheels stay as-is for now (TODO comment).
2. `.github/workflows/wheels.yml` + `.github/scripts/inject_wheel_sboms.py`: after cibuildwheel emits each .whl, install syft (via anchore/sbom-action/download-syft, SHA-pinned), unpack the wheel, scan its contents (syft's rust-audit-binary cataloger reads the embedded cargo-auditable section), drop sbom.cdx.json + sbom.spdx.json into `<dist-info>/sboms/` per PEP 770, and re-pack via `wheel pack` (which regenerates RECORD with hashes).

Smoke-tested locally on a sample wheel: SBOMs end up at the right path and RECORD has correct sha256 entries.

atty (0.2.x) is unmaintained and triggers RUSTSEC-2021-0145 on SBOM audits. It's only used in two places — both `is stderr a TTY` checks in `tract hwbench` — and std::io::IsTerminal (stable since 1.70, well below tract's MSRV) is a drop-in. `cargo tree -i atty` after the change reports the crate is no longer in the workspace dep graph.
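The standard-library replacement is small enough to show in full. This is a minimal sketch of the `is stderr a TTY` check using `std::io::IsTerminal`; the wrapper name `stderr_is_tty` is hypothetical and the actual call sites in `tract hwbench` may be inlined differently.

```rust
use std::io::IsTerminal;

/// True when stderr is attached to a terminal.
/// Plays the role that `atty::is(atty::Stream::Stderr)` had before.
fn stderr_is_tty() -> bool {
    std::io::stderr().is_terminal()
}
```

A caller would typically use the result to decide whether to draw progress output or colour, and fall back to plain logging when stderr is redirected to a file or pipe.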

De-orphan and fix the latent zmm sigmoid/tanh kernels (tail-loop stride bugs causing OOB stores for lengths not a multiple of 64), and add AVX-512 hardswish, leaky_relu, plus silu/gelu as compositions over the AVX-512 sigmoid/tanh. Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA/generic path.

Measured on Cascade Lake (single-thread): sigmoid 1.24x and tanh 1.29x over the existing FMA paths; hardswish/leaky_relu/silu/gelu 5-21x over the generic scalar paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

czoli1976 force-pushed the feat/avx512-activations branch 3 times, most recently from 3045e23 to d2c76b8 Compare May 28, 2026 08:55

czoli1976 changed the title ~~add AVX-512 element-wise activations (x86_64)~~ linalg/x86_64: add AVX-512 element-wise activations May 28, 2026

kali added 2 commits May 28, 2026 11:00

czoli1976 force-pushed the feat/avx512-activations branch from d2c76b8 to 55c21d0 Compare May 28, 2026 09:59

czoli1976 mentioned this pull request May 28, 2026

linalg/x86_64: add AVX-512 f16 element-wise activations #8

Open

3 tasks

kali and others added 10 commits May 28, 2026 13:42

build: track Cargo.lock

a5eeab5


czoli1976 mentioned this pull request May 28, 2026

linalg/x86_64: AVX-512_FP16 native f16 hardswish kernel #10

Open

7 tasks

kali and others added 3 commits May 28, 2026 17:39

changelog update

ff8abf2

czoli1976 force-pushed the feat/avx512-activations branch from 55c21d0 to 7cb4bd7 Compare May 29, 2026 08:13

czoli1976 wants to merge 15 commits into base/sonos-main from feat/avx512-activations
