linalg/x86_64: add AVX-512 erf kernel by czoli1976 · Pull Request #6 · czoli1976/tract

czoli1976 · 2026-05-28T09:40:20Z

Summary

New x86_64_avx512_erf_f32_64n kernel: AVX-512 (zmm, 16-wide) erf mirroring generic/erf.rs::serf (Abramowitz & Stegun 7.1.28 six-coefficient approximation), processing 64 lanes per iteration through 4 zmm registers with FMA Horner chains; the final 1/(y+1)^16 goes through vdivps (full IEEE precision).
Wires into Ops::erf_f32 from plug_avx512f; non-AVX512 x86 keeps the generic scalar path (which the compiler typically auto-vectorizes to FMA).
New linalg/src/frame/erf.rs with erf_frame_tests! macro (mirrors frame/hardswish.rs structure) so both the generic SErf4 and the AVX-512 kernel share one proptest reference.
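For readers who want the math behind the kernel, here is a minimal scalar sketch of the six-coefficient, 16th-power approximation described above (formula 7.1.28 in Abramowitz & Stegun). The function name `erf_sketch`, the coefficient table and the squaring loop are illustrative assumptions, not a copy of generic/erf.rs or of the AVX-512 kernel.

```rust
/// Coefficients a1..a6 of the six-term approximation (assumed from
/// Abramowitz & Stegun; check against generic/erf.rs before relying on them).
const A: [f32; 6] = [
    0.0705230784,
    0.0422820123,
    0.0092705272,
    0.0001520143,
    0.0002765672,
    0.0000430638,
];

/// erf(x) ~= sign(x) * (1 - 1 / (1 + a1|x| + ... + a6|x|^6)^16)
pub fn erf_sketch(x: f32) -> f32 {
    let ax = x.abs();
    // Horner chain for a1 + a2*ax + ... + a6*ax^5, one fused multiply-add per step.
    let mut y = A[5];
    for &a in A[..5].iter().rev() {
        y = y.mul_add(ax, a);
    }
    // Multiply by ax once more to get a1*ax + ... + a6*ax^6.
    y *= ax;
    // (y + 1)^16 via four successive squarings.
    let mut p = y + 1.0;
    for _ in 0..4 {
        p *= p;
    }
    // One division, then restore the sign of the input.
    (1.0 - 1.0 / p).copysign(x)
}
```

A vector kernel would presumably run the same chain on 16 lanes per register; this sketch only covers the per-lane arithmetic.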

Bench (local, single-thread, Cascade Lake, throughput Gelem/s):

erf_f32 generic (compiler auto-vectorized to FMA): 0.81
erf_f32 AVX-512: 3.27 (4.05× generic)

The 4× headline reflects the bench host's auto-vectorized generic baseline (the compiler emits FMA from the polynomial chain at -O3). On pre-FMA x86 the gap is much wider, since the compiler can't autovectorize the chain.

Test plan

cargo test --release -p tract-linalg — 2672 passed, 0 failed (6 new erf tests: 3 generic + 3 AVX-512)
cargo bench --bench erf — numbers above
Non-AVX512 x86 hosts unchanged (fallback exercised via avx512f gating in plug_avx512f)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Generated by Claude Code

`TypedModelPatch::shunt_outside` leaves the shunted node in the graph, but the NNEF `patch` transform also implicitly removed model inputs whose name appeared on the LHS. That hidden side-effect made `patch` do two things at once: substitute a wire, and trim the interface.

Drop the trimming. Add a sibling `select_inputs(inputs: [...])` transform shaped like `select_outputs`. The pulse pipeline now reads:

    -t 'patch(body: "length = tract_core_shape_of(input_signal)[1];")' \
    -t 'select_inputs(inputs: ["input_signal"])' \
    -t 'select_outputs(outputs: ["processed_signal"])' \
    -t 'pulse(symbol: ..., pulse: ...)'

Discarded Sources stay in the graph until declutter prunes them.

Wire-up: `Graph::select_inputs_by_name` (mirror of `select_outputs_by_name`) + `with_inputs_by_name` + transform registration. Updated harness/nemotron + nemo-nemotron-asr + nemo-nemotron-streaming-asr to add the explicit `select_inputs` step.

The 'without-default-features' job in full.yml (cargo check -p tract-cli --no-default-features) regressed after the cuda-12XXX split: cudarc and tract-cuda were still pulled in unconditionally on linux/windows targets, so stripping the cuda-13000 default left cudarc with no API-version feature and its build script panicked. Make both deps optional in tract-cli and tract-libcli, and have each cuda-XXXXX feature pull them in (dep:cudarc + dep:tract-cuda + tract-cuda/cuda-XXXXX + tract-libcli/cuda). Adds a marker 'cuda' feature so cudarc-touching code in bench.rs / dump.rs / libcli/lib.rs can gate cleanly. test-cuda explicitly opts into cuda-13000 (workspace dep has default-features=false now), so 'cargo test -p tract-cuda -p test-cuda' keeps building.

Unify the four overlapping names for 'bind a symbol to a value across the model graph' under one verb:

- core: `TypedModel::substitute_symbols` → `set_symbols`
- core: `TypedOp::substitute_symbols` trait method → `set_symbols`
- transform name: `concretize_symbols(values: …)` → `set_symbols(values: …)`
- Rust API: `ConcretizeSymbols` → `SetSymbols`
- Python API: `tract.ConcretizeSymbols` → `tract.SetSymbols`

The CLI `--set B=1` flag was already aligned and is unchanged. No deprecation aliases: this is a hard rename across cli, harness scripts, examples and Python bindings.

The Rust API builder gains a `SetSymbols::expr(name, str)` companion to `value(name, i64)` so callers can pass TDim expressions (e.g. `'2*S'`) the way the CLI `--set` and the transform already do.

`TDim::substitute` / `TDim::substitute_all` are unchanged: they operate on a single TDim expression, not on the model, and "substitute" is the accurate verb for that level.

The top-level `--set` flag was already TDim-aware via `parse_set_subs` in params.rs; the `run` subcommand had a parallel `--set` flag that only accepted plain i64. Parse RHS as a TDim against the model's symbol scope and reduce to i64 with the symbols set so far on the command line, so `run --set FOO=2 --set T=2*FOO` resolves cleanly. Order is CLI-significant: a symbol referenced on the RHS must be set to its left. Errors out with the unresolved name in the message.
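The left-to-right rule can be illustrated with a toy resolver. This is a sketch under heavy assumptions: `eval_product` and `resolve_sets` are hypothetical names, and the expression grammar here only handles products of integers and symbols, whereas real TDim expressions are richer.

```rust
use std::collections::HashMap;

/// Toy stand-in for TDim parsing: evaluates a product of integer literals
/// and symbol names such as "2*FOO". Not tract's parser.
fn eval_product(expr: &str, known: &HashMap<String, i64>) -> Result<i64, String> {
    let mut acc: i64 = 1;
    for factor in expr.split('*') {
        let factor = factor.trim();
        let value = match factor.parse::<i64>() {
            Ok(v) => v,
            // Not a literal: it must be a symbol bound earlier on the command line.
            Err(_) => *known
                .get(factor)
                .ok_or_else(|| format!("unresolved symbol `{factor}`"))?,
        };
        acc = acc.checked_mul(value).ok_or("overflow")?;
    }
    Ok(acc)
}

/// Resolves `NAME=EXPR` bindings in order; each RHS sees only the symbols
/// bound to its left.
pub fn resolve_sets(args: &[&str]) -> Result<HashMap<String, i64>, String> {
    let mut known = HashMap::new();
    for arg in args {
        let (name, rhs) = arg
            .split_once('=')
            .ok_or_else(|| format!("expected NAME=EXPR, got `{arg}`"))?;
        let value = eval_product(rhs, &known)?;
        known.insert(name.trim().to_string(), value);
    }
    Ok(known)
}
```

With this sketch, `["FOO=2", "T=2*FOO"]` resolves, while the reversed order fails with an error that names `FOO`.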

The optimized Scan body runs the same plan with the same shapes every timestep, so resolve its symbols once, reset between iters without discarding them (reset_turn_keep_symbols), and reuse one drained input buffer -- instead of a full model_state.run() cycle (set_inputs -> resolve_symbols -> exec -> outputs -> reset_turn) per timestep. Bit-identical to the old path across GRU/LSTM/RNN + df_dec. No measurable wall-clock impact on fixed main (within +/-1% noise on gru/lstm/rnn 128/50 & 256/100 and df_dec, single-thread); kept as a cleanup of the per-iter re-entry path, not as a perf change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prefill-only GroupQueryAttention lowered onto tract Sdpa: reshapes Q/K/V to 4D, applies an explicit lower-triangular causal mask, and returns present_key/present_value (the reshaped K/V). Sdpa handles the grouped-query head sharing (kv_num_heads < num_heads). Decode-step KV cache, internal rotary (do_rotary), local-window attention and softcap are rejected with clear errors. Validated against onnxruntime across head_size 8/16/64, several num_heads/kv_num_heads ratios (incl. multi-query kv=1) and batch>1: attention output matches to <=3.6e-7 and present_key/present_value are bit-exact. ORT's GroupQueryAttention prefill is standard causal grouped-query attention; the seqlens_k input is the 0-indexed position of the last token (total_sequence_length - 1), not the token count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pin the resolved dependency graph so debug builds, release artifacts, SBOMs and security audits all see the same versions. CI runs against this lockfile; `cargo update` is the explicit knob for bumping deps.

For each tract-cli release artifact (per target triple), generate both CycloneDX and SPDX SBOMs from the workspace (Cargo.lock-driven) via anchore/sbom-action and upload them alongside the .tgz. Also pass --locked to the release build so the SBOM matches the resolved deps exactly. The sbom-action ref is currently the v0.18.0 tag — dependabot github-actions runs weekly and will SHA-pin on its next pass.

Each tract-cli release tarball now gets two GitHub attestations (CycloneDX and SPDX SBOM, via actions/attest-sbom). Anyone can verify after download with:

    gh attestation verify tract-<triple>-<version>.tgz --owner sonos

Requires `id-token: write` + `attestations: write` on the job. sbom-action's `upload-release-assets: false` keeps the SBOM files out of its own upload path so the explicit softprops step is the single source of release artifacts.

- cargo auditable wraps the release build so the resolved dependency graph lands inside the binary itself. Consumers can recover the SBOM with `cargo audit bin tract` without needing the published .cdx.json / .spdx.json files.
- actions/attest-build-provenance@v2 signs the .tgz with provenance metadata (workflow ref, commit SHA, runner). Combined with the existing SBOM attestations this lands at SLSA Build Level 3.

Pinned commits (latest stable as of writing):

- anchore/sbom-action @ e22c389 (v0.24.0)
- actions/attest-sbom @ c604332 (v4.1.0)
- actions/attest-build-provenance @ a2bbfa2 (v4.1.0)

Matches the existing SHA + comment convention used for actions/checkout and softprops/action-gh-release; dependabot's github-actions group will keep them current.

Two-part change so consumers can audit the deps that landed in the tract Python wheel without needing to re-clone the Rust workspace:

1. `api/py/pyproject.toml` (Linux + macOS cibuildwheel before-build): install cargo-auditable and write a one-line bash shim that prefixes `auditable` to every cargo invocation. setuptools_rust honours $CARGO (build.py:97), so pointing CARGO at the shim makes the Rust .so inside the wheel carry its dep graph in the `.dep-v0` ELF/Mach-O section. Windows wheels stay as-is for now (TODO comment).
2. `.github/workflows/wheels.yml` + `.github/scripts/inject_wheel_sboms.py`: after cibuildwheel emits each .whl, install syft (via anchore/sbom-action/download-syft, SHA-pinned), unpack the wheel, scan its contents (syft's rust-audit-binary cataloger reads the embedded cargo-auditable section), drop sbom.cdx.json + sbom.spdx.json into `<dist-info>/sboms/` per PEP 770, and re-pack via `wheel pack` (which regenerates RECORD with hashes).

Smoke-tested locally on a sample wheel: SBOMs end up at the right path and RECORD has correct sha256 entries.

atty (0.2.x) is unmaintained and triggers RUSTSEC-2021-0145 on SBOM audits. It's only used in two places — both `is stderr a TTY` checks in `tract hwbench` — and std::io::IsTerminal (stable since 1.70, well below tract's MSRV) is a drop-in. `cargo tree -i atty` after the change reports the crate is no longer in the workspace dep graph.
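A minimal sketch of the replacement, assuming the call sites only need a yes/no answer for stderr. `std::io::IsTerminal` is the standard-library trait; `stderr_is_tty` and `progress_style` are hypothetical helpers introduced for illustration and are not taken from `tract hwbench`.

```rust
use std::io::{self, IsTerminal};

/// Returns true when stderr is attached to a terminal.
/// Equivalent in intent to the old `atty::is(atty::Stream::Stderr)` check.
pub fn stderr_is_tty() -> bool {
    io::stderr().is_terminal()
}

/// Hypothetical consumer: choose an output style from the TTY check.
pub fn progress_style(is_tty: bool) -> &'static str {
    if is_tty {
        "interactive"
    } else {
        "plain"
    }
}
```

Keeping the check in one small function makes the dependency swap a one-line change at each call site.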

runtime_for_name("gpu") → first GPU backend whose `check()` passes (metal, then cuda); error if none are available.

runtime_for_name("gpu-or-cpu") → same lookup, but falls through to the `default` CPU runtime instead of erroring.

No new mechanism: both names walk the existing inventory and use each backend's existing `check()` to decide availability. Backend-specific names (`cuda`, `metal`) still work as before.

…or_name The CPU runtime now reports its own name as `cpu` (which is what it is), so `list-runtimes` shows `cpu`, `cuda`, `metal` … instead of the misleading `default`. Back-compat for callers passing `default` is handled by a one-line alias in `runtime_for_name` rather than by registering two runtimes or by polluting the trait — the alias only affects name lookup, not the inventory.
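The lookup rules from the last two commit messages (the `gpu` / `gpu-or-cpu` meta-names and the `default` alias) can be sketched over a toy inventory. The `Backend` struct and its fields are assumptions made for illustration; in particular `available` stands in for the result of each backend's `check()`, and the real `runtime_for_name` signature is not shown in the source.

```rust
/// Hypothetical stand-in for a registered backend.
pub struct Backend {
    pub name: &'static str,
    pub is_gpu: bool,
    /// Plays the role of the backend's `check()` result.
    pub available: bool,
}

pub fn runtime_for_name<'a>(inventory: &'a [Backend], name: &str) -> Result<&'a Backend, String> {
    // `default` is a lookup-only alias for the CPU runtime.
    let name = if name == "default" { "cpu" } else { name };
    // Inventory order decides which GPU backend wins (e.g. metal before cuda).
    let first_gpu = inventory.iter().find(|b| b.is_gpu && b.available);
    let cpu = inventory.iter().find(|b| b.name == "cpu");
    match name {
        "gpu" => first_gpu.ok_or_else(|| "no GPU runtime available".to_string()),
        "gpu-or-cpu" => first_gpu
            .or(cpu)
            .ok_or_else(|| "no runtime registered".to_string()),
        other => inventory
            .iter()
            .find(|b| b.name == other && b.available)
            .ok_or_else(|| format!("runtime `{other}` not found or unavailable")),
    }
}
```

Because the alias is applied before the match, the inventory itself never needs a second `default` entry.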

The `tensorflow` 0.21.0 crate (Rust binding for libtensorflow) was only pulled in behind the dead `conform` cargo feature — which gated `tract compare --tf` (compare tract output against running on libtensorflow on the same model). The feature isn't enabled in any GitHub workflow; only a stranded `.travis/tf.sh` ever ran it. The upstream `tensorflow` crate hasn't shipped since 2023-08-15 and pins to rust-protobuf 2.27.x, which trips RUSTSEC-2024-0437. Drop the feature and all its plumbing. Tract's own `.pb` parsing (used by `-t transformers_detect_all` and the `tf` cargo feature in tract-cli) goes through prost and is unaffected — the `tract-tensorflow` crate stays, just without the libtensorflow runtime. Cargo.lock shrinks by ~350 lines as a side-effect.

The LayerNorm op's `wire` expansion casts `normalized` back to fact.datum_type *before* applying scale/bias, then multiplies that result with `cast_scale` (which is still in self.datum_type, F32). With F16 inputs this becomes F16 × F32, whose output is downgraded to F32 by `mul()`. The inference rule then asserts `outputs[0].datum_type == inputs[0].datum_type` (F16) against the actual F32 output, failing `into_typed()` with:

    Output mismatch after rewiring expansion for output #0: expected 1,256,384,F16 got 1,256,384,F32

Fix: defer the cast back to fact.datum_type until after all scale/bias operations. Now the expansion stays entirely in self.datum_type (F32) through normalized × scale + bias, and casts only the final result. Behavior is unchanged for F32 inputs (the final cast is a no-op when fact.datum_type == self.datum_type).

Reproduced with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 exported via `optimum.exporters.onnx.main_export(..., dtype="fp16")` and loaded with `into_optimized().into_runnable()`.

The single-thread MMM tile walk used a naive nested loop, re-streaming the full inner operand (all of A in col-outer / B in row-outer) per panel at large k, which is memory/L1-bound. The multithread path already 2D-blocks the panel grid (chunk_grid); this brings the same blocking to the single-thread path, with the block edge cache-derived (detected L2/3, conservative 256 KiB fallback) so it stays L2-resident across hardware and never over-blocks a cache it cannot see. Bit-identical: it only reorders independent tiles (each computes its full-k reduction into a disjoint C region). The block-edge floor of 1 degrades exactly to the naive loop; the cap of 16 matches the multithread chunk_grid blocking already shipped on all platforms. Frame-level, so all kernels benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV / multithreaded shapes are unchanged. Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against the naive reference (the existing frame proptests only reach 3 panels). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
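The reordering argument can be checked on a toy panel grid. The sketch below only models the visiting order of tiles; `naive_walk` and `blocked_walk` are hypothetical names, the naive loop is assumed to be row-outer, and the cache-derived choice of block edge is left out entirely.

```rust
/// Row-outer naive walk over an `m x n` panel grid.
pub fn naive_walk(m: usize, n: usize) -> Vec<(usize, usize)> {
    let mut order = Vec::with_capacity(m * n);
    for i in 0..m {
        for j in 0..n {
            order.push((i, j));
        }
    }
    order
}

/// 2D-blocked walk: finishes one `edge x edge` block of panels before moving
/// to the next. `edge` is clamped to the 1..=16 range quoted in the text.
pub fn blocked_walk(m: usize, n: usize, edge: usize) -> Vec<(usize, usize)> {
    let edge = edge.clamp(1, 16);
    let mut order = Vec::with_capacity(m * n);
    for bi in (0..m).step_by(edge) {
        for bj in (0..n).step_by(edge) {
            for i in bi..(bi + edge).min(m) {
                for j in bj..(bj + edge).min(n) {
                    order.push((i, j));
                }
            }
        }
    }
    order
}
```

Both walks visit every tile exactly once, which is why the result stays bit-identical when each tile writes a disjoint region of C; with `edge = 1` the blocked order collapses to the naive one.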

The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant (it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt --check` failed in CI. Pure formatting; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Passing 'v0.24.0' to post-release.sh writes 'version = "v0.24.0"' into every Cargo.toml — invalid semver, breaks the workspace, easy to do by muscle memory because the git tag does carry the 'v' prefix. Bail out early in both release.sh and post-release.sh when the argument doesn't match an unprefixed semver.
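The guard lives in shell scripts, but the check itself is easy to state. Here is a sketch in Rust of one plausible rule, assuming "unprefixed semver" means MAJOR.MINOR.PATCH with an optional pre-release suffix; `is_unprefixed_semver` is a hypothetical name and the scripts' actual pattern is not shown in the source.

```rust
/// Accepts MAJOR.MINOR.PATCH with an optional `-prerelease` suffix and no
/// leading `v`. Looser than full SemVer (leading zeros are not rejected).
pub fn is_unprefixed_semver(arg: &str) -> bool {
    // Split off the pre-release part, if any.
    let (core, pre) = match arg.split_once('-') {
        Some((core, pre)) => (core, Some(pre)),
        None => (arg, None),
    };
    let numeric = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    let parts: Vec<&str> = core.split('.').collect();
    let core_ok = parts.len() == 3 && parts.iter().all(|&p| numeric(p));
    // Pre-release identifiers: dot-separated, non-empty, alphanumeric or '-'.
    let pre_ok = pre.map_or(true, |p| {
        p.split('.')
            .all(|id| !id.is_empty() && id.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-'))
    });
    core_ok && pre_ok
}
```

Under this rule `0.24.0`, `0.23.1-pre` and `0.23.0-dev.6` pass, while `v0.24.0` is rejected before any Cargo.toml is touched.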

For models trained with sliding-window attention (Mistral, Gemma-style local/global): a fixed-capacity ring buffer that overwrites the oldest slot on append, so decode runs at CONSTANT memory + per-step cost regardless of context length, losslessly (the model is trained to attend only within the window). Cheap because decode attention is ORDER-INVARIANT over keys (O = Σ softmax_j·V_j is unchanged under a (K,V) permutation), so the ring buffer never needs un-rotation — the consumer attends over the W physical slots as-is. Validated: holds the last-W as a set (incl. prefill chunk > window); windowed attention == ordered last-W attention (close, float summation order); memory bounded at W. Companion to the in-place cache (sonos#2321) = 'in-place cache with a cap + wraparound'. 3 tests, fmt+clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
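The order-invariance argument can be made concrete with a toy cache. This is a sketch under stated assumptions: `RingKv` and `softmax_attend` are hypothetical names, keys and values are plain `Vec<f32>` rows for a single head, and there is no scaling factor or batching.

```rust
/// Fixed-capacity ring buffer of (key, value) rows; appending past capacity
/// overwrites the oldest slot.
pub struct RingKv {
    slots: Vec<(Vec<f32>, Vec<f32>)>,
    capacity: usize,
    next: usize, // physical slot the next append writes to
}

impl RingKv {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        RingKv { slots: Vec::with_capacity(capacity), capacity, next: 0 }
    }

    pub fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
        if self.slots.len() < self.capacity {
            self.slots.push((k, v));
        } else {
            self.slots[self.next] = (k, v);
        }
        self.next = (self.next + 1) % self.capacity;
    }

    pub fn len(&self) -> usize {
        self.slots.len()
    }

    /// Attends over the physical slots as stored, with no un-rotation.
    pub fn attend(&self, q: &[f32]) -> Vec<f32> {
        softmax_attend(q, &self.slots)
    }
}

/// O = sum_j softmax_j(q . k_j) * v_j, in whatever order `kv` is given.
pub fn softmax_attend(q: &[f32], kv: &[(Vec<f32>, Vec<f32>)]) -> Vec<f32> {
    let scores: Vec<f32> = kv
        .iter()
        .map(|(k, _)| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>())
        .collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = weights.iter().sum();
    let dim = kv.first().map_or(0, |(_, v)| v.len());
    let mut out = vec![0.0f32; dim];
    for (w, (_, v)) in weights.iter().zip(kv) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w / denom * x;
        }
    }
    out
}
```

After five appends into a capacity-3 ring, the slots hold steps 3, 4, 2 in physical order; attending over them should agree with attention over the ordered last three steps up to float summation order.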

The com.microsoft GQA op carries the sliding window as a local_window_size attribute (verified: real Mistral-v0.1 exports encode it this way, all layers =4096) — not as an explicit mask. tract was rejecting it outright, so those models failed to import. Accept it and apply it as a banded causal mask in the existing concrete-shape mask path: query i attends to key j iff j <= i AND i - j < window. window=0 stays plain causal. Symbolic seq lengths bail (a static band can't be built; is_causal would silently widen to full attention). windowed_causal_mask helper + unit test (band correctness). Pairs with the bounded ring-buffer KV cache for the decode-side efficiency (separate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
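The band rule stated above translates directly into a mask builder. The helper name `windowed_causal_mask` comes from the commit message, but the signature and the `Vec<Vec<bool>>` representation below are assumptions made for illustration; the sketch also assumes query and key positions share an origin, as in prefill.

```rust
/// Builds a `q_len x k_len` mask where `mask[i][j]` is true iff query `i`
/// may attend to key `j`: j <= i and i - j < window.
/// `window == 0` means plain causal attention.
pub fn windowed_causal_mask(q_len: usize, k_len: usize, window: usize) -> Vec<Vec<bool>> {
    (0..q_len)
        .map(|i| {
            (0..k_len)
                // `i - j` is only evaluated when j <= i, so it cannot underflow.
                .map(|j| j <= i && (window == 0 || i - j < window))
                .collect()
        })
        .collect()
}
```

For example, with `window = 2` query 3 sees only keys 2 and 3, while `window = 0` leaves the usual lower-triangular mask.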

Stateful fused op owning K/V sliding-window ring buffers: each decode step appends K_new/V_new and attends Q over the (<=window) bounded cache. The bounded cache IS the sliding window, so attending over it equals windowed attention (causal=false: every cached key is within the current query's window) -> constant memory + per-step cost, losslessly. Op/EvalOp/TypedOp + OpState + freeze/unfreeze. Validated: window_sdpa_decode_matches_last_w_in_model builds a real TypedModel, runs it through tract's engine (into_runnable/spawn/run) over 15 decode steps with window=5, and matches full attention over the last-W each step, cache bounded to W throughout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

WindowKvSdpaTransform { window }: strips the GQA broadcast chain then fuses {DynKeyValueCache(K), DynKeyValueCache(V), Sdpa} -> WindowKvSdpa{window} (window threaded via the Rewriter context), so an imported decode model uses the bounded sliding-window cache. The window comes from the model (GQA local_window_size / config), supplied to the transform. Validated: transform_fuses_cache_sdpa_to_windowed_decode (caches+Sdpa removed, rewritten model does correct windowed decode vs full-attn-over-last-W). NNEF ser/de (tract_transformers_window_kv_sdpa, registered) + round-trip test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


…#2327) Per @kali's review: the two-arm match was a convoluted way to clamp negatives to 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


Mirrors full.yml + cross-platform.yml. The workflow already runs on `pull_request:`; this adds the manual dispatch path for testing a specific PR (including from a fork) from the Actions tab. A new `prepare` job derives `test_ref` from the `pr_number` input or from the triggering pull_request, and every job now checks out at that ref.

…os#2332) The transform conflated two concerns: flipping external_state (a fixup for NNEF artifacts predating the flag) and substituting the scan-axis symbol with 1 model-wide. The latter is the caller's per-call seq=1 contract, needed only for declutter_single_loop's separate iters==1 gate, not implied by external state. Keep the transform flag-only; harnesses now drive inlining with an explicit -t set_symbols (T / TARGETS__TIME = 1) alongside the flag.

…omment (sonos#2334) Supersedes dependabot sonos#2333. Pins the v6.1.2 commit (acca2b1b) and replaces the floating '# v6' comment with '# v6.1.2'. zizmor's unpinned-uses flags a hash pin whose comment tag no longer resolves to the pinned SHA; a major-only tag drifts on every patch release, so use the exact version tag (immutable, matches the SHA).

* build: sync Cargo.lock workspace versions to 0.23.1-pre

  post-release 0.23.1-pre (688b476) bumped the crate manifests but left Cargo.lock at 0.23.0, so every cargo build rewrote these 24 workspace-member version entries and showed a spurious modified Cargo.lock.

* release: sync Cargo.lock in post-release version bump

  post-release.sh edited manifests via tomato (which never invokes cargo) and committed without regenerating Cargo.lock, shipping a lock that mismatched the bumped versions. Add 'cargo update --workspace' before the commit. release.sh avoids this incidentally because cargo publish re-resolves the lock first.

com.microsoft.RotaryEmbedding is identical math to the standardized ai.onnx op but orders its inputs (input, position_ids, cos, sin). tract resolves ops by name regardless of domain, so make the single handler domain-aware and remap inputs accordingly. Rejects the contrib-only scale != 1.0 and is_packed_batching attributes with clear errors. Verified bit-exact against onnxruntime (3D, 4D, interleaved); ai.onnx RotaryEmbedding conformance unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Curated behavioral subset of AGENTS.md (fmt, commit/comment style, model-edit tooling, public API, test placement, PR etiquette) inlined so it lands in the auto-loaded context; AGENTS.md stays the full reference.

The single "default to none" rule was suppressing doc comments along with inline narration. Scope the austerity to inline comments and add a doc-comment section that encourages concise item docs (contract, inputs, rule interactions) while keeping the same no-benchmarks/no-history rule.

…c + seq-len lowering heuristic

P·V is computed as one contiguous tile GEMM (`s.dot(&vblock)`) instead of `head_dim` strided per-column dots; the strided column access defeated vectorization. Bit-exact (max_abs = 0 vs a naive softmax(QKᵀ·scale)·V ref).

The independent (batch, q-head) tasks now run across cores on rayon's global pool; heads share only read-only Q/K/V and write disjoint output slices. Disable with TRACT_FLASH_SDPA_ST=1; single-threaded on wasm. The op scales ~5x across an Apple M1 Pro's 6 performance cores (compute-bound, not memory-bound in that range).

Sdpa::codegen gains a sequence-length heuristic: an f32 SDPA whose K/V length is below TRACT_FLASH_SDPA_MIN_SEQ_LEN lowers to the decomposed matmul+softmax path instead of FlashSdpaOp. Default 0 keeps flash for every length (with head parallelism it beat the decomposed path at every size measured, 128–4096); raise it on hosts where short-sequence decompose wins.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wheels host setup now provisions from rust-toolchain.toml instead of a hardcoded dtolnay stable ref; the fmt check and the contributor rule move off pinned 1.91.0 rustfmt to stable, matching what the toml selects locally.

crates.yml reads rust-version from the workspace manifest and feeds it to the test matrix and the cuda-minimum-deploy job, so the MSRV lives in one place. cross-platform and examples drop their pinned-toolchain override and inherit stable from rust-toolchain.toml; MSRV stays covered by the crates matrix.

…st-toolchain.toml ci-system-setup.sh and native.sh forced RUSTUP_TOOLCHAIN=1.91.0 when unset, which overrode the toml across full.yml, large_models and cross.sh. Dropping the default lets them use stable; a caller-set RUSTUP_TOOLCHAIN is still honored.

The pinned SHA de0fac2e is actions/checkout v6.0.2, but the comments read # v6; that tag has since moved to df4cb1c, so zizmor flagged the mismatch. Label the exact release across the workflows this branch touches.

…kflows Same stale # v6 comment on the de0fac2e pin in the workflows this branch had not yet touched. Verified via the GitHub API that every other pinned action's version comment still resolves to its SHA; only checkout had drifted.

Add six f16 element-wise activations on x86 AVX-512: sigmoid_f16, tanh_f16, hardswish_f16, leaky_relu_f16, silu_f16, gelu_f16.

Each kernel chunks the input through a 64-byte-aligned f32 scratch (CHUNK=256), dispatches to the matching f32 AVX-512 kernel (the avx512_sigmoid_f32 / avx512_tanh_f32 wrappers, or the act:: hardswish / leaky_relu / silu / gelu kernels), and converts back to f16. silu and gelu compose sigmoid_f32 / tanh_f32 with the final combine done in f32.

The f16 <-> f32 conversion is driven by vcvtph2ps / vcvtps2ph via std::arch intrinsics (cvt_f16_to_f32 / cvt_f32_to_f16 helpers); rustc + LLVM do not autovectorize the scalar f16::to_f32 / f16::from_f32 loops, which is why a naive port leaves AVX-512 stuck at ~7 Melem/s.

Wires into Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16 from plug_avx512f; non-AVX512 x86 keeps the generic scalar f16 kernels.

Validated against the generic H<Op>8 reference via the existing *_frame_tests! macros at SuperApproximate tolerance, which covers the precision delta between scalar f16 arithmetic and f32-internal computation.

Measured on Cascade Lake (single-thread, throughput Gelem/s):

- sigmoid_f16: 0.016 -> 1.54 (96x)
- tanh_f16: 0.018 -> 1.61 (92x)
- hardswish_f16: 0.051 -> 9.46 (186x)
- leaky_relu_f16: 0.96 -> 10.4 (11x; generic baseline is unexpectedly fast)
- silu_f16: 0.20 -> 0.93 (4.6x)
- gelu_f16: 0.11 -> 0.75 (6.7x)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds sme_qmmm_i32_32x32, a 32x32 int8->i32 matmul kernel using SME2 SMOPA (i8 outer-product), selected for qmmm_i32 when FEAT_SME2 is present. Consumes the same K=4-inner PackedI8K4 packing as the SDOT kernel and implements the int8 quant fuse ops (q_scale / rounding-shift / shift-left) bit-exactly (spill-ZA->scratch->reload); only LeakyRelu is unsupported. Builds on the SDOT kernel (sonos#2278) and dispatch fix (sonos#2277): needs PackedI8K4 plus the matmul/conv lowering from sonos#2278. sme_qmmm 114/114 on M4 (SME2, SVL=512), bit-exact vs the NEON kernels. core 244/244, linalg 3931/3931. Assembles + gates off cleanly on non-SME2 arm64 (kernel present, runtime-gated; M1 build + regression green). Apple M4 e2e, single MatMulInteger, vs the SDOT kernel (sonos#2278): 1024^3 4.67->0.95 ms (4.9x), 512^3 0.70->0.17 ms (4.0x), 128x768x3072 1.80->1.17 ms (1.54x), 32x2048x2048 1.33->0.90 ms (1.47x). Wash on small/overhead-bound matmuls (MiniLM/InceptionV1 seq=128); the win is compute-bound int8 GEMM (large batch/hidden, LLM prompt). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add zmm (16-wide) implementations of softmax2-fastcompact and max-reduce, overriding the FMA versions when avx512f is present; non-AVX512 x86 unchanged. Measured on Cascade Lake (single-thread): +16% on max-reduce and +54% on exp+sum vs the existing FMA assembly paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add x86_64_avx512_erf_f32_64n: AVX-512 (zmm, 16-wide) erf kernel mirroring generic/erf.rs::serf (Abramowitz & Stegun 7.1.28 six-coefficient approximation), processing 64 lanes per iteration via 4 zmm registers and FMA Horner chains. Wires into Ops::erf_f32 from plug_avx512f; non-AVX512 x86 keeps the generic scalar path. Also introduces linalg/src/frame/erf.rs with the erf_frame_tests! macro (mirrors frame/hardswish.rs structure) so both the generic and AVX-512 implementations share a single proptest reference. Measured on Cascade Lake (single-thread): ~4x over the autovectorized generic scalar on hosts with FMA; the gap is significantly larger on pre-FMA x86 where the compiler can't autovectorize the polynomial. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kali added 2 commits May 28, 2026 11:00

czoli1976 force-pushed the feat/avx512-erf branch from b867814 to 7a40f25 Compare May 28, 2026 09:59

kali and others added 10 commits May 28, 2026 13:42

build: track Cargo.lock

a5eeab5

Pin the resolved dependency graph so debug builds, release artifacts, SBOMs and security audits all see the same versions. CI runs against this lockfile; `cargo update` is the explicit knob for bumping deps.

czoli1976 mentioned this pull request May 28, 2026

linalg/x86_64: AVX-512_FP16 native f16 hardswish kernel #10

Open

7 tasks

kali added 2 commits May 28, 2026 17:39

changelog update

ff8abf2

czoli1976 force-pushed the feat/avx512-erf branch from 7a40f25 to eda20d0 Compare May 29, 2026 08:12

kali added 5 commits May 29, 2026 10:25

fmt

5ca8dab

setup deny for cli

2614eea

czoli1976 force-pushed the feat/avx512-erf branch from eda20d0 to 508a534 Compare May 29, 2026 09:59

qingjie.du and others added 7 commits May 29, 2026 13:37

fmt

1d6d1aa

release 0.23.0-dev.6

9dcf880

post-release v0.23.0-pre

3ed1478

czoli1976 and others added 29 commits June 2, 2026 13:27

onnx/gqa: simplify local_window_size clamp with .max(0) (review sonos#2327)

14a8024

Per @kali's review: the two-arm match was a convoluted way to clamp negatives to 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

no need to freeze in Runnable::run (sonos#2328)

4e8bd21

core/ops/fft: Stft::axes_mapping unlocks pulsification on non-STFT axes (sonos#2331)

8398ec0

pulse/ops/array: linearity-checked per-pulse size for MultiBroadcastTo stream axis

e68a35d

docs: add CLAUDE.md with contributor rules

986ad7c

Curated behavioral subset of AGENTS.md (fmt, commit/comment style, model-edit tooling, public API, test placement, PR etiquette) inlined so it lands in the auto-loaded context; AGENTS.md stays the full reference.

add rust-toolchain.toml pinning stable channel

6b8550c

consolidate CI on rust-toolchain.toml; check fmt with stable

51963c5

wheels host setup now provisions from rust-toolchain.toml instead of a hardcoded dtolnay stable ref; the fmt check and the contributor rule move off pinned 1.91.0 rustfmt to stable, matching what the toml selects locally.

docs: align AGENTS.md fmt rule with stable rust-toolchain.toml

5f5d7f5

comment rust-version as MSRV source of truth (keep README badge in sync)

7523a3b

ci: correct actions/checkout pin comments to v6.0.2

d143381

The pinned SHA de0fac2e is actions/checkout v6.0.2, but the comments read # v6; that tag has since moved to df4cb1c, so zizmor flagged the mismatch. Label the exact release across the workflows this branch touches.

core/ops/array: implement set_symbols on DynSlice and Topk

249812d

kali force-pushed the feat/avx512-erf branch from 508a534 to f80acd5 Compare June 4, 2026 15:04

czoli1976 wants to merge 67 commits into base/sonos-main from feat/avx512-erf
