linalg/x86_64: add AVX-512 f16 element-wise activations by czoli1976 · Pull Request #8 · czoli1976/tract

czoli1976 · 2026-05-28T10:45:58Z

Summary

Stacked on PR #4 (feat/avx512-activations). Adds six f16 element-wise activations on x86 AVX-512:

sigmoid_f16, tanh_f16 (compose over PR #4's avx512_sigmoid_f32 / avx512_tanh_f32)
hardswish_f16, leaky_relu_f16 (compose over PR #4's f32 kernels of the same op)
silu_f16, gelu_f16 (compose at the kernel level: f32 polynomial + scalar combine in f32, mirroring PR #4's f32 versions)

Each kernel chunks the input through a 64-byte-aligned f32 scratch (CHUNK = 256), runs the matching f32 AVX-512 kernel, and converts back to f16. f16↔f32 conversion is driven by vcvtph2ps / vcvtps2ph via std::arch intrinsics (helpers cvt_f16_to_f32 / cvt_f32_to_f16) — rustc + LLVM do NOT autovectorize the scalar f16::to_f32 / f16::from_f32 loops, which is why a naive port leaves AVX-512 stuck at ~7 Melem/s. The intrinsics-based path gets back to the per-op AVX-512 ceiling.
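The chunked round-trip described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: it uses the 256-bit F16C forms of vcvtph2ps / vcvtps2ph (8 lanes per instruction) instead of the 512-bit ones so that it runs on more hosts, it represents f16 values as raw `u16` bit patterns, and the names `Scratch` and `apply_via_f32` are hypothetical. Only CHUNK = 256, the 64-byte alignment, and the helper names cvt_f16_to_f32 / cvt_f32_to_f16 come from the description; their signatures here are assumed.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Chunk length, as stated in the PR description.
const CHUNK: usize = 256;

/// One chunk of f32 scratch, 64-byte aligned (hypothetical type name).
#[repr(align(64))]
struct Scratch([f32; CHUNK]);

/// Widen f16 bit patterns to f32, 8 lanes per vcvtph2ps.
/// Safety: requires AVX + F16C; both lengths must be equal multiples of 8.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn cvt_f16_to_f32(src: &[u16], dst: &mut [f32]) {
    for (h, f) in src.chunks_exact(8).zip(dst.chunks_exact_mut(8)) {
        unsafe {
            let half = _mm_loadu_si128(h.as_ptr() as *const __m128i);
            _mm256_storeu_ps(f.as_mut_ptr(), _mm256_cvtph_ps(half));
        }
    }
}

/// Narrow f32 to f16 bit patterns (round to nearest), 8 lanes per vcvtps2ph.
/// Safety: same requirements as `cvt_f16_to_f32`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn cvt_f32_to_f16(src: &[f32], dst: &mut [u16]) {
    for (f, h) in src.chunks_exact(8).zip(dst.chunks_exact_mut(8)) {
        unsafe {
            let wide = _mm256_loadu_ps(f.as_ptr());
            let half = _mm256_cvtps_ph::<_MM_FROUND_TO_NEAREST_INT>(wide);
            _mm_storeu_si128(h.as_mut_ptr() as *mut __m128i, half);
        }
    }
}

/// Run an f32 kernel `op` over f16 data in place, one CHUNK at a time.
/// Returns false (leaving `data` untouched) when F16C is unavailable.
fn apply_via_f32<F: FnMut(&mut [f32])>(data: &mut [u16], mut op: F) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("f16c") {
            let mut scratch = Scratch([0.0; CHUNK]);
            let mut halves = [0u16; CHUNK];
            for chunk in data.chunks_mut(CHUNK) {
                let n = chunk.len();
                let padded = (n + 7) / 8 * 8; // CHUNK is a multiple of 8
                // Staging copy keeps the tail simple; a real kernel would avoid it.
                halves[..n].copy_from_slice(chunk);
                halves[n..padded].fill(0);
                unsafe { cvt_f16_to_f32(&halves[..padded], &mut scratch.0[..padded]) };
                op(&mut scratch.0[..n]);
                unsafe { cvt_f32_to_f16(&scratch.0[..padded], &mut halves[..padded]) };
                chunk.copy_from_slice(&halves[..n]);
            }
            return true;
        }
    }
    let _ = (&data, &mut op); // silence unused warnings on other targets
    false
}
```

Passing a leaky_relu-style closure such as `|xs| for x in xs.iter_mut() { *x = (*x).max(0.5 * *x) }` exercises the same shape as the PR's kernels: widen a chunk, run the f32 op, narrow the result.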

Wires into Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16 from plug_avx512f; non-AVX512 x86 keeps the generic scalar f16 kernels.

Bench (local, single-thread, Cascade Lake, throughput Gelem/s):

op	generic	AVX-512	speedup
sigmoid_f16	0.016	1.54	96×
tanh_f16	0.018	1.61	92×
hardswish_f16	0.051	9.46	186×
leaky_relu_f16	0.96	10.4	11×
silu_f16	0.20	0.93	4.6×
gelu_f16	0.11	0.75	6.7×

leaky_relu and silu show smaller ratios because the generic baseline is already faster than the sigmoid/tanh polynomial paths (leaky_relu is just a max + multiply; silu uses a generic sigmoid that's faster than HSigmoid8 in isolation).

Test plan

cargo test --release -p tract-linalg — 2708 passed, 0 failed (21 new f16 activation tests: 3-7 cases per op, including proptest)
cargo bench --bench activations_avx512_f16 — numbers above
Non-AVX512 x86 hosts unchanged (fallback exercised via avx512f gating in plug_avx512f)

Dependencies

This PR is stacked on PR #4 (feat/avx512-activations). It imports act::x86_64_avx512_hardswish_f32_64n, act::x86_64_avx512_leaky_relu_f32_64n, avx512_sigmoid_f32, avx512_tanh_f32. If PR #4 lands upstream first, this PR rebases trivially. If shipping standalone, it would also need PR #4's content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Generated by Claude Code

Adds a native f16 hardswish kernel using the avx512fp16 ISA (Sapphire Rapids / Granite Rapids / later Intel). It processes 128 f16 lanes per iteration via 4 zmm registers of 32 f16 each, using vaddph / vminph / vmaxph / vmulph: no f32 round-trip, no vcvtph2ps/vcvtps2ph at the IO boundary.

Wired through a new `plug_avx512fp16` step that runs after `plug_avx512f` on hosts where `is_x86_feature_detected!("avx512fp16")` is true. The f32-roundtrip hardswish_f16 kernel from `act_f16.rs` remains in place as the avx512f-only fallback (Skylake-X, Cascade Lake, Ice Lake server prior to fp16 extension).

Bench on Sapphire Rapids (n=1024, single thread, Criterion), hardswish_f16:

generic	52.3 Melem/s
avx512_f32roundtrip	8.71 Gelem/s (current #8 path)
avx512fp16_native	31.6 Gelem/s (this PR, 3.62× over the roundtrip)

A native leaky_relu_f16 kernel is also included but NOT wired: on Sapphire Rapids it benched 38% slower than the f32-roundtrip version (5.85 vs 9.44 Gelem/s). The two-op-per-element compute path (vmulph + vmaxph) does not saturate the FP16 execution port the same way the equivalent f32 ops saturate the FP32 ports. The kernel is correct (4/4 frame tests pass, including proptest against the f16 reference) and is kept in the source for a future revisit on different fp16 uarchs where the comparison might flip.

Tests: linalg 2845 passed, 0 failed (+4 new frame tests). Cross-arch `cargo check` clean on aarch64-unknown-linux-gnu and wasm32-unknown-unknown (plug_avx512fp16 is x86_64-only and feature-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
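For reference, the per-lane arithmetic that the add / max / min / multiply sequence computes can be written as a scalar model. This is a sketch only: it uses f32 rather than f16, the function names are hypothetical, and folding the division by 6 into a multiply by 1/6 is an assumption about how a kernel might schedule it, not something the commit message states.

```rust
/// Scalar model of one lane: hardswish(x) = x * min(max(x + 3, 0), 6) / 6.
/// The comments name the packed-f16 instruction each step would map to.
fn hardswish_lane(x: f32) -> f32 {
    let t = x + 3.0; // vaddph
    let t = t.max(0.0); // vmaxph
    let t = t.min(6.0); // vminph
    x * t * (1.0 / 6.0) // vmulph, twice (assumed: multiply by 1/6 instead of dividing)
}

/// Apply the lane model to a whole slice in place.
fn hardswish(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = hardswish_lane(*x);
    }
}
```

Inputs at or below -3 map to zero, inputs at or above 3 pass through unchanged, and the range in between is scaled smoothly.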

runtime_for_name("gpu") → first GPU backend whose `check()` passes (metal, then cuda); error if none are available. runtime_for_name("gpu-or-cpu") → same lookup, but falls through to the `default` CPU runtime instead of erroring. No new mechanism — both names walk the existing inventory and use each backend's existing `check()` to decide availability. Backend-specific names (`cuda`, `metal`) still work as before.
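The lookup rules above can be sketched as below. This is a minimal model, not tract's implementation: the `Backend` struct, its `available` field, the slice-based inventory, and the `String` error type are all hypothetical stand-ins for the real runtime inventory and each backend's `check()`.

```rust
/// Hypothetical stand-in for a registered runtime.
struct Backend {
    name: &'static str,
    available: bool,
}

impl Backend {
    /// Models the backend's availability check.
    fn check(&self) -> bool {
        self.available
    }
}

/// GPU backends in the preference order given above: metal, then cuda.
const GPU_ORDER: [&str; 2] = ["metal", "cuda"];

fn by_name<'a>(inventory: &'a [Backend], name: &str) -> Option<&'a Backend> {
    inventory.iter().find(|b| b.name == name)
}

/// First GPU backend, in preference order, whose check passes.
fn first_gpu<'a>(inventory: &'a [Backend]) -> Option<&'a Backend> {
    GPU_ORDER
        .iter()
        .find_map(|gpu| by_name(inventory, gpu).filter(|b| b.check()))
}

fn runtime_for_name<'a>(inventory: &'a [Backend], name: &str) -> Result<&'a Backend, String> {
    match name {
        // Error if no GPU backend is available.
        "gpu" => first_gpu(inventory).ok_or_else(|| "no GPU runtime available".to_string()),
        // Same lookup, but fall through to the `default` CPU runtime.
        "gpu-or-cpu" => first_gpu(inventory)
            .or_else(|| by_name(inventory, "default"))
            .ok_or_else(|| "no runtime available".to_string()),
        // Backend-specific names resolve by plain name lookup.
        other => by_name(inventory, other).ok_or_else(|| format!("unknown runtime {other}")),
    }
}
```

Both virtual names share `first_gpu`, which mirrors the point that no new mechanism is introduced: only the fallback differs.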

…or_name The CPU runtime now reports its own name as `cpu` (which is what it is), so `list-runtimes` shows `cpu`, `cuda`, `metal` … instead of the misleading `default`. Back-compat for callers passing `default` is handled by a one-line alias in `runtime_for_name` rather than by registering two runtimes or by polluting the trait — the alias only affects name lookup, not the inventory.

The `tensorflow` 0.21.0 crate (Rust binding for libtensorflow) was only pulled in behind the dead `conform` cargo feature — which gated `tract compare --tf` (compare tract output against running on libtensorflow on the same model). The feature isn't enabled in any GitHub workflow; only a stranded `.travis/tf.sh` ever ran it. The upstream `tensorflow` crate hasn't shipped since 2023-08-15 and pins to rust-protobuf 2.27.x, which trips RUSTSEC-2024-0437. Drop the feature and all its plumbing. Tract's own `.pb` parsing (used by `-t transformers_detect_all` and the `tf` cargo feature in tract-cli) goes through prost and is unaffected — the `tract-tensorflow` crate stays, just without the libtensorflow runtime. Cargo.lock shrinks by ~350 lines as a side-effect.

The LayerNorm op's `wire` expansion casts `normalized` back to fact.datum_type *before* applying scale/bias, then multiplies that result with `cast_scale` (which is still in self.datum_type, F32). With F16 inputs this becomes F16 × F32, which `mul()` resolves to an F32 output. The inference rule then asserts `outputs[0].datum_type == inputs[0].datum_type` (F16) against the actual F32 output, failing `into_typed()` with:

    Output mismatch after rewiring expansion for output #0: expected 1,256,384,F16 got 1,256,384,F32

Fix: defer the cast back to fact.datum_type until after all scale/bias operations. Now the expansion stays entirely in self.datum_type (F32) through normalized × scale + bias, and casts only the final result. Behavior is unchanged for F32 inputs (the final cast is a no-op when fact.datum_type == self.datum_type).

Reproduced with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 exported via `optimum.exporters.onnx.main_export(..., dtype="fp16")` and loaded with `into_optimized().into_runnable()`.

The single-thread MMM tile walk used a naive nested loop, re-streaming the full inner operand (all of A in col-outer / B in row-outer) per panel at large k, which is memory/L1-bound. The multithread path already 2D-blocks the panel grid (chunk_grid); this brings the same blocking to the single-thread path, with the block edge cache-derived (detected L2/3, conservative 256 KiB fallback) so it stays L2-resident across hardware and never over-blocks a cache it cannot see. Bit-identical: it only reorders independent tiles (each computes its full-k reduction into a disjoint C region). The block-edge floor of 1 degrades exactly to the naive loop; the cap of 16 matches the multithread chunk_grid blocking already shipped on all platforms. Frame-level, so all kernels benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV / multithreaded shapes are unchanged. Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against the naive reference (the existing frame proptests only reach 3 panels). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
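The reordering argument can be illustrated with a small sketch that only enumerates panel coordinates. It assumes a row-outer naive loop and takes the block edge as a plain parameter; the cache-derived edge computation is not modeled, and both function names are hypothetical.

```rust
/// Naive walk: row-outer, column-inner over an m x n grid of panels.
fn naive_walk(m: usize, n: usize) -> Vec<(usize, usize)> {
    (0..m).flat_map(|i| (0..n).map(move |j| (i, j))).collect()
}

/// 2D-blocked walk: visit the grid in edge x edge blocks.
/// The edge is clamped to the floor of 1 and cap of 16 mentioned above.
fn blocked_walk(m: usize, n: usize, edge: usize) -> Vec<(usize, usize)> {
    let edge = edge.clamp(1, 16);
    let mut order = Vec::with_capacity(m * n);
    for bi in (0..m).step_by(edge) {
        for bj in (0..n).step_by(edge) {
            // Panels inside one block are walked before moving on,
            // so the operands they share stay hot in cache.
            for i in bi..(bi + edge).min(m) {
                for j in bj..(bj + edge).min(n) {
                    order.push((i, j));
                }
            }
        }
    }
    order
}
```

Every tile is visited exactly once in either order, which is why the result can be bit-identical: each tile computes its own full-k reduction into a disjoint region of C. With an edge of 1 the blocked walk produces the same sequence as the naive one.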

The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant (it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt --check` failed in CI. Pure formatting; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Passing 'v0.24.0' to post-release.sh writes 'version = "v0.24.0"' into every Cargo.toml — invalid semver, breaks the workspace, easy to do by muscle memory because the git tag does carry the 'v' prefix. Bail out early in both release.sh and post-release.sh when the argument doesn't match an unprefixed semver.

actions/checkout runs with persist-credentials: false, so the bare 'git push origin gh-pages' had no auth and failed with 'could not read Username'. Use the workflow's GITHUB_TOKEN in the remote URL instead — keeps zizmor happy while letting the deploy step push.

arm64simd_mmm_i32_8x8_dot: an int8->i32 8x8 matmul kernel using SDOT (FEAT_DotProd, ARMv8.2), ~4x the SMLAL 8x8 at the matmul level. Same v16..v31 tile layout as the SMLAL 8x8, so it reuses the existing i32 fuse/store/q_scale machinery, and consumes the K=4-inner PackedI8K4 packing now upstream (sonos#2281).

- Gated on has_dotprod() (Apple M1+/A11+; Linux HWCAP_ASIMDDP). TRACT_DOTPROD_DISABLE=1 forces the SMLAL 8x8 fallback so callers can A/B on one binary.
- Wired into qmmm_i32: int8 matmul/conv pick SDOT when FEAT_DotProd is present, SMLAL 8x8 otherwise. Relies on the merged dispatch fix (sonos#2277) to route 2D int8 matmuls to a matrix kernel instead of the 64x1 GEMV.
- Adds linalg/benches/qmmm_i8.rs (SDOT vs SMLAL microbench).

The kernel is compiled in a separate cc::Build step gated on a build.rs assembler probe (assembler_supports_dotprod). Old assemblers such as binutils 2.28 on Debian stretch cannot encode `sdot` and fail the probe; the `tract_arm64_dotprod` cfg is then not set, the kernel is omitted, and dispatch falls back to the SMLAL 8x8 i32 kernel. Follows the same pattern as the existing assembler_supports_sme probe.

Bit-exact vs the SMLAL kernel: linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale), core int8 matmul 25/25.

Apple M4 e2e (kernel unchanged from the original PR): MiniLM 44.4->24.8 ms (1.79x), InceptionV1 51.6->28.4 ms (1.82x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…8x8) Route qmmm_i32 through VPDPBUSD when AVX-512 VNNI is available, replacing the AVX2 per-K widening-multiply inner loop. Consumes the existing K=4-inner PackedI8K4 layout; A is offset by +128 for VPDPBUSD's u8*s8 form and the 128*sum_k(B) bias is removed per output column, so the i32 accumulators stay bit-identical to the AVX2 path and the whole quantization epilogue is reused. Runtime-gated via where(AVX512VNNI); non-VNNI x86 keeps the AVX2 fallback. Includes a vnni_i32 microbench (VNNI vs AVX2 int8). The kernel lives in its own x86_64/avx512vnni/ subdirectory and is compiled in a separate cc::Build step gated on a build.rs assembler probe (assembler_supports_avx512vnni). Old assemblers such as binutils 2.28 on Debian stretch cannot encode `vpdpbusd ymm` and will fail the probe; the `tract_avx512vnni` cfg is then not set and the kernel is omitted entirely, with dispatch falling back to the AVX2 i32 path. Follows the same pattern as the existing SME (assembler_supports_sme) and SVE (compiler_supports_sve) probes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
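The +128 offset and bias removal rest on the identity Σ (a+128)·b − 128·Σ b = Σ a·b. A scalar sketch of that arithmetic follows; it models only the math, not the VPDPBUSD instruction, its 4-wide grouping, or the PackedI8K4 layout, and the function names are hypothetical.

```rust
/// Reference signed-by-signed dot product, accumulated in i32.
fn dot_s8s8(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Same result via an unsigned-by-signed product: shift A into u8 range,
/// accumulate, then remove the 128 * sum_k(B) bias the shift introduced.
fn dot_u8s8_corrected(a: &[i8], b: &[i8]) -> i32 {
    let mut acc = 0i32;
    let mut sum_b = 0i32;
    for (&x, &y) in a.iter().zip(b) {
        let xu = (x as i16 + 128) as u8; // always in 0..=255
        acc += xu as i32 * y as i32;
        sum_b += y as i32;
    }
    acc - 128 * sum_b
}
```

Because the correction is exact in integer arithmetic, the i32 accumulators match the signed path value for value, which is what lets the existing quantization epilogue be reused unchanged.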

When an ONNX LSTM/GRU/RNN exposes its full recurrent state both as input and output (initial_h + Y_h, plus initial_c + Y_c for LSTM), the caller manages state across calls. Set Scan::external_state in that case so the existing declutter_single_loop pass can inline a single-iteration Scan (seq_len == 1) — the streaming / autoregressive-decoder regime where the one-iteration Scan is pure orchestration overhead. Previously external_state was only reachable via the manual force_scan_external_state transform, so streaming RNNs carried a dead Scan on every call. Inlining is sound here because the body's State input is fed from the outer (caller-supplied) input each call (see issue sonos#2157). Measured on DTLN (an LSTM-heavy streaming denoiser): -8% end-to-end, output unchanged at 110.47 dB / Pearson 1.00000 vs the native reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t flag The previous importer heuristic set external_state whenever a GRU/LSTM node carried initial_h/Y_h (and initial_c/Y_c). That mis-fires: DFN3's GRU nodes also carry initial_h/Y_h, but their state is carried internally by tract under pulse, not by the caller — so inlining the single-iteration Scan would break it (the 0.23 regression kali flagged). Move the decision into declutter_single_loop, which has the whole graph: inline a single-iteration Scan only when every recurrent state has a last-value output that reaches a model output, i.e. the caller can observe the updated state and thread it back. Adds outlet_reaches_model_output. DTLN (state feeds a model output) still inlines, output unchanged at 110.47 dB / Pearson 1.00000. DFN3 df_dec (Y_h reaches no model output, only coefs) is not inlined. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

De-orphan and fix the latent zmm sigmoid/tanh kernels (tail-loop stride bugs causing OOB stores for lengths not a multiple of 64), and add AVX-512 hardswish, leaky_relu, plus silu/gelu as compositions over the AVX-512 sigmoid/tanh. Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA/generic path. Measured on Cascade Lake (single-thread): sigmoid 1.24x and tanh 1.29x over the existing FMA paths; hardswish/leaky_relu/silu/gelu 5-21x over the generic scalar paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For models trained with sliding-window attention (Mistral, Gemma-style local/global): a fixed-capacity ring buffer that overwrites the oldest slot on append, so decode runs at CONSTANT memory + per-step cost regardless of context length, losslessly (the model is trained to attend only within the window). Cheap because decode attention is ORDER-INVARIANT over keys (O = Σ softmax_j·V_j is unchanged under a (K,V) permutation), so the ring buffer never needs un-rotation — the consumer attends over the W physical slots as-is. Validated: holds the last-W as a set (incl. prefill chunk > window); windowed attention == ordered last-W attention (close, float summation order); memory bounded at W. Companion to the in-place cache (sonos#2321) = 'in-place cache with a cap + wraparound'. 3 tests, fmt+clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
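The order-invariance argument can be checked with a toy model. The sketch below uses scalar keys and values and a single query purely for illustration; `RingKv`, `append`, and `attend` are hypothetical names and do not reflect the actual cache types.

```rust
/// Fixed-capacity ring of (key, value) pairs; overwrites the oldest slot.
struct RingKv {
    slots: Vec<(f32, f32)>,
    window: usize,
    next: usize,
}

impl RingKv {
    fn new(window: usize) -> Self {
        assert!(window > 0, "window must be non-zero");
        Self { slots: Vec::with_capacity(window), window, next: 0 }
    }

    fn append(&mut self, kv: (f32, f32)) {
        if self.slots.len() < self.window {
            self.slots.push(kv);
        } else {
            self.slots[self.next] = kv; // overwrite the oldest entry
        }
        self.next = (self.next + 1) % self.window;
    }
}

/// Softmax attention of one scalar query over scalar keys/values.
/// The sum does not depend on slot order (up to float rounding).
fn attend(q: f32, kv: &[(f32, f32)]) -> f32 {
    let max = kv.iter().map(|&(k, _)| q * k).fold(f32::NEG_INFINITY, f32::max);
    let (num, den) = kv.iter().fold((0.0f32, 0.0f32), |(n, d), &(k, v)| {
        let w = (q * k - max).exp();
        (n + w * v, d + w)
    });
    num / den
}
```

After ten appends into a window of four, the physical slots hold the last four entries in rotated order, and attending over them as-is agrees with attending over the ordered last four up to summation-order rounding.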

The com.microsoft GQA op carries the sliding window as a local_window_size attribute (verified: real Mistral-v0.1 exports encode it this way, all layers =4096) — not as an explicit mask. tract was rejecting it outright, so those models failed to import. Accept it and apply it as a banded causal mask in the existing concrete-shape mask path: query i attends to key j iff j <= i AND i - j < window. window=0 stays plain causal. Symbolic seq lengths bail (a static band can't be built; is_causal would silently widen to full attention). windowed_causal_mask helper + unit test (band correctness). Pairs with the bounded ring-buffer KV cache for the decode-side efficiency (separate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
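The band rule stated above (query i attends to key j iff j <= i and i - j < window, with window = 0 meaning plain causal) can be sketched as a dense boolean mask. The signature and the nested-Vec representation are assumptions for illustration; the real `windowed_causal_mask` helper may look different.

```rust
/// Build a seq x seq mask where mask[i][j] is true when query i may attend to key j.
/// window == 0 is treated as plain causal attention.
fn windowed_causal_mask(seq: usize, window: usize) -> Vec<Vec<bool>> {
    (0..seq)
        .map(|i| {
            (0..seq)
                // `j <= i` is checked first, so `i - j` cannot underflow.
                .map(|j| j <= i && (window == 0 || i - j < window))
                .collect()
        })
        .collect()
}
```

With seq = 4 and window = 2, the last query sees only keys 2 and 3, while window = 0 lets it see all four.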

Stateful fused op owning K/V sliding-window ring buffers: each decode step appends K_new/V_new and attends Q over the (<=window) bounded cache. The bounded cache IS the sliding window, so attending over it equals windowed attention (causal=false: every cached key is within the current query's window) -> constant memory + per-step cost, losslessly. Op/EvalOp/TypedOp + OpState + freeze/unfreeze. Validated: window_sdpa_decode_matches_last_w_in_model builds a real TypedModel, runs it through tract's engine (into_runnable/spawn/run) over 15 decode steps with window=5, and matches full attention over the last-W each step, cache bounded to W throughout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

WindowKvSdpaTransform { window }: strips the GQA broadcast chain then fuses {DynKeyValueCache(K), DynKeyValueCache(V), Sdpa} -> WindowKvSdpa{window} (window threaded via the Rewriter context), so an imported decode model uses the bounded sliding-window cache. The window comes from the model (GQA local_window_size / config), supplied to the transform. Validated: transform_fuses_cache_sdpa_to_windowed_decode (caches+Sdpa removed, rewritten model does correct windowed decode vs full-attn-over-last-W). NNEF ser/de (tract_transformers_window_kv_sdpa, registered) + round-trip test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Mirrors full.yml + cross-platform.yml. The workflow already runs on `pull_request:`; this adds the manual dispatch path for testing a specific PR (including from a fork) from the Actions tab. A new `prepare` job derives `test_ref` from the `pr_number` input or from the triggering pull_request, and every job now checks out at that ref.

…os#2332) The transform conflated two concerns: flipping external_state (a fixup for NNEF artifacts predating the flag) and substituting the scan-axis symbol with 1 model-wide. The latter is the caller's per-call seq=1 contract, needed only for declutter_single_loop's separate iters==1 gate, not implied by external state. Keep the transform flag-only; harnesses now drive inlining with an explicit -t set_symbols (T / TARGETS__TIME = 1) alongside the flag.

…omment (sonos#2334) Supersedes dependabot sonos#2333. Pins the v6.1.2 commit (acca2b1b) and replaces the floating '# v6' comment with '# v6.1.2'. zizmor's unpinned-uses flags a hash pin whose comment tag no longer resolves to the pinned SHA; a major-only tag drifts on every patch release, so use the exact version tag (immutable, matches the SHA).

* build: sync Cargo.lock workspace versions to 0.23.1-pre
  post-release 0.23.1-pre (688b476) bumped the crate manifests but left Cargo.lock at 0.23.0, so every cargo build rewrote these 24 workspace-member version entries and showed a spurious modified Cargo.lock.
* release: sync Cargo.lock in post-release version bump
  post-release.sh edited manifests via tomato (which never invokes cargo) and committed without regenerating Cargo.lock, shipping a lock that mismatched the bumped versions. Add 'cargo update --workspace' before the commit. release.sh avoids this incidentally because cargo publish re-resolves the lock first.

com.microsoft.RotaryEmbedding is identical math to the standardized ai.onnx op but orders its inputs (input, position_ids, cos, sin). tract resolves ops by name regardless of domain, so make the single handler domain-aware and remap inputs accordingly. Rejects the contrib-only scale != 1.0 and is_packed_batching attributes with clear errors. Verified bit-exact against onnxruntime (3D, 4D, interleaved); ai.onnx RotaryEmbedding conformance unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The single "default to none" rule was suppressing doc comments along with inline narration. Scope the austerity to inline comments and add a doc-comment section that encourages concise item docs (contract, inputs, rule interactions) while keeping the same no-benchmarks/no-history rule.

…c + seq-len lowering heuristic P·V is computed as one contiguous tile GEMM (`s.dot(&vblock)`) instead of `head_dim` strided per-column dots; the strided column access defeated vectorization. Bit-exact (max_abs = 0 vs a naive softmax(QKᵀ·scale)·V ref). The independent (batch, q-head) tasks now run across cores on rayon's global pool — heads share only read-only Q/K/V and write disjoint output slices. Disable with TRACT_FLASH_SDPA_ST=1; single-threaded on wasm. The op scales ~5x across an Apple M1 Pro's 6 performance cores (compute-bound, not memory- bound in that range). Sdpa::codegen gains a sequence-length heuristic: an f32 SDPA whose K/V length is below TRACT_FLASH_SDPA_MIN_SEQ_LEN lowers to the decomposed matmul+softmax path instead of FlashSdpaOp. Default 0 keeps flash for every length (with head parallelism it beat the decomposed path at every size measured, 128–4096); raise it on hosts where short-sequence decompose wins. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add six f16 element-wise activations on x86 AVX-512: sigmoid_f16, tanh_f16, hardswish_f16, leaky_relu_f16, silu_f16, gelu_f16.

Each kernel chunks the input through a 64-byte-aligned f32 scratch (CHUNK=256), dispatches to the matching f32 AVX-512 kernel (the avx512_sigmoid_f32 / avx512_tanh_f32 wrappers, or the act:: hardswish / leaky_relu / silu / gelu kernels), and converts back to f16. silu and gelu compose sigmoid_f32 / tanh_f32 with the final combine done in f32.

The f16 <-> f32 conversion is driven by vcvtph2ps / vcvtps2ph via std::arch intrinsics (cvt_f16_to_f32 / cvt_f32_to_f16 helpers); rustc + LLVM do not autovectorize the scalar f16::to_f32 / f16::from_f32 loops, which is why a naive port leaves AVX-512 stuck at ~7 Melem/s.

Wires into Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16 from plug_avx512f; non-AVX512 x86 keeps the generic scalar f16 kernels.

Validated against the generic H<Op>8 reference via the existing *_frame_tests! macros at SuperApproximate tolerance, which covers the precision delta between scalar f16 arithmetic and f32-internal computation.

Measured on Cascade Lake (single-thread, throughput Gelem/s):

- sigmoid_f16: 0.016 -> 1.54 (96x)
- tanh_f16: 0.018 -> 1.61 (92x)
- hardswish_f16: 0.051 -> 9.46 (186x)
- leaky_relu_f16: 0.96 -> 10.4 (11x; generic baseline is unexpectedly fast)
- silu_f16: 0.20 -> 0.93 (4.6x)
- gelu_f16: 0.11 -> 0.75 (6.7x)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

czoli1976 force-pushed the feat/avx512-activations-f16 branch from 68ff147 to a4cd56b on May 28, 2026 11:09

czoli1976 mentioned this pull request May 28, 2026

linalg/x86_64: AVX-512_FP16 native f16 hardswish kernel #10

Open

7 tasks

czoli1976 mentioned this pull request May 28, 2026

linalg/arm64: NEON-fp16 hardswish_f16, silu_f16, gelu_f16 kernels #12

Open

7 tasks

czoli1976 force-pushed the feat/avx512-activations branch from 55c21d0 to 7cb4bd7 on May 29, 2026 08:13

czoli1976 force-pushed the feat/avx512-activations-f16 branch from a4cd56b to a921dd9 on May 29, 2026 08:13

kali and others added 23 commits May 29, 2026 10:25

fmt

5ca8dab

setup deny for cli

2614eea

fmt

1d6d1aa

release 0.23.0-dev.6

9dcf880

post-release v0.23.0-pre

3ed1478

post-release 0.23.0-pre

4ffe58b

changelog

9398b06

release 0.23.0

bf90b4b

post-release 0.23.1-pre

688b476

Merge branch 'main' into feat/int8-sdot-kernel

a468525

kali and others added 18 commits June 2, 2026 09:50

Merge pull request sonos#2278 from czoli1976/feat/int8-sdot-kernel

1e5841f

arm64 SDOT int8 matmul kernel (FEAT_DotProd) + PackedI8K4 packing

onnx/gqa: simplify local_window_size clamp with .max(0) (review sonos#2327)

14a8024

Per @kali's review: the two-arm match was a convoluted way to clamp negatives to 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

no need to freeze in Runnable::run (sonos#2328)

4e8bd21

core/ops/fft: Stft::axes_mapping unlocks pulsification on non-STFT axes (sonos#2331)

8398ec0

pulse/ops/array: linearity-checked per-pulse size for MultiBroadcastTo stream axis

e68a35d

docs: add CLAUDE.md with contributor rules

986ad7c

Curated behavioral subset of AGENTS.md (fmt, commit/comment style, model-edit tooling, public API, test placement, PR etiquette) inlined so it lands in the auto-loaded context; AGENTS.md stays the full reference.

kali force-pushed the feat/avx512-activations-f16 branch from a921dd9 to c696d59 on June 4, 2026 11:24

czoli1976 wants to merge 41 commits into feat/avx512-activations from feat/avx512-activations-f16

4 participants
