linalg/x86_64: add AVX-512 f16 element-wise activations #8
Open
czoli1976 wants to merge 41 commits into
czoli1976 added a commit that referenced this pull request May 28, 2026
Adds a native f16 hardswish kernel using avx512fp16 ISA (Sapphire Rapids /
Granite Rapids / later Intel). 128 f16 lanes per iteration via 4 zmm of 32 f16
each, processed with vaddph / vminph / vmaxph / vmulph — no f32 round-trip,
no vcvtph2ps/vcvtps2ph at the IO boundary.
Wired through a new `plug_avx512fp16` step that runs after `plug_avx512f` on
hosts where `is_x86_feature_detected!("avx512fp16")` is true. The f32-roundtrip
hardswish_f16 kernel from `act_f16.rs` remains in place as the avx512f-only
fallback (Skylake-X, Cascade Lake, Ice Lake server prior to fp16 extension).
Bench on Sapphire Rapids (n=1024, single thread, Criterion):
hardswish_f16:
  generic              52.3 Melem/s
  avx512_f32roundtrip  8.71 Gelem/s  (current #8 path)
  avx512fp16_native    31.6 Gelem/s  (this PR, 3.62× over the roundtrip)
A native leaky_relu_f16 kernel is also included but NOT wired — on Sapphire
Rapids it benched 38% slower than the f32-roundtrip version (5.85 vs 9.44
Gelem/s). The two-op-per-element compute path (vmulph + vmaxph) does not
saturate the FP16 execution port the same way the equivalent f32 ops saturate
the FP32 ports. Kernel is correct (4/4 frame tests pass, including proptest
against the f16 reference); kept in the source for future revisit on different
fp16 uarchs where the comparison might flip.
Tests: linalg 2845 passed, 0 failed (+4 new frame tests). Cross-arch
`cargo check` clean on aarch64-unknown-linux-gnu and wasm32-unknown-unknown
(plug_avx512fp16 is x86_64-only and feature-gated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
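For readers who want the math behind the kernel, here is a minimal scalar sketch of hardswish, `x * clamp(x + 3, 0, 6) / 6`. It uses f32 for portability (stable Rust has no native f16 type), and the mapping of each step to an instruction in the comments is an assumption about one plausible ordering, not a description of the kernel in this commit. The names `hardswish_scalar` and `hardswish_in_place` are hypothetical.

```rust
/// Scalar hardswish reference: x * clamp(x + 3, 0, 6) / 6.
/// Assumption: f32 stands in for f16 here; the native kernel works on f16 lanes.
pub fn hardswish_scalar(x: f32) -> f32 {
    let shifted = x + 3.0; // add step (vaddph in a vector kernel, assumed mapping)
    let clamped = shifted.min(6.0).max(0.0); // upper then lower clamp (vminph, vmaxph)
    x * clamped * (1.0 / 6.0) // multiply steps (vmulph)
}

/// Apply the scalar reference to a whole slice in place.
pub fn hardswish_in_place(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = hardswish_scalar(*x);
    }
}
```

A reference like this is what a vector kernel would typically be compared against in a frame test, within a tolerance suited to f16.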
czoli1976 added a commit that referenced this pull request May 29, 2026
runtime_for_name("gpu")        → first GPU backend whose `check()` passes (metal, then cuda); error if none are available.
runtime_for_name("gpu-or-cpu") → same lookup, but falls through to the `default` CPU runtime instead of erroring.
No new mechanism — both names walk the existing inventory and use each
backend's existing `check()` to decide availability. Backend-specific
names (`cuda`, `metal`) still work as before.
…or_name The CPU runtime now reports its own name as `cpu` (which is what it is), so `list-runtimes` shows `cpu`, `cuda`, `metal` … instead of the misleading `default`. Back-compat for callers passing `default` is handled by a one-line alias in `runtime_for_name` rather than by registering two runtimes or by polluting the trait — the alias only affects name lookup, not the inventory.
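The lookup rules from the two commits above can be sketched as follows. This is an illustrative model only: `Backend`, its fields, and `resolve_runtime` are hypothetical names, `available` stands in for the result of each backend's `check()`, and the slice order stands in for the inventory order (metal before cuda). It assumes the CPU runtime is already named `cpu` with `default` kept as a lookup-only alias.

```rust
/// Hypothetical stand-in for one backend entry in the inventory.
pub struct Backend {
    pub name: &'static str,
    pub is_gpu: bool,
    /// Stands in for the result of the backend's `check()`.
    pub available: bool,
}

/// Resolve a requested runtime name to the name of a registered backend.
pub fn resolve_runtime<'a>(inventory: &'a [Backend], name: &str) -> Result<&'a str, String> {
    // Back-compat alias: only the lookup key changes, the inventory is untouched.
    let name = if name == "default" { "cpu" } else { name };
    // First GPU backend, in inventory order, that reports itself available.
    let first_gpu = || inventory.iter().find(|b| b.is_gpu && b.available).map(|b| b.name);
    match name {
        "gpu" => first_gpu().ok_or_else(|| "no GPU runtime available".to_string()),
        "gpu-or-cpu" => Ok(first_gpu().unwrap_or("cpu")),
        other => inventory
            .iter()
            .find(|b| b.name == other && b.available)
            .map(|b| b.name)
            .ok_or_else(|| format!("unknown or unavailable runtime: {other}")),
    }
}
```

With an inventory of `cpu`, an unavailable `metal`, and an available `cuda`, both `gpu` and `gpu-or-cpu` resolve to `cuda`, while `default` resolves to `cpu`.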
The `tensorflow` 0.21.0 crate (Rust binding for libtensorflow) was only pulled in behind the dead `conform` cargo feature — which gated `tract compare --tf` (compare tract output against running on libtensorflow on the same model). The feature isn't enabled in any GitHub workflow; only a stranded `.travis/tf.sh` ever ran it. The upstream `tensorflow` crate hasn't shipped since 2023-08-15 and pins to rust-protobuf 2.27.x, which trips RUSTSEC-2024-0437. Drop the feature and all its plumbing. Tract's own `.pb` parsing (used by `-t transformers_detect_all` and the `tf` cargo feature in tract-cli) goes through prost and is unaffected — the `tract-tensorflow` crate stays, just without the libtensorflow runtime. Cargo.lock shrinks by ~350 lines as a side-effect.
The LayerNorm op's `wire` expansion casts `normalized` back to
fact.datum_type *before* applying scale/bias, then multiplies that
result with `cast_scale` (which is still in self.datum_type, F32).
With F16 inputs this becomes F16 × F32, whose output is downgraded to
F32 by `mul()`. The inference rule then asserts
`outputs[0].datum_type == inputs[0].datum_type` (F16) against the
actual F32 output, failing `into_typed()` with:
Output mismatch after rewiring expansion for output #0:
expected 1,256,384,F16 got 1,256,384,F32
Fix: defer the cast back to fact.datum_type until after all scale/bias
operations. Now the expansion stays entirely in self.datum_type (F32)
through normalized × scale + bias, and casts only the final result.
Behavior is unchanged for F32 inputs (the final cast is a no-op when
fact.datum_type == self.datum_type).
Reproduced with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
exported via `optimum.exporters.onnx.main_export(..., dtype="fp16")`
and loaded with `into_optimized().into_runnable()`.
The single-thread MMM tile walk used a naive nested loop, re-streaming the full inner operand (all of A in col-outer / B in row-outer) per panel at large k, which is memory/L1-bound. The multithread path already 2D-blocks the panel grid (chunk_grid); this brings the same blocking to the single-thread path, with the block edge cache-derived (detected L2/3, conservative 256 KiB fallback) so it stays L2-resident across hardware and never over-blocks a cache it cannot see. Bit-identical: it only reorders independent tiles (each computes its full-k reduction into a disjoint C region). The block-edge floor of 1 degrades exactly to the naive loop; the cap of 16 matches the multithread chunk_grid blocking already shipped on all platforms. Frame-level, so all kernels benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV / multithreaded shapes are unchanged. Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against the naive reference (the existing frame proptests only reach 3 panels). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
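A rough sketch of the reordering described above, under simplifying assumptions: the tile grid is modeled as `(m, n)` panel indices, the block edge is passed in rather than derived from cache sizes, and `naive_walk` / `blocked_walk` are hypothetical names. It only shows the visiting order; each tile's full-k reduction is left out.

```rust
/// Naive walk: column-outer, row-inner over an m_panels x n_panels tile grid.
pub fn naive_walk(m_panels: usize, n_panels: usize) -> Vec<(usize, usize)> {
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for n in 0..n_panels {
        for m in 0..m_panels {
            order.push((m, n));
        }
    }
    order
}

/// Blocked walk: visit the same tiles in `block` x `block` squares so the
/// panels touched inside one square can stay cache-resident.
pub fn blocked_walk(m_panels: usize, n_panels: usize, block: usize) -> Vec<(usize, usize)> {
    // Floor of 1 (degrades to the naive order) and cap of 16, as in the commit text.
    let block = block.clamp(1, 16);
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for n0 in (0..n_panels).step_by(block) {
        for m0 in (0..m_panels).step_by(block) {
            for n in n0..(n0 + block).min(n_panels) {
                for m in m0..(m0 + block).min(m_panels) {
                    order.push((m, n));
                }
            }
        }
    }
    order
}
```

Because every tile is visited exactly once in both walks and tiles write disjoint output regions, the result is the same regardless of order, which is the bit-identical property claimed above.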
The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant (it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt --check` failed in CI. Pure formatting; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Passing 'v0.24.0' to post-release.sh writes 'version = "v0.24.0"' into every Cargo.toml — invalid semver, breaks the workspace, easy to do by muscle memory because the git tag does carry the 'v' prefix. Bail out early in both release.sh and post-release.sh when the argument doesn't match an unprefixed semver.
actions/checkout runs with persist-credentials: false, so the bare 'git push origin gh-pages' had no auth and failed with 'could not read Username'. Use the workflow's GITHUB_TOKEN in the remote URL instead — keeps zizmor happy while letting the deploy step push.
arm64simd_mmm_i32_8x8_dot: an int8->i32 8x8 matmul kernel using SDOT (FEAT_DotProd, ARMv8.2), ~4x the SMLAL 8x8 at the matmul level. Same v16..v31 tile layout as the SMLAL 8x8, so it reuses the existing i32 fuse/store/q_scale machinery, and consumes the K=4-inner PackedI8K4 packing now upstream (sonos#2281).
- Gated on has_dotprod() (Apple M1+/A11+; Linux HWCAP_ASIMDDP). TRACT_DOTPROD_DISABLE=1 forces the SMLAL 8x8 fallback so callers can A/B on one binary.
- Wired into qmmm_i32: int8 matmul/conv pick SDOT when FEAT_DotProd is present, SMLAL 8x8 otherwise. Relies on the merged dispatch fix (sonos#2277) to route 2D int8 matmuls to a matrix kernel instead of the 64x1 GEMV.
- Adds linalg/benches/qmmm_i8.rs (SDOT vs SMLAL microbench).
The kernel is compiled in a separate cc::Build step gated on a build.rs assembler probe (assembler_supports_dotprod). Old assemblers such as binutils 2.28 on Debian stretch cannot encode `sdot` and fail the probe; the `tract_arm64_dotprod` cfg is then not set, the kernel is omitted, and dispatch falls back to the SMLAL 8x8 i32 kernel. Follows the same pattern as the existing assembler_supports_sme probe.
Bit-exact vs the SMLAL kernel: linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale), core int8 matmul 25/25.
Apple M4 e2e (kernel unchanged from the original PR): MiniLM 44.4->24.8 ms (1.79x), InceptionV1 51.6->28.4 ms (1.82x).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…8x8) Route qmmm_i32 through VPDPBUSD when AVX-512 VNNI is available, replacing the AVX2 per-K widening-multiply inner loop. Consumes the existing K=4-inner PackedI8K4 layout; A is offset by +128 for VPDPBUSD's u8*s8 form and the 128*sum_k(B) bias is removed per output column, so the i32 accumulators stay bit-identical to the AVX2 path and the whole quantization epilogue is reused. Runtime-gated via where(AVX512VNNI); non-VNNI x86 keeps the AVX2 fallback. Includes a vnni_i32 microbench (VNNI vs AVX2 int8). The kernel lives in its own x86_64/avx512vnni/ subdirectory and is compiled in a separate cc::Build step gated on a build.rs assembler probe (assembler_supports_avx512vnni). Old assemblers such as binutils 2.28 on Debian stretch cannot encode `vpdpbusd ymm` and will fail the probe; the `tract_avx512vnni` cfg is then not set and the kernel is omitted entirely, with dispatch falling back to the AVX2 i32 path. Follows the same pattern as the existing SME (assembler_supports_sme) and SVE (compiler_supports_sve) probes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
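The +128 offset and bias removal described above rest on the identity sum_k (a_k + 128)·b_k − 128·sum_k b_k = sum_k a_k·b_k. A small scalar sketch of that arithmetic follows; the function names are hypothetical, and it models only the integer math, not the packed layout or the instruction itself.

```rust
/// Reference: plain signed i8 x i8 dot product accumulated in i32.
pub fn dot_s8s8(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// u8 x s8 form: shift A by +128 into the u8 range, accumulate,
/// then subtract the 128 * sum(B) bias the shift introduced.
pub fn dot_u8s8_with_bias_fix(a: &[i8], b: &[i8]) -> i32 {
    let acc: i32 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| {
            let xu = (x as i32 + 128) as u8; // always within 0..=255
            xu as i32 * y as i32
        })
        .sum();
    let sum_b: i32 = b.iter().map(|&y| y as i32).sum();
    acc - 128 * sum_b
}
```

Since 128·sum_k(B) depends only on B, it can be computed once per output column, which matches the per-column bias removal mentioned in the commit message.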
When an ONNX LSTM/GRU/RNN exposes its full recurrent state both as input and output (initial_h + Y_h, plus initial_c + Y_c for LSTM), the caller manages state across calls. Set Scan::external_state in that case so the existing declutter_single_loop pass can inline a single-iteration Scan (seq_len == 1) — the streaming / autoregressive-decoder regime where the one-iteration Scan is pure orchestration overhead. Previously external_state was only reachable via the manual force_scan_external_state transform, so streaming RNNs carried a dead Scan on every call. Inlining is sound here because the body's State input is fed from the outer (caller-supplied) input each call (see issue sonos#2157). Measured on DTLN (an LSTM-heavy streaming denoiser): -8% end-to-end, output unchanged at 110.47 dB / Pearson 1.00000 vs the native reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t flag The previous importer heuristic set external_state whenever a GRU/LSTM node carried initial_h/Y_h (and initial_c/Y_c). That mis-fires: DFN3's GRU nodes also carry initial_h/Y_h, but their state is carried internally by tract under pulse, not by the caller — so inlining the single-iteration Scan would break it (the 0.23 regression kali flagged). Move the decision into declutter_single_loop, which has the whole graph: inline a single-iteration Scan only when every recurrent state has a last-value output that reaches a model output, i.e. the caller can observe the updated state and thread it back. Adds outlet_reaches_model_output. DTLN (state feeds a model output) still inlines, output unchanged at 110.47 dB / Pearson 1.00000. DFN3 df_dec (Y_h reaches no model output, only coefs) is not inlined. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
De-orphan and fix the latent zmm sigmoid/tanh kernels (tail-loop stride bugs causing OOB stores for lengths not a multiple of 64), and add AVX-512 hardswish, leaky_relu, plus silu/gelu as compositions over the AVX-512 sigmoid/tanh. Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA/generic path. Measured on Cascade Lake (single-thread): sigmoid 1.24x and tanh 1.29x over the existing FMA paths; hardswish/leaky_relu/silu/gelu 5-21x over the generic scalar paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
arm64 SDOT int8 matmul kernel (FEAT_DotProd) + PackedI8K4 packing
For models trained with sliding-window attention (Mistral, Gemma-style local/global): a fixed-capacity ring buffer that overwrites the oldest slot on append, so decode runs at CONSTANT memory + per-step cost regardless of context length, losslessly (the model is trained to attend only within the window). Cheap because decode attention is ORDER-INVARIANT over keys (O = Σ softmax_j·V_j is unchanged under a (K,V) permutation), so the ring buffer never needs un-rotation — the consumer attends over the W physical slots as-is. Validated: holds the last-W as a set (incl. prefill chunk > window); windowed attention == ordered last-W attention (close, float summation order); memory bounded at W. Companion to the in-place cache (sonos#2321) = 'in-place cache with a cap + wraparound'. 3 tests, fmt+clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
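As a rough illustration of the ring-buffer behaviour described above, the sketch below keeps the last `capacity` appended items as a set by overwriting the oldest physical slot. `RingKv` and its methods are hypothetical names, and a generic `T` stands in for a K or V row; this is not the cache type from the commit.

```rust
/// Fixed-capacity ring buffer: once full, each append overwrites the oldest slot.
pub struct RingKv<T> {
    slots: Vec<T>,
    capacity: usize,
    /// Physical slot the next append writes to once the buffer is full.
    next: usize,
}

impl<T> RingKv<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window must be non-zero");
        Self { slots: Vec::with_capacity(capacity), capacity, next: 0 }
    }

    pub fn append(&mut self, item: T) {
        if self.slots.len() < self.capacity {
            self.slots.push(item); // still filling: memory grows up to the window
        } else {
            self.slots[self.next] = item; // full: overwrite the oldest entry
        }
        self.next = (self.next + 1) % self.capacity;
    }

    /// Slots in physical order; no un-rotation is performed.
    pub fn physical(&self) -> &[T] {
        &self.slots
    }
}
```

After appending 1 through 5 with a window of 3, the physical order is `[4, 5, 3]`: the last three items are present, just rotated, which is enough when the consumer is order-invariant over keys.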
The com.microsoft GQA op carries the sliding window as a local_window_size attribute (verified: real Mistral-v0.1 exports encode it this way, all layers =4096) — not as an explicit mask. tract was rejecting it outright, so those models failed to import. Accept it and apply it as a banded causal mask in the existing concrete-shape mask path: query i attends to key j iff j <= i AND i - j < window. window=0 stays plain causal. Symbolic seq lengths bail (a static band can't be built; is_causal would silently widen to full attention). windowed_causal_mask helper + unit test (band correctness). Pairs with the bounded ring-buffer KV cache for the decode-side efficiency (separate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
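The banding rule above (query i attends to key j iff j <= i and i - j < window, with window = 0 meaning plain causal) can be sketched as a boolean mask. The function name is hypothetical, and the sketch assumes queries and keys share the same position indexing with no past-key offset; it is not the `windowed_causal_mask` helper itself.

```rust
/// Build a q_len x k_len mask where `true` means "query i may attend to key j".
/// Assumption: query i and key j use the same absolute positions (no past offset).
pub fn banded_causal_mask(q_len: usize, k_len: usize, window: usize) -> Vec<Vec<bool>> {
    (0..q_len)
        .map(|i| {
            (0..k_len)
                // `j <= i` is checked first, so `i - j` cannot underflow.
                .map(|j| j <= i && (window == 0 || i - j < window))
                .collect()
        })
        .collect()
}
```

With `window = 2`, query 3 sees only keys 2 and 3; with `window = 0` it sees keys 0 through 3.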
Stateful fused op owning K/V sliding-window ring buffers: each decode step appends K_new/V_new and attends Q over the (<=window) bounded cache. The bounded cache IS the sliding window, so attending over it equals windowed attention (causal=false: every cached key is within the current query's window) -> constant memory + per-step cost, losslessly. Op/EvalOp/TypedOp + OpState + freeze/unfreeze. Validated: window_sdpa_decode_matches_last_w_in_model builds a real TypedModel, runs it through tract's engine (into_runnable/spawn/run) over 15 decode steps with window=5, and matches full attention over the last-W each step, cache bounded to W throughout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
WindowKvSdpaTransform { window }: strips the GQA broadcast chain then fuses
{DynKeyValueCache(K), DynKeyValueCache(V), Sdpa} -> WindowKvSdpa{window} (window threaded
via the Rewriter context), so an imported decode model uses the bounded sliding-window
cache. The window comes from the model (GQA local_window_size / config), supplied to the
transform. Validated: transform_fuses_cache_sdpa_to_windowed_decode (caches+Sdpa removed,
rewritten model does correct windowed decode vs full-attn-over-last-W).
NNEF ser/de (tract_transformers_window_kv_sdpa, registered) + round-trip test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirrors full.yml + cross-platform.yml. The workflow already runs on `pull_request:`; this adds the manual dispatch path for testing a specific PR (including from a fork) from the Actions tab. A new `prepare` job derives `test_ref` from the `pr_number` input or from the triggering pull_request, and every job now checks out at that ref.
…os#2332) The transform conflated two concerns: flipping external_state (a fixup for NNEF artifacts predating the flag) and substituting the scan-axis symbol with 1 model-wide. The latter is the caller's per-call seq=1 contract, needed only for declutter_single_loop's separate iters==1 gate, not implied by external state. Keep the transform flag-only; harnesses now drive inlining with an explicit -t set_symbols (T / TARGETS__TIME = 1) alongside the flag.
…omment (sonos#2334) Supersedes dependabot sonos#2333. Pins the v6.1.2 commit (acca2b1b) and replaces the floating '# v6' comment with '# v6.1.2'. zizmor's unpinned-uses flags a hash pin whose comment tag no longer resolves to the pinned SHA; a major-only tag drifts on every patch release, so use the exact version tag (immutable, matches the SHA).
* build: sync Cargo.lock workspace versions to 0.23.1-pre
post-release 0.23.1-pre (688b476) bumped the crate manifests but left Cargo.lock at 0.23.0, so every cargo build rewrote these 24 workspace-member version entries and showed a spurious modified Cargo.lock.
* release: sync Cargo.lock in post-release version bump
post-release.sh edited manifests via tomato (which never invokes cargo) and committed without regenerating Cargo.lock, shipping a lock that mismatched the bumped versions. Add 'cargo update --workspace' before the commit. release.sh avoids this incidentally because cargo publish re-resolves the lock first.
com.microsoft.RotaryEmbedding is identical math to the standardized ai.onnx op but orders its inputs (input, position_ids, cos, sin). tract resolves ops by name regardless of domain, so make the single handler domain-aware and remap inputs accordingly. Rejects the contrib-only scale != 1.0 and is_packed_batching attributes with clear errors. Verified bit-exact against onnxruntime (3D, 4D, interleaved); ai.onnx RotaryEmbedding conformance unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Curated behavioral subset of AGENTS.md (fmt, commit/comment style, model-edit tooling, public API, test placement, PR etiquette) inlined so it lands in the auto-loaded context; AGENTS.md stays the full reference.
The single "default to none" rule was suppressing doc comments along with inline narration. Scope the austerity to inline comments and add a doc-comment section that encourages concise item docs (contract, inputs, rule interactions) while keeping the same no-benchmarks/no-history rule.
…c + seq-len lowering heuristic P·V is computed as one contiguous tile GEMM (`s.dot(&vblock)`) instead of `head_dim` strided per-column dots; the strided column access defeated vectorization. Bit-exact (max_abs = 0 vs a naive softmax(QKᵀ·scale)·V ref). The independent (batch, q-head) tasks now run across cores on rayon's global pool — heads share only read-only Q/K/V and write disjoint output slices. Disable with TRACT_FLASH_SDPA_ST=1; single-threaded on wasm. The op scales ~5x across an Apple M1 Pro's 6 performance cores (compute-bound, not memory- bound in that range). Sdpa::codegen gains a sequence-length heuristic: an f32 SDPA whose K/V length is below TRACT_FLASH_SDPA_MIN_SEQ_LEN lowers to the decomposed matmul+softmax path instead of FlashSdpaOp. Default 0 keeps flash for every length (with head parallelism it beat the decomposed path at every size measured, 128–4096); raise it on hosts where short-sequence decompose wins. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add six f16 element-wise activations on x86 AVX-512: sigmoid_f16, tanh_f16,
hardswish_f16, leaky_relu_f16, silu_f16, gelu_f16. Each kernel chunks the
input through a 64-byte-aligned f32 scratch (CHUNK=256), dispatches to the
matching f32 AVX-512 kernel (the avx512_sigmoid_f32 / avx512_tanh_f32
wrappers, or the act:: hardswish / leaky_relu / silu / gelu kernels), and
converts back to f16. silu and gelu compose sigmoid_f32 / tanh_f32 with the
final combine done in f32.
The f16 <-> f32 conversion is driven by vcvtph2ps / vcvtps2ph via std::arch
intrinsics (cvt_f16_to_f32 / cvt_f32_to_f16 helpers); rustc + LLVM do not
autovectorize the scalar f16::to_f32 / f16::from_f32 loops, which is why a
naive port leaves AVX-512 stuck at ~7 Melem/s.
Wires into Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16 from
plug_avx512f; non-AVX512 x86 keeps the generic scalar f16 kernels. Validated
against the generic H<Op>8 reference via the existing *_frame_tests! macros
at SuperApproximate tolerance, which covers the precision delta between
scalar f16 arithmetic and f32-internal computation.
Measured on Cascade Lake (single-thread, throughput Gelem/s):
- sigmoid_f16: 0.016 -> 1.54 (96x)
- tanh_f16: 0.018 -> 1.61 (92x)
- hardswish_f16: 0.051 -> 9.46 (186x)
- leaky_relu_f16: 0.96 -> 10.4 (11x; generic baseline is unexpectedly fast)
- silu_f16: 0.20 -> 0.93 (4.6x)
- gelu_f16: 0.11 -> 0.75 (6.7x)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
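The chunk-through-scratch pattern described in this commit can be sketched generically. This is an assumption-laden model, not the code in `act_f16.rs`: the conversions and the kernel are passed in as closures, `map_via_f32_scratch` and `Scratch` are hypothetical names, and a generic `T` stands in for f16 because stable Rust has no native f16 type.

```rust
pub const CHUNK: usize = 256;

/// 64-byte-aligned f32 scratch buffer (one cache line / one zmm-sized alignment).
#[repr(align(64))]
struct Scratch([f32; CHUNK]);

/// Convert each chunk to f32, run an f32 kernel on it, and convert back in place.
pub fn map_via_f32_scratch<T: Copy>(
    data: &mut [T],
    to_f32: impl Fn(T) -> f32,
    from_f32: impl Fn(f32) -> T,
    kernel: impl Fn(&mut [f32]),
) {
    let mut scratch = Scratch([0.0; CHUNK]);
    for chunk in data.chunks_mut(CHUNK) {
        // The last chunk may be shorter than CHUNK; only use that prefix.
        let buf = &mut scratch.0[..chunk.len()];
        for (dst, src) in buf.iter_mut().zip(chunk.iter()) {
            *dst = to_f32(*src);
        }
        kernel(&mut *buf);
        for (dst, src) in chunk.iter_mut().zip(buf.iter()) {
            *dst = from_f32(*src);
        }
    }
}
```

In the real kernels the two conversion loops are the part the commit replaces with `vcvtph2ps` / `vcvtps2ph` intrinsics; here they are plain scalar loops to keep the sketch portable.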
kali pushed a commit that referenced this pull request Jun 8, 2026
kali pushed a commit to sonos/tract that referenced this pull request Jun 8, 2026
Summary
Stacked on PR #4 (feat/avx512-activations). Adds six f16 element-wise activations on x86 AVX-512:
- `sigmoid_f16`, `tanh_f16` (compose over PR #4's `avx512_sigmoid_f32` / `avx512_tanh_f32`)
- `hardswish_f16`, `leaky_relu_f16` (compose over PR #4's f32 kernels of the same op)
- `silu_f16`, `gelu_f16` (compose at the kernel level: f32 polynomial + scalar combine in f32, mirroring PR #4's f32 versions)

Each kernel chunks the input through a 64-byte-aligned f32 scratch (CHUNK = 256), runs the matching f32 AVX-512 kernel, and converts back to f16. f16↔f32 conversion is driven by `vcvtph2ps` / `vcvtps2ph` via `std::arch` intrinsics (helpers `cvt_f16_to_f32` / `cvt_f32_to_f16`). rustc + LLVM do NOT autovectorize the scalar `f16::to_f32` / `f16::from_f32` loops, which is why a naive port leaves AVX-512 stuck at ~7 Melem/s. The intrinsics-based path gets back to the per-op AVX-512 ceiling.

Wires into `Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16` from `plug_avx512f`; non-AVX512 x86 keeps the generic scalar f16 kernels.

Bench (local, single-thread, Cascade Lake, throughput Gelem/s):

`leaky_relu` and `silu` show smaller ratios because the generic baseline is already faster than the sigmoid/tanh polynomial paths (leaky_relu is just a max + multiply; silu uses a generic sigmoid that's faster than HSigmoid8 in isolation).

Test plan
- `cargo test --release -p tract-linalg`: 2708 passed, 0 failed (21 new f16 activation tests: 3-7 cases per op, including proptest)
- `cargo bench --bench activations_avx512_f16`: numbers above
- `avx512f` gating in `plug_avx512f`

Dependencies
This PR is stacked on PR #4 (feat/avx512-activations). It imports `act::x86_64_avx512_hardswish_f32_64n`, `act::x86_64_avx512_leaky_relu_f32_64n`, `avx512_sigmoid_f32`, `avx512_tanh_f32`. If PR #4 lands upstream first, this PR rebases trivially. If shipping standalone, it would also need PR #4's content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Generated by Claude Code