
linalg/x86_64: add AVX-512 f16 element-wise activations#8

Open
czoli1976 wants to merge 41 commits into feat/avx512-activations from feat/avx512-activations-f16


@czoli1976

Owner

Summary

Stacked on PR #4 (feat/avx512-activations). Adds six f16 element-wise activations on x86 AVX-512: sigmoid_f16, tanh_f16, hardswish_f16, leaky_relu_f16, silu_f16, and gelu_f16.

Each kernel chunks the input through a 64-byte-aligned f32 scratch (CHUNK = 256), runs the matching f32 AVX-512 kernel, and converts back to f16. f16↔f32 conversion is driven by vcvtph2ps / vcvtps2ph via std::arch intrinsics (helpers cvt_f16_to_f32 / cvt_f32_to_f16) — rustc + LLVM do NOT autovectorize the scalar f16::to_f32 / f16::from_f32 loops, which is why a naive port leaves AVX-512 stuck at ~7 Melem/s. The intrinsics-based path gets back to the per-op AVX-512 ceiling.
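
The chunked round-trip can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: f16 values are carried as raw `u16` bit patterns, the 256-bit F16C forms of `vcvtph2ps` / `vcvtps2ph` (`_mm256_cvtph_ps` / `_mm256_cvtps_ph`) stand in for the AVX-512 forms, and `map_f16_via_f32`, `Aligned`, and the closure-based `kernel` argument are hypothetical names rather than the `cvt_f16_to_f32` / `cvt_f32_to_f16` helpers.

```rust
#[cfg(target_arch = "x86_64")]
pub mod f16_roundtrip {
    use std::arch::x86_64::*;

    /// Elements staged per pass (CHUNK = 256 in the PR description).
    pub const CHUNK: usize = 256;

    /// 64-byte-aligned staging buffer (hypothetical helper type).
    #[repr(C, align(64))]
    struct Aligned<T>([T; CHUNK]);

    /// Applies an f32 `kernel` to f16 bit patterns in place.
    /// Returns false and leaves `data` untouched when AVX/F16C is missing.
    pub fn map_f16_via_f32(data: &mut [u16], kernel: impl FnMut(&mut [f32])) -> bool {
        if !(is_x86_feature_detected!("avx") && is_x86_feature_detected!("f16c")) {
            return false;
        }
        // Safety: the required CPU features were detected just above.
        unsafe { map_impl(data, kernel) };
        true
    }

    #[target_feature(enable = "avx,f16c")]
    unsafe fn map_impl(data: &mut [u16], mut kernel: impl FnMut(&mut [f32])) {
        let mut halves = Aligned([0u16; CHUNK]);
        let mut floats = Aligned([0f32; CHUNK]);
        for chunk in data.chunks_mut(CHUNK) {
            let n = chunk.len();
            // Round up to whole 8-lane vectors; CHUNK is a multiple of 8,
            // so `padded` never exceeds the staging buffers.
            let padded = (n + 7) & !7;
            halves.0[..n].copy_from_slice(chunk);
            for i in (0..padded).step_by(8) {
                unsafe {
                    // vcvtph2ps: 8 x f16 -> 8 x f32
                    let h = _mm_loadu_si128(halves.0.as_ptr().add(i) as *const __m128i);
                    _mm256_storeu_ps(floats.0.as_mut_ptr().add(i), _mm256_cvtph_ps(h));
                }
            }
            // Only the first n lanes hold real data.
            kernel(&mut floats.0[..n]);
            for i in (0..padded).step_by(8) {
                unsafe {
                    // vcvtps2ph: 8 x f32 -> 8 x f16, round to nearest even
                    let f = _mm256_loadu_ps(floats.0.as_ptr().add(i));
                    let h = _mm256_cvtps_ph::<_MM_FROUND_TO_NEAREST_INT>(f);
                    _mm_storeu_si128(halves.0.as_mut_ptr().add(i) as *mut __m128i, h);
                }
            }
            chunk.copy_from_slice(&halves.0[..n]);
        }
    }
}
```

Padding the tail to a whole vector keeps the conversion loops free of scalar remainders; the pad lanes are converted but never copied back.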

Wires into Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16 from plug_avx512f; non-AVX512 x86 keeps the generic scalar f16 kernels.

Bench (local, single-thread, Cascade Lake, throughput Gelem/s):

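| op | generic (Gelem/s) | AVX-512 (Gelem/s) | speedup |
| --- | --- | --- | --- |
| sigmoid_f16 | 0.016 | 1.54 | 96× |
| tanh_f16 | 0.018 | 1.61 | 92× |
| hardswish_f16 | 0.051 | 9.46 | 186× |
| leaky_relu_f16 | 0.96 | 10.4 | 11× |
| silu_f16 | 0.20 | 0.93 | 4.6× |
| gelu_f16 | 0.11 | 0.75 | 6.7× |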

leaky_relu and silu show smaller ratios because the generic baseline is already faster than the sigmoid/tanh polynomial paths (leaky_relu is just a max + multiply; silu uses a generic sigmoid that's faster than HSigmoid8 in isolation).

Test plan

  • cargo test --release -p tract-linalg — 2708 passed, 0 failed (21 new f16 activation tests: 3-7 cases per op, including proptest)
  • cargo bench --bench activations_avx512_f16 — numbers above
  • Non-AVX512 x86 hosts unchanged (fallback exercised via avx512f gating in plug_avx512f)

Dependencies

This PR is stacked on PR #4 (feat/avx512-activations). It imports act::x86_64_avx512_hardswish_f32_64n, act::x86_64_avx512_leaky_relu_f32_64n, avx512_sigmoid_f32, avx512_tanh_f32. If PR #4 lands upstream first, this PR rebases trivially. If shipping standalone, it would also need PR #4's content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Generated by Claude Code

@czoli1976 czoli1976 force-pushed the feat/avx512-activations-f16 branch from 68ff147 to a4cd56b Compare May 28, 2026 11:09
czoli1976 added a commit that referenced this pull request May 28, 2026
Adds a native f16 hardswish kernel using avx512fp16 ISA (Sapphire Rapids /
Granite Rapids / later Intel). 128 f16 lanes per iteration via 4 zmm of 32 f16
each, processed with vaddph / vminph / vmaxph / vmulph — no f32 round-trip,
no vcvtph2ps/vcvtps2ph at the IO boundary.
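
For reference, a scalar sketch of the standard hardswish definition, hardswish(x) = x * min(max(x + 3, 0), 6) / 6. The mapping of each step to vaddph / vmaxph / vminph / vmulph in the comments is an assumption about how the native kernel is laid out, and the function names are illustrative.

```rust
/// Scalar reference for hardswish(x) = x * min(max(x + 3, 0), 6) / 6.
pub fn hardswish(x: f32) -> f32 {
    let shifted = x + 3.0; // add step (vaddph in a vector kernel, assumed)
    let clamped = shifted.max(0.0).min(6.0); // clamp to [0, 6] (vmaxph, vminph)
    x * clamped / 6.0 // scale (a vector kernel would likely multiply by 1/6)
}

/// Applies the scalar reference in place over a slice.
pub fn hardswish_slice(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = hardswish(*x);
    }
}
```

A reference like this is what a vector kernel's frame tests would typically be compared against, within a tolerance suited to f16 precision.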

Wired through a new `plug_avx512fp16` step that runs after `plug_avx512f` on
hosts where `is_x86_feature_detected!("avx512fp16")` is true. The f32-roundtrip
hardswish_f16 kernel from `act_f16.rs` remains in place as the avx512f-only
fallback (Skylake-X, Cascade Lake, Ice Lake server prior to fp16 extension).
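
The layered plug order described above can be sketched like this. All types here (`Ops`, `Path`, `CpuFeatures`, the `fn` pointer kernel signature) are simplified, hypothetical stand-ins for tract's real dispatch tables, and the three kernels share one scalar body so the sketch stays portable. In practice the flags would come from `is_x86_feature_detected!`.

```rust
/// Hypothetical element-wise kernel signature for this sketch.
type Kernel = fn(&mut [f32]);

fn hardswish_generic(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = *x * (*x + 3.0).max(0.0).min(6.0) / 6.0;
    }
}
// Stand-ins for the two vector kernels; they reuse the scalar body here.
fn hardswish_f32_roundtrip(xs: &mut [f32]) {
    hardswish_generic(xs)
}
fn hardswish_fp16_native(xs: &mut [f32]) {
    hardswish_generic(xs)
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Path {
    Generic,
    Avx512F32Roundtrip,
    Avx512Fp16Native,
}

pub struct Ops {
    pub hardswish_f16: Kernel,
    pub path: Path,
}

/// Detected CPU features, passed in so the sketch runs on any host.
pub struct CpuFeatures {
    pub avx512f: bool,
    pub avx512fp16: bool,
}

fn plug_avx512f(ops: &mut Ops) {
    ops.hardswish_f16 = hardswish_f32_roundtrip;
    ops.path = Path::Avx512F32Roundtrip;
}

fn plug_avx512fp16(ops: &mut Ops) {
    ops.hardswish_f16 = hardswish_fp16_native;
    ops.path = Path::Avx512Fp16Native;
}

/// Later plug steps override earlier ones, so the most specific kernel wins.
pub fn build_ops(cpu: &CpuFeatures) -> Ops {
    let mut ops = Ops { hardswish_f16: hardswish_generic, path: Path::Generic };
    if cpu.avx512f {
        plug_avx512f(&mut ops);
    }
    if cpu.avx512fp16 {
        plug_avx512fp16(&mut ops);
    }
    ops
}
```

Because each step only overwrites entries it can improve, an avx512f-only host keeps the f32-roundtrip kernel without any extra branching at call time.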

Bench on Sapphire Rapids (n=1024, single thread, Criterion):
  hardswish_f16:
    generic              52.3 Melem/s
    avx512_f32roundtrip   8.71 Gelem/s   (current #8 path)
    avx512fp16_native    31.6 Gelem/s   (this PR, 3.62× over the roundtrip)

A native leaky_relu_f16 kernel is also included but NOT wired — on Sapphire
Rapids it benched 38% slower than the f32-roundtrip version (5.85 vs 9.44
Gelem/s). The two-op-per-element compute path (vmulph + vmaxph) does not
saturate the FP16 execution port the same way the equivalent f32 ops saturate
the FP32 ports. Kernel is correct (4/4 frame tests pass, including proptest
against the f16 reference); kept in the source for future revisit on different
fp16 uarchs where the comparison might flip.

Tests: linalg 2845 passed, 0 failed (+4 new frame tests). Cross-arch
`cargo check` clean on aarch64-unknown-linux-gnu and wasm32-unknown-unknown
(plug_avx512fp16 is x86_64-only and feature-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976 czoli1976 force-pushed the feat/avx512-activations branch from 55c21d0 to 7cb4bd7 Compare May 29, 2026 08:13
@czoli1976 czoli1976 force-pushed the feat/avx512-activations-f16 branch from a4cd56b to a921dd9 Compare May 29, 2026 08:13
czoli1976 added a commit that referenced this pull request May 29, 2026
kali and others added 23 commits May 29, 2026 10:25
runtime_for_name("gpu")        → first GPU backend whose `check()`
                                 passes (metal, then cuda); error if
                                 none are available.
runtime_for_name("gpu-or-cpu") → same lookup, but falls through to
                                 the `default` CPU runtime instead
                                 of erroring.

No new mechanism — both names walk the existing inventory and use each
backend's existing `check()` to decide availability.  Backend-specific
names (`cuda`, `metal`) still work as before.
…or_name

The CPU runtime now reports its own name as `cpu` (which is what it
is), so `list-runtimes` shows `cpu`, `cuda`, `metal` … instead of
the misleading `default`.

Back-compat for callers passing `default` is handled by a one-line
alias in `runtime_for_name` rather than by registering two runtimes
or by polluting the trait — the alias only affects name lookup, not
the inventory.
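
A toy sketch of the name lookup described in the two messages above: `gpu`, `gpu-or-cpu`, backend-specific names, and the `default` alias for `cpu`. The `Runtime` struct, the slice-based inventory, and the error strings are assumptions for illustration; only the lookup rules come from the commit text.

```rust
/// Hypothetical registry entry: a name plus an availability probe.
pub struct Runtime {
    pub name: &'static str,
    pub check: fn() -> bool,
}

/// GPU backends in the order stated by the commit message (metal, then cuda).
const GPU_ORDER: [&str; 2] = ["metal", "cuda"];

fn by_name<'a>(inventory: &'a [Runtime], name: &str) -> Option<&'a Runtime> {
    inventory.iter().find(|r| r.name == name)
}

fn first_available_gpu(inventory: &[Runtime]) -> Option<&Runtime> {
    GPU_ORDER
        .iter()
        .find_map(|gpu| inventory.iter().find(|r| r.name == *gpu && (r.check)()))
}

pub fn runtime_for_name<'a>(inventory: &'a [Runtime], name: &str) -> Result<&'a Runtime, String> {
    match name {
        // First GPU backend whose check passes, else an error.
        "gpu" => first_available_gpu(inventory)
            .ok_or_else(|| "no GPU runtime available".to_string()),
        // Same lookup, falling through to the CPU runtime.
        "gpu-or-cpu" => first_available_gpu(inventory)
            .or_else(|| by_name(inventory, "cpu"))
            .ok_or_else(|| "no runtime available".to_string()),
        // Back-compat alias: affects name lookup only, not the inventory.
        "default" => by_name(inventory, "cpu")
            .ok_or_else(|| "cpu runtime not registered".to_string()),
        other => by_name(inventory, other)
            .ok_or_else(|| format!("unknown runtime: {other}")),
    }
}
```

Keeping the alias inside the lookup function means listing the inventory still shows a single `cpu` entry.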
The `tensorflow` 0.21.0 crate (Rust binding for libtensorflow) was
only pulled in behind the dead `conform` cargo feature — which gated
`tract compare --tf` (compare tract output against running on
libtensorflow on the same model).  The feature isn't enabled in any
GitHub workflow; only a stranded `.travis/tf.sh` ever ran it.

The upstream `tensorflow` crate hasn't shipped since 2023-08-15 and
pins to rust-protobuf 2.27.x, which trips RUSTSEC-2024-0437.  Drop
the feature and all its plumbing.

Tract's own `.pb` parsing (used by `-t transformers_detect_all` and
the `tf` cargo feature in tract-cli) goes through prost and is
unaffected — the `tract-tensorflow` crate stays, just without the
libtensorflow runtime.  Cargo.lock shrinks by ~350 lines as a
side-effect.
The LayerNorm op's `wire` expansion casts `normalized` back to
fact.datum_type *before* applying scale/bias, then multiplies that
result with `cast_scale` (which is still in self.datum_type, F32).

With F16 inputs this becomes F16 × F32, whose output is downgraded to
F32 by `mul()`. The inference rule then asserts
`outputs[0].datum_type == inputs[0].datum_type` (F16) against the
actual F32 output, failing `into_typed()` with:

    Output mismatch after rewiring expansion for output #0:
    expected 1,256,384,F16 got 1,256,384,F32

Fix: defer the cast back to fact.datum_type until after all scale/bias
operations. Now the expansion stays entirely in self.datum_type (F32)
through normalized × scale + bias, and casts only the final result.

Behavior is unchanged for F32 inputs (the final cast is a no-op when
fact.datum_type == self.datum_type).
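
The type mismatch and its fix can be modelled with a toy type-propagation sketch. Everything here is illustrative: `DatumType` is a two-variant stand-in, `binary_output` encodes only the promotion rule quoted above (F16 × F32 yields F32), and `old_expansion` / `new_expansion` are hypothetical names that track datum types, not values.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DatumType {
    F16,
    F32,
}

/// Toy promotion rule from the commit message: mixed operands yield F32.
fn binary_output(a: DatumType, b: DatumType) -> DatumType {
    if a == b { a } else { DatumType::F32 }
}

/// A cast always produces the target type.
fn cast(_from: DatumType, to: DatumType) -> DatumType {
    to
}

/// Old order: cast `normalized` back to the input type, then scale and bias.
pub fn old_expansion(input: DatumType, op: DatumType) -> DatumType {
    let normalized = cast(op, input);
    let scaled = binary_output(normalized, op); // normalized x cast_scale
    binary_output(scaled, op) // + bias
}

/// New order: stay in the op's type throughout, cast only the final result.
pub fn new_expansion(input: DatumType, op: DatumType) -> DatumType {
    let normalized = op;
    let scaled = binary_output(normalized, op);
    let biased = binary_output(scaled, op);
    cast(biased, input)
}
```

With an F16 input and an F32 op, the old order ends in F32 while the inference rule expects F16; the new order ends in F16.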

Reproduced with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
exported via `optimum.exporters.onnx.main_export(..., dtype="fp16")`
and loaded with `into_optimized().into_runnable()`.
The single-thread MMM tile walk used a naive nested loop, re-streaming the
full inner operand (all of A in col-outer / B in row-outer) per panel at
large k, which is memory/L1-bound. The multithread path already 2D-blocks the
panel grid (chunk_grid); this brings the same blocking to the single-thread
path, with the block edge cache-derived (detected L2/3, conservative 256 KiB
fallback) so it stays L2-resident across hardware and never over-blocks a
cache it cannot see.

Bit-identical: it only reorders independent tiles (each computes its full-k
reduction into a disjoint C region). The block-edge floor of 1 degrades
exactly to the naive loop; the cap of 16 matches the multithread chunk_grid
blocking already shipped on all platforms. Frame-level, so all kernels
benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV /
multithreaded shapes are unchanged.
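
The reordering can be sketched as a visit order over the panel grid. This is a minimal illustration, assuming a row-outer naive walk and a square block edge; `blocked_tile_order` and `naive_tile_order` are hypothetical names, and the cache-derived choice of `block` is left to the caller. The floor of 1 and cap of 16 follow the commit text.

```rust
/// Naive row-outer walk over an m x n grid of panels.
pub fn naive_tile_order(m_panels: usize, n_panels: usize) -> Vec<(usize, usize)> {
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for i in 0..m_panels {
        for j in 0..n_panels {
            order.push((i, j));
        }
    }
    order
}

/// 2D-blocked walk: visit the grid block by block so the operands of one
/// block stay cache-resident while its tiles are computed.
pub fn blocked_tile_order(m_panels: usize, n_panels: usize, block: usize) -> Vec<(usize, usize)> {
    let block = block.clamp(1, 16); // floor 1 (naive loop), cap 16
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for bi in (0..m_panels).step_by(block) {
        for bj in (0..n_panels).step_by(block) {
            for i in bi..(bi + block).min(m_panels) {
                for j in bj..(bj + block).min(n_panels) {
                    order.push((i, j));
                }
            }
        }
    }
    order
}
```

Every tile is visited exactly once in both walks, which is why the result can be bit-identical: only the order of independent tiles changes.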

Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against
the naive reference (the existing frame proptests only reach 3 panels).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant
(it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt
--check` failed in CI. Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Passing 'v0.24.0' to post-release.sh writes 'version = "v0.24.0"' into
every Cargo.toml — invalid semver, breaks the workspace, easy to do by
muscle memory because the git tag does carry the 'v' prefix. Bail out
early in both release.sh and post-release.sh when the argument doesn't
match an unprefixed semver.
actions/checkout runs with persist-credentials: false, so the bare
'git push origin gh-pages' had no auth and failed with 'could not
read Username'. Use the workflow's GITHUB_TOKEN in the remote URL
instead — keeps zizmor happy while letting the deploy step push.
arm64simd_mmm_i32_8x8_dot: an int8->i32 8x8 matmul kernel using SDOT
(FEAT_DotProd, ARMv8.2), ~4x the SMLAL 8x8 at the matmul level. Same v16..v31
tile layout as the SMLAL 8x8, so it reuses the existing i32 fuse/store/q_scale
machinery, and consumes the K=4-inner PackedI8K4 packing now upstream (sonos#2281).

- Gated on has_dotprod() (Apple M1+/A11+; Linux HWCAP_ASIMDDP). TRACT_DOTPROD_DISABLE=1
  forces the SMLAL 8x8 fallback so callers can A/B on one binary.
- Wired into qmmm_i32: int8 matmul/conv pick SDOT when FEAT_DotProd is present,
  SMLAL 8x8 otherwise. Relies on the merged dispatch fix (sonos#2277) to route 2D int8
  matmuls to a matrix kernel instead of the 64x1 GEMV.
- Adds linalg/benches/qmmm_i8.rs (SDOT vs SMLAL microbench).

The kernel is compiled in a separate cc::Build step gated on a build.rs assembler
probe (assembler_supports_dotprod). Old assemblers such as binutils 2.28 on Debian
stretch cannot encode `sdot` and fail the probe; the `tract_arm64_dotprod` cfg is
then not set, the kernel is omitted, and dispatch falls back to the SMLAL 8x8 i32
kernel. Follows the same pattern as the existing assembler_supports_sme probe.

Bit-exact vs the SMLAL kernel: linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale),
core int8 matmul 25/25. Apple M4 e2e (kernel unchanged from the original PR):
MiniLM 44.4->24.8 ms (1.79x), InceptionV1 51.6->28.4 ms (1.82x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…8x8)

Route qmmm_i32 through VPDPBUSD when AVX-512 VNNI is available, replacing the
AVX2 per-K widening-multiply inner loop. Consumes the existing K=4-inner
PackedI8K4 layout; A is offset by +128 for VPDPBUSD's u8*s8 form and the
128*sum_k(B) bias is removed per output column, so the i32 accumulators stay
bit-identical to the AVX2 path and the whole quantization epilogue is reused.
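
The offset-and-correct identity behind this is sum_k (a_k + 128)·b_k − 128·sum_k b_k = sum_k a_k·b_k. A scalar sketch, with hypothetical function names and no claim about the kernel's actual register layout:

```rust
/// Plain signed int8 dot product accumulated in i32 (the reference result).
pub fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Same result through an unsigned-by-signed product, as VPDPBUSD requires:
/// shift A into u8 by +128, then subtract 128 * sum(B).
pub fn dot_u8s8_corrected(a: &[i8], b: &[i8]) -> i32 {
    let mut acc = 0i32;
    let mut sum_b = 0i32;
    for (&x, &y) in a.iter().zip(b) {
        let x_u8 = (x as i16 + 128) as u8; // maps -128..=127 onto 0..=255
        acc += x_u8 as i32 * y as i32;
        sum_b += y as i32;
    }
    acc - 128 * sum_b
}
```

Since sum(B) depends only on the output column, the correction can be computed once per column rather than per element.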

Runtime-gated via where(AVX512VNNI); non-VNNI x86 keeps the AVX2 fallback.
Includes a vnni_i32 microbench (VNNI vs AVX2 int8).

The kernel lives in its own x86_64/avx512vnni/ subdirectory and is compiled
in a separate cc::Build step gated on a build.rs assembler probe
(assembler_supports_avx512vnni). Old assemblers such as binutils 2.28 on
Debian stretch cannot encode `vpdpbusd ymm` and will fail the probe; the
`tract_avx512vnni` cfg is then not set and the kernel is omitted entirely,
with dispatch falling back to the AVX2 i32 path. Follows the same pattern as
the existing SME (assembler_supports_sme) and SVE (compiler_supports_sve)
probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an ONNX LSTM/GRU/RNN exposes its full recurrent state both as input
and output (initial_h + Y_h, plus initial_c + Y_c for LSTM), the caller
manages state across calls. Set Scan::external_state in that case so the
existing declutter_single_loop pass can inline a single-iteration Scan
(seq_len == 1) — the streaming / autoregressive-decoder regime where the
one-iteration Scan is pure orchestration overhead.

Previously external_state was only reachable via the manual
force_scan_external_state transform, so streaming RNNs carried a dead Scan
on every call. Inlining is sound here because the body's State input is fed
from the outer (caller-supplied) input each call (see issue sonos#2157).

Measured on DTLN (an LSTM-heavy streaming denoiser): -8% end-to-end, output
unchanged at 110.47 dB / Pearson 1.00000 vs the native reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t flag

The previous importer heuristic set external_state whenever a GRU/LSTM node
carried initial_h/Y_h (and initial_c/Y_c). That mis-fires: DFN3's GRU nodes
also carry initial_h/Y_h, but their state is carried internally by tract under
pulse, not by the caller — so inlining the single-iteration Scan would break
it (the 0.23 regression kali flagged).

Move the decision into declutter_single_loop, which has the whole graph:
inline a single-iteration Scan only when every recurrent state has a
last-value output that reaches a model output, i.e. the caller can observe the
updated state and thread it back. Adds outlet_reaches_model_output.
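
The reachability test can be sketched as a breadth-first search over a toy graph. The adjacency-list representation and the function name `reaches_model_output` are assumptions for illustration; tract's real `outlet_reaches_model_output` works on its own model types.

```rust
use std::collections::{HashSet, VecDeque};

/// `consumers[n]` lists the nodes fed by node `n`. Returns true when `start`
/// is a model output or feeds one, directly or transitively.
pub fn reaches_model_output(
    consumers: &[Vec<usize>],
    model_outputs: &[usize],
    start: usize,
) -> bool {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([start]);
    while let Some(node) = queue.pop_front() {
        if model_outputs.contains(&node) {
            return true;
        }
        // Expand each node once so cycles or shared consumers cannot loop.
        if seen.insert(node) {
            queue.extend(consumers[node].iter().copied());
        }
    }
    false
}
```

Under this rule a Scan would be inlined only if the check holds for every recurrent state's last-value output.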

DTLN (state feeds a model output) still inlines, output unchanged at
110.47 dB / Pearson 1.00000. DFN3 df_dec (Y_h reaches no model output, only
coefs) is not inlined.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
De-orphan and fix the latent zmm sigmoid/tanh kernels (tail-loop stride bugs
causing OOB stores for lengths not a multiple of 64), and add AVX-512
hardswish, leaky_relu, plus silu/gelu as compositions over the AVX-512
sigmoid/tanh. Runtime-gated on avx512f; non-AVX512 x86 keeps the FMA/generic
path.

Measured on Cascade Lake (single-thread): sigmoid 1.24x and tanh 1.29x over
the existing FMA paths; hardswish/leaky_relu/silu/gelu 5-21x over the generic
scalar paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kali and others added 18 commits June 2, 2026 09:50
arm64 SDOT int8 matmul kernel (FEAT_DotProd) + PackedI8K4 packing
For models trained with sliding-window attention (Mistral, Gemma-style local/global):
a fixed-capacity ring buffer that overwrites the oldest slot on append, so decode runs
at CONSTANT memory + per-step cost regardless of context length, losslessly (the model
is trained to attend only within the window).

Cheap because decode attention is ORDER-INVARIANT over keys (O = Σ softmax_j·V_j is
unchanged under a (K,V) permutation), so the ring buffer never needs un-rotation — the
consumer attends over the W physical slots as-is. Validated: holds the last-W as a set
(incl. prefill chunk > window); windowed attention == ordered last-W attention (close,
float summation order); memory bounded at W. Companion to the in-place cache (sonos#2321) =
'in-place cache with a cap + wraparound'. 3 tests, fmt+clippy clean.
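
A toy sketch of the ring buffer and the order-invariance argument, assuming scalar keys and values (a one-dimensional head) for brevity. `RingKv` and its methods are hypothetical names, not the types added by the commit.

```rust
/// Fixed-capacity KV ring buffer: once full, each append overwrites the
/// oldest slot.
pub struct RingKv {
    slots: Vec<(f32, f32)>,
    capacity: usize,
    next: usize,
}

impl RingKv {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self { slots: Vec::with_capacity(capacity), capacity, next: 0 }
    }

    pub fn append(&mut self, k: f32, v: f32) {
        if self.slots.len() < self.capacity {
            self.slots.push((k, v));
        } else {
            self.slots[self.next] = (k, v);
        }
        self.next = (self.next + 1) % self.capacity;
    }

    pub fn len(&self) -> usize {
        self.slots.len()
    }

    /// Softmax attention over the physical slots, in whatever order they sit.
    pub fn attend(&self, q: f32) -> f32 {
        let scores: Vec<f32> = self.slots.iter().map(|(k, _)| q * k).collect();
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let total: f32 = weights.iter().sum();
        self.slots.iter().zip(&weights).map(|((_, v), w)| w * v).sum::<f32>() / total
    }
}
```

After wraparound the slots hold the last W entries in rotated order; attention over them matches attention over the same entries in arrival order up to float summation order.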

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The com.microsoft GQA op carries the sliding window as a local_window_size attribute
(verified: real Mistral-v0.1 exports encode it this way, all layers =4096) — not as an
explicit mask. tract was rejecting it outright, so those models failed to import.

Accept it and apply it as a banded causal mask in the existing concrete-shape mask path:
query i attends to key j iff j <= i AND i - j < window. window=0 stays plain causal.
Symbolic seq lengths bail (a static band can't be built; is_causal would silently widen
to full attention). windowed_causal_mask helper + unit test (band correctness).
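
The band rule can be written out directly. This sketch reuses the helper's name from the commit, but its signature and the `Vec<Vec<bool>>` return type are assumptions; `true` means query `i` may attend to key `j`.

```rust
/// Banded causal mask: query i attends to key j iff j <= i and i - j < window.
/// window == 0 means plain causal (no band limit).
pub fn windowed_causal_mask(seq_len: usize, window: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                // `j <= i` is checked first, so `i - j` cannot underflow.
                .map(|j| j <= i && (window == 0 || i - j < window))
                .collect()
        })
        .collect()
}
```

For example, with `seq_len = 4` and `window = 2`, the last query sees only keys 2 and 3.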

Pairs with the bounded ring-buffer KV cache for the decode-side efficiency (separate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stateful fused op owning K/V sliding-window ring buffers: each decode step appends
K_new/V_new and attends Q over the (<=window) bounded cache. The bounded cache IS the
sliding window, so attending over it equals windowed attention (causal=false: every
cached key is within the current query's window) -> constant memory + per-step cost,
losslessly. Op/EvalOp/TypedOp + OpState + freeze/unfreeze.

Validated: window_sdpa_decode_matches_last_w_in_model builds a real TypedModel, runs it
through tract's engine (into_runnable/spawn/run) over 15 decode steps with window=5, and
matches full attention over the last-W each step, cache bounded to W throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
WindowKvSdpaTransform { window }: strips the GQA broadcast chain then fuses
{DynKeyValueCache(K), DynKeyValueCache(V), Sdpa} -> WindowKvSdpa{window} (window threaded
via the Rewriter context), so an imported decode model uses the bounded sliding-window
cache. The window comes from the model (GQA local_window_size / config), supplied to the
transform. Validated: transform_fuses_cache_sdpa_to_windowed_decode (caches+Sdpa removed,
rewritten model does correct windowed decode vs full-attn-over-last-W).

NNEF ser/de (tract_transformers_window_kv_sdpa, registered) + round-trip test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#2327)

Per @kali's review: the two-arm match was a convoluted way to clamp negatives to 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirrors full.yml + cross-platform.yml. The workflow already runs on
`pull_request:`; this adds the manual dispatch path for testing a
specific PR (including from a fork) from the Actions tab. A new
`prepare` job derives `test_ref` from the `pr_number` input or
from the triggering pull_request, and every job now checks out at
that ref.
…os#2332)

The transform conflated two concerns: flipping external_state (a fixup for
NNEF artifacts predating the flag) and substituting the scan-axis symbol with
1 model-wide. The latter is the caller's per-call seq=1 contract, needed only
for declutter_single_loop's separate iters==1 gate, not implied by external
state. Keep the transform flag-only; harnesses now drive inlining with an
explicit -t set_symbols (T / TARGETS__TIME = 1) alongside the flag.
…omment (sonos#2334)

Supersedes dependabot sonos#2333. Pins the v6.1.2 commit (acca2b1b) and replaces
the floating '# v6' comment with '# v6.1.2'. zizmor's unpinned-uses flags a
hash pin whose comment tag no longer resolves to the pinned SHA; a major-only
tag drifts on every patch release, so use the exact version tag (immutable,
matches the SHA).
* build: sync Cargo.lock workspace versions to 0.23.1-pre

post-release 0.23.1-pre (688b476) bumped the crate manifests but left
Cargo.lock at 0.23.0, so every cargo build rewrote these 24 workspace-member
version entries and showed a spurious modified Cargo.lock.

* release: sync Cargo.lock in post-release version bump

post-release.sh edited manifests via tomato (which never invokes cargo) and
committed without regenerating Cargo.lock, shipping a lock that mismatched the
bumped versions. Add 'cargo update --workspace' before the commit. release.sh
avoids this incidentally because cargo publish re-resolves the lock first.
com.microsoft.RotaryEmbedding is identical math to the standardized
ai.onnx op but orders its inputs (input, position_ids, cos, sin). tract
resolves ops by name regardless of domain, so make the single handler
domain-aware and remap inputs accordingly. Rejects the contrib-only
scale != 1.0 and is_packed_batching attributes with clear errors.

Verified bit-exact against onnxruntime (3D, 4D, interleaved); ai.onnx
RotaryEmbedding conformance unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Curated behavioral subset of AGENTS.md (fmt, commit/comment style, model-edit
tooling, public API, test placement, PR etiquette) inlined so it lands in the
auto-loaded context; AGENTS.md stays the full reference.
The single "default to none" rule was suppressing doc comments along with
inline narration. Scope the austerity to inline comments and add a doc-comment
section that encourages concise item docs (contract, inputs, rule interactions)
while keeping the same no-benchmarks/no-history rule.
…c + seq-len lowering heuristic

P·V is computed as one contiguous tile GEMM (`s.dot(&vblock)`) instead of
`head_dim` strided per-column dots; the strided column access defeated
vectorization. Bit-exact (max_abs = 0 vs a naive softmax(QKᵀ·scale)·V ref).

The independent (batch, q-head) tasks now run across cores on rayon's global
pool — heads share only read-only Q/K/V and write disjoint output slices.
Disable with TRACT_FLASH_SDPA_ST=1; single-threaded on wasm. The op scales
~5x across an Apple M1 Pro's 6 performance cores (compute-bound, not memory-
bound in that range).

Sdpa::codegen gains a sequence-length heuristic: an f32 SDPA whose K/V length
is below TRACT_FLASH_SDPA_MIN_SEQ_LEN lowers to the decomposed matmul+softmax
path instead of FlashSdpaOp. Default 0 keeps flash for every length (with head
parallelism it beat the decomposed path at every size measured, 128–4096);
raise it on hosts where short-sequence decompose wins.
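
The lowering decision reduces to one comparison. In this sketch the `Lowering` enum and both function names are hypothetical; only the threshold semantics (K/V length below the minimum lowers to the decomposed path, default 0) come from the commit message, and the f32-only precondition is left to the caller.

```rust
#[derive(Debug, PartialEq)]
pub enum Lowering {
    Flash,
    Decomposed,
}

/// Parses the threshold; a missing or malformed value falls back to 0,
/// which keeps flash for every length.
pub fn parse_min_seq_len(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(0)
}

/// K/V lengths strictly below the threshold take the decomposed path.
pub fn choose_lowering(kv_len: usize, min_seq_len: usize) -> Lowering {
    if kv_len < min_seq_len {
        Lowering::Decomposed
    } else {
        Lowering::Flash
    }
}
```

A caller could feed it with `parse_min_seq_len(std::env::var("TRACT_FLASH_SDPA_MIN_SEQ_LEN").ok().as_deref())`.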

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add six f16 element-wise activations on x86 AVX-512: sigmoid_f16, tanh_f16,
hardswish_f16, leaky_relu_f16, silu_f16, gelu_f16. Each kernel chunks the
input through a 64-byte-aligned f32 scratch (CHUNK=256), dispatches to the
matching f32 AVX-512 kernel (the avx512_sigmoid_f32 / avx512_tanh_f32
wrappers, or the act:: hardswish / leaky_relu / silu / gelu kernels), and
converts back to f16. silu and gelu compose sigmoid_f32 / tanh_f32 with the
final combine done in f32.

The f16 <-> f32 conversion is driven by vcvtph2ps / vcvtps2ph via std::arch
intrinsics (cvt_f16_to_f32 / cvt_f32_to_f16 helpers); rustc + LLVM do not
autovectorize the scalar f16::to_f32 / f16::from_f32 loops, which is why a
naive port leaves AVX-512 stuck at ~7 Melem/s.

Wires into Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16 from
plug_avx512f; non-AVX512 x86 keeps the generic scalar f16 kernels. Validated
against the generic H<Op>8 reference via the existing *_frame_tests! macros
at SuperApproximate tolerance, which covers the precision delta between
scalar f16 arithmetic and f32-internal computation.

Measured on Cascade Lake (single-thread, throughput Gelem/s):
  - sigmoid_f16:    0.016 -> 1.54   (96x)
  - tanh_f16:       0.018 -> 1.61   (92x)
  - hardswish_f16:  0.051 -> 9.46   (186x)
  - leaky_relu_f16: 0.96  -> 10.4   (11x; generic baseline is unexpectedly fast)
  - silu_f16:       0.20  -> 0.93   (4.6x)
  - gelu_f16:       0.11  -> 0.75   (6.7x)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kali kali force-pushed the feat/avx512-activations-f16 branch from a921dd9 to c696d59 Compare June 4, 2026 11:24
kali pushed a commit that referenced this pull request Jun 8, 2026
kali pushed a commit to sonos/tract that referenced this pull request Jun 8, 2026