MatMulNBits / Q4_0: int4-compute CPU kernel — W4A8 (int8-dot) or W4A16 (f16-unpack)? #2341

Follow-up to #2340 (int4 `MatMulNBits` → fused `Q4_0` block-quant matmul) — a direction
question before I build the next piece. @kali

## Where #2340 leaves things
#2340 keeps the weight as a `Q4_0` constant instead of expanding it to f32 at load. What each
backend then does differs:
- **Metal**: the `Q4_0` weight goes straight to the existing ggml `mul_mv_q4_0` kernel — true
  int4 compute (locally ~2.3× vs f32 at K=N=4096, M=1).
- **CPU**: the block-quant packer dequantizes `Q4_0` → f32 and runs the f32 microkernel. So CPU
  gets the memory win (~7× smaller weights) but only a bandwidth-grade speedup (~1.1–1.33×),
  not int4 compute.
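
For reference, the dequantize step described above can be sketched as follows. This is a minimal illustration of the ggml `Q4_0` layout (32 weights per block, one scale, 16 packed bytes), not tract's actual packer: the struct and function names are hypothetical, and the scale is held as `f32` here only to keep the sketch std-only (the real format stores an f16). It also shows where the ~7× figure comes from: 128 bytes of f32 versus 18 bytes of `Q4_0` per 32 weights.

```rust
/// Weights per Q4_0 block (ggml's block size).
const QK: usize = 32;

/// Simplified Q4_0 block, for illustration only. The real format stores the
/// scale as f16; f32 is an assumption made to avoid external crates.
pub struct BlockQ4_0 {
    pub scale: f32,
    pub quants: [u8; QK / 2],
}

/// Expand one block to f32. Low nibbles hold elements 0..16, high nibbles
/// hold elements 16..32, and each nibble is stored with a +8 offset.
pub fn dequantize_block(block: &BlockQ4_0) -> [f32; QK] {
    let mut out = [0.0f32; QK];
    for (i, &byte) in block.quants.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[i] = lo as f32 * block.scale;
        out[i + QK / 2] = hi as f32 * block.scale;
    }
    out
}

/// Size ratio of f32 weights to Q4_0 weights, counting the 2-byte f16 scale.
pub fn compression_ratio() -> f32 {
    let q4_0_bytes = 2 + QK / 2; // 18 bytes per 32 weights
    (QK * 4) as f32 / q4_0_bytes as f32
}
```

Every weight goes through this expansion before the f32 microkernel sees it, which is why the CPU path saves memory traffic but not arithmetic.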

So there's clear CPU headroom. Since int4 isn't a native compute type on CPU, a "true int4 CPU
kernel" means unpacking to int8 **or** f16 and riding an existing GEMM, and tract already has both.
That gives two paths, with a numerical-contract tradeoff:

## Option A — int8-dot (W4A8)
Unpack int4 → int8, quantize activations → int8, run the existing int8 GEMM kernels
(`sdot` #2278 / SME `smopa` #2279 / Intel AMX+VNNI #2339 / apple_amx).
- **Fastest** — rides all the int8 paths already in tree.
- **But**: activations become int8 too → effectively **W4A8**, a different contract than #2340
  (this is the llama.cpp `Q4_0 × Q8_0` approach; near-lossless in practice, but it *is* a change).
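
To make the contract change concrete, here is a scalar sketch of one block of a W4A8 dot product: activations are quantized to int8 with a per-block scale (in the style of `Q8_0`), the products accumulate in `i32`, and the weight and activation scales are combined at the end. All function names are hypothetical, the weight scale is `f32` for simplicity, and a real kernel would replace the inner loop with the int8 GEMM paths listed above.

```rust
const QK: usize = 32;

/// Quantize one block of activations to int8 with a single per-block scale.
pub fn quantize_q8(x: &[f32; QK]) -> (f32, [i8; QK]) {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut q = [0i8; QK];
    for (qi, xi) in q.iter_mut().zip(x.iter()) {
        *qi = (xi * inv).round() as i8;
    }
    (scale, q)
}

/// Unpack 16 packed bytes into 32 signed values in [-8, 7].
pub fn unpack_q4(quants: &[u8; QK / 2]) -> [i8; QK] {
    let mut out = [0i8; QK];
    for (i, &byte) in quants.iter().enumerate() {
        out[i] = (byte & 0x0F) as i8 - 8;
        out[i + QK / 2] = (byte >> 4) as i8 - 8;
    }
    out
}

/// W4A8: integer accumulation, then the dual-scale combine.
pub fn dot_w4a8(w_scale: f32, w_quants: &[u8; QK / 2], x: &[f32; QK]) -> f32 {
    let (x_scale, xq) = quantize_q8(x);
    let wq = unpack_q4(w_quants);
    let acc: i32 = wq.iter().zip(xq.iter()).map(|(&w, &a)| w as i32 * a as i32).sum();
    w_scale * x_scale * acc as f32
}

/// Reference with unquantized f32 activations (the current contract).
pub fn dot_ref(w_scale: f32, w_quants: &[u8; QK / 2], x: &[f32; QK]) -> f32 {
    let wq = unpack_q4(w_quants);
    wq.iter().zip(x.iter()).map(|(&w, &a)| w as f32 * w_scale * a).sum()
}
```

The difference between `dot_w4a8` and `dot_ref` comes entirely from rounding the activations, so it is bounded by half an activation quantization step per element.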

## Option B — f16-unpack (W4A16)
Unpack int4 → f16, use the existing f16 FMA kernels (`fmla` on NEON-fp16 / SVE).
- **Keeps #2340's exact contract** (f16/f32 activations).
- Expected to be faster than the current f32 dequant path but slower than int8-dot. **No new ISA needed.**

## The shared part
Either way, the only new code is ISA-agnostic: a `Q4_0` → {int8 | f16} panel extractor (a variant
of `extract_panel`), plus — for A — on-the-fly activation quant + dual-scale combine. It reuses the
per-arch kernels already in tree; the one missing ISA piece is NEON `smmla` (I8MM), which would be a
bonus, not a prerequisite.
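
As a rough illustration of the int8 flavour of that extractor, the sketch below unpacks one k-block for a panel of `R` rows into a k-major int8 buffer, keeping the per-row scales on the side. Everything here is an assumption for discussion: the function name, the `BlockQ4_0` struct, the `f32` scale, and the `out[k * R + r]` layout are hypothetical and do not reflect the actual `extract_panel` signature or the packing any given kernel expects.

```rust
const QK: usize = 32;

/// Simplified Q4_0 block (hypothetical; the real scale is f16).
pub struct BlockQ4_0 {
    pub scale: f32,
    pub quants: [u8; QK / 2],
}

/// Unpack one k-block of an R-row panel to int8.
/// Values are laid out k-major: values[k * R + r] is row r, column k.
/// Scales are returned separately so the kernel can apply them after the
/// integer accumulation.
pub fn extract_panel_i8<const R: usize>(rows: [&BlockQ4_0; R]) -> (Vec<i8>, [f32; R]) {
    let mut values = vec![0i8; QK * R];
    let mut scales = [0.0f32; R];
    for (r, block) in rows.iter().enumerate() {
        scales[r] = block.scale;
        for (i, &byte) in block.quants.iter().enumerate() {
            // Low nibble is column i, high nibble is column i + 16.
            values[i * R + r] = (byte & 0x0F) as i8 - 8;
            values[(i + QK / 2) * R + r] = (byte >> 4) as i8 - 8;
        }
    }
    (values, scales)
}
```

An f16 variant would have the same shape, with the scale folded into each value at unpack time instead of being carried alongside.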

## Question
For the CPU int4 path, which numerical contract do you want — **A (W4A8, max speed)** or
**B (W4A16, preserve precision)**? Or both — e.g. B as the safe default and A as an opt-in fast
path? Happy to build whichever you prefer.

