MatMulNBits / Q4_0: int4-compute CPU kernel — W4A8 (int8-dot) or W4A16 (f16-unpack)? #2341

@czoli1976

Follow-up to #2340 (int4 MatMulNBits → fused Q4_0 block-quant matmul) — a direction
question before I build the next piece. @kali

Where #2340 leaves things

#2340 keeps the weight as a Q4_0 constant instead of expanding it to f32 at load. What each
backend then does differs:

  • Metal: the Q4_0 weight goes straight to the existing ggml mul_mv_q4_0 kernel — true
    int4 compute (locally ~2.3× vs f32 at K=N=4096, M=1).
  • CPU: the block-quant packer dequantizes Q4_0 → f32 and runs the f32 microkernel. So CPU
    gets the memory win (~7× smaller weights) but only a bandwidth-grade speedup (~1.1–1.33×),
    not int4 compute.
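For reference, a minimal sketch of what the current CPU path does per block, assuming the ggml Q4_0 convention (32 weights per block, one scale, 16 bytes of packed nibbles, value = (q - 8) * scale, low nibbles first). `Q4Block` and both function names are hypothetical, not tract's API, and the scale is held as f32 here (ggml stores f16) so the sketch needs no external crate.

```rust
/// Number of weights per Q4_0 block (ggml convention).
pub const QK: usize = 32;

/// Hypothetical view of one Q4_0 block. ggml stores the scale as f16;
/// it is widened to f32 here to avoid depending on an f16 crate.
pub struct Q4Block {
    pub scale: f32,
    pub quants: [u8; QK / 2],
}

/// Unpack the 4-bit codes to signed integers in [-8, 7].
/// Assumed ggml order: byte j holds element j in its low nibble
/// and element j + 16 in its high nibble.
pub fn unpack_q4_0(block: &Q4Block) -> [i8; QK] {
    let mut out = [0i8; QK];
    for (j, &byte) in block.quants.iter().enumerate() {
        out[j] = (byte & 0x0F) as i8 - 8;
        out[j + QK / 2] = (byte >> 4) as i8 - 8;
    }
    out
}

/// Expand one block to f32, which is what the f32 microkernel then consumes.
pub fn dequantize_q4_0(block: &Q4Block) -> [f32; QK] {
    let q = unpack_q4_0(block);
    let mut out = [0f32; QK];
    for i in 0..QK {
        out[i] = q[i] as f32 * block.scale;
    }
    out
}
```

With an f16 scale a block is 18 bytes for 32 weights, against 128 bytes in f32, which is where the ~7× figure comes from.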

So there's clear CPU headroom. Since int4 is a storage format rather than a compute type, a
"true int4 CPU kernel" means unpacking to int8 or f16 and riding an existing GEMM, and tract
already has both. Two paths, with a numerical-contract tradeoff:

Option A — int8-dot (W4A8)

Unpack int4 → int8, quantize activations → int8, run the existing int8 GEMM kernels
(sdot #2278 / SME smopa #2279 / Intel AMX+VNNI #2339 / apple_amx).
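To make the numerical contract of Option A concrete, here is a scalar sketch of one 32-element block: activations are quantized to int8 with a symmetric per-block scale (amax / 127, in the style of ggml's Q8_0), the dot product accumulates in i32, and the weight and activation scales are applied once at the end. The function names and the per-block activation scale are assumptions for illustration. The real path would feed the unpacked int8 panels to the existing kernels rather than loop like this.

```rust
/// Quantize 32 f32 activations to int8 with one symmetric per-block scale.
/// Returns (scale, quantized values); an all-zero block gets scale 0.
pub fn quantize_block_i8(x: &[f32; 32]) -> (f32, [i8; 32]) {
    let amax = x.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut q = [0i8; 32];
    for (qi, xi) in q.iter_mut().zip(x.iter()) {
        // `as i8` saturates, so rounding noise cannot wrap around.
        *qi = (xi * inv).round() as i8;
    }
    (scale, q)
}

/// W4A8 dot product of one Q4_0 block against 32 activations.
/// `w_packed` uses the assumed ggml nibble order: byte j holds element j
/// (low nibble) and element j + 16 (high nibble), each offset by 8.
pub fn dot_block_w4a8(w_scale: f32, w_packed: &[u8; 16], x: &[f32; 32]) -> f32 {
    let (x_scale, xq) = quantize_block_i8(x);
    let mut acc: i32 = 0;
    for (j, &byte) in w_packed.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        acc += lo * xq[j] as i32 + hi * xq[j + 16] as i32;
    }
    // Dual-scale combine: one multiply per block, outside the integer loop.
    w_scale * x_scale * acc as f32
}
```

The precision cost relative to Option B comes entirely from `quantize_block_i8`: the weights are represented exactly as integers, while each activation picks up a rounding error of up to half a quantization step.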

Option B — f16-unpack (W4A16)

Unpack int4 → f16, use the existing f16 FMA kernels (fmla on NEON-fp16 / SVE).

The shared part

Either way, the only new code is ISA-agnostic: a Q4_0 → {int8 | f16} panel extractor (a variant
of extract_panel), plus — for A — on-the-fly activation quant + dual-scale combine. It reuses the
per-arch kernels already in tree; the one missing ISA piece is NEON smmla (I8MM), which would be a
bonus, not a prerequisite.
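Spelled out, the "dual-scale combine" for A could look like the following, assuming one weight scale $d^{w}_{b}$ and one activation scale $d^{x}_{b}$ per 32-element block $b$ (the activation-scale granularity is an assumption, not something #2340 fixes):

$$
y \;\approx\; \sum_{b} d^{w}_{b}\, d^{x}_{b} \sum_{i=0}^{31} \left(q_{b,i} - 8\right) \hat{x}_{b,i},
\qquad
\hat{x}_{b,i} = \operatorname{round}\!\left(\frac{x_{b,i}}{d^{x}_{b}}\right),
\quad
d^{x}_{b} = \frac{\max_i \lvert x_{b,i} \rvert}{127}
$$

The inner sum is pure integer arithmetic, which is what the int8 kernels compute. For B there is no $d^{x}_{b}$ and no rounding of $x$: only the weight scale is applied, and the products are formed in f16.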

Question

For the CPU int4 path, which numerical contract do you want — A (W4A8, max speed) or
B (W4A16, preserve precision)? Or both — e.g. B as the safe default and A as an opt-in fast
path? Happy to build whichever you prefer.
