Follow-up to #2340 (int4 MatMulNBits → fused Q4_0 block-quant matmul) — a direction
question before I build the next piece. @kali
Where #2340 leaves things
#2340 keeps the weight as a Q4_0 constant instead of expanding it to f32 at load. What each
backend then does differs:
- Metal: the Q4_0 weight goes straight to the existing ggml mul_mv_q4_0 kernel, which is true
int4 compute (locally ~2.3× vs f32 at K=N=4096, M=1).
- CPU: the block-quant packer dequantizes Q4_0 → f32 and runs the f32 microkernel. So CPU
gets the memory win (~7× smaller weights) but only a bandwidth-grade speedup (~1.1–1.33×),
not int4 compute.
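For reference, here is a minimal sketch of what the CPU path does per block today, and of where the ~7× figure comes from. It assumes the ggml Q4_0 layout (32 weights per block, one scale, low nibbles first); `Q4Block`, `dequantize_block` and `compression_vs_f32` are illustrative names, not tract's API, and the scale is an f32 here only to avoid a half-float dependency (the stored format uses f16).

```rust
/// Number of weights sharing one scale in a Q4_0 block.
pub const BLOCK_LEN: usize = 32;

/// Illustrative Q4_0 block: one scale plus 32 four-bit codes, two per byte.
/// Assumption: the real format stores the scale as f16, not f32.
pub struct Q4Block {
    pub scale: f32,
    pub quants: [u8; BLOCK_LEN / 2],
}

/// Expand one block to f32: weight = (code - 8) * scale.
/// In the ggml layout, byte j holds element j in its low nibble and
/// element j + 16 in its high nibble.
pub fn dequantize_block(block: &Q4Block) -> [f32; BLOCK_LEN] {
    let mut out = [0.0f32; BLOCK_LEN];
    for (j, &byte) in block.quants.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * block.scale;
        out[j + BLOCK_LEN / 2] = hi as f32 * block.scale;
    }
    out
}

/// Stored size ratio: 32 * 4 = 128 bytes in f32 versus
/// 2 (f16 scale) + 16 (packed codes) = 18 bytes in Q4_0.
pub fn compression_vs_f32() -> f64 {
    let q4_0_bytes = 2 + BLOCK_LEN / 2;
    let f32_bytes = BLOCK_LEN * 4;
    f32_bytes as f64 / q4_0_bytes as f64
}
```

The ratio works out to 128 / 18, a little over 7, which matches the memory win quoted above. The compute after `dequantize_block` is still plain f32, which is why the speedup stays bandwidth-grade.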
So there's clear CPU headroom. Since int4 isn't a compute type, a "true int4 CPU kernel" means
unpacking to int8 or f16 and riding an existing GEMM — and tract already has both. Two paths,
with a numerical-contract tradeoff:
Option A — int8-dot (W4A8)
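Unpack int4 → int8, quantize activations → int8, run the existing int8 GEMM kernels
(sdot #2278 / SME smopa #2279 / Intel AMX+VNNI #2339 / apple_amx). This is the llama.cpp
Q4_0 × Q8_0 approach: near-lossless in practice, but it is a change to the numerical contract.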
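To make the contract change concrete, here is a rough scalar sketch of one block of Option A: unpack the codes to int8, quantize the activations to int8 with a per-block scale, take an integer dot product, then apply both scales once. The function names are hypothetical, the nibble order assumes the ggml layout, and the per-block symmetric activation scale (max-abs / 127) is one common choice rather than something #2340 prescribes.

```rust
/// Unpack 16 packed bytes into 32 signed int8 codes in [-8, 7].
/// Assumes the ggml nibble order: low nibbles are elements 0..16,
/// high nibbles are elements 16..32.
pub fn unpack_to_i8(packed: &[u8; 16]) -> [i8; 32] {
    let mut out = [0i8; 32];
    for (j, &byte) in packed.iter().enumerate() {
        out[j] = (byte & 0x0F) as i8 - 8;
        out[j + 16] = (byte >> 4) as i8 - 8;
    }
    out
}

/// Symmetric per-block int8 quantization: a[i] is approximated by q[i] * scale.
pub fn quantize_activations(a: &[f32; 32]) -> ([i8; 32], f32) {
    let amax = a.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = amax / 127.0;
    // An all-zero block gets scale 0 and all-zero codes.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut q = [0i8; 32];
    for (qi, &v) in q.iter_mut().zip(a.iter()) {
        *qi = (v * inv).round() as i8;
    }
    (q, scale)
}

/// One block's contribution to the output: integer dot product in i32,
/// then the dual-scale combine (weight scale times activation scale).
pub fn block_dot_w4a8(w_codes: &[i8; 32], w_scale: f32, a: &[f32; 32]) -> f32 {
    let (a_q, a_scale) = quantize_activations(a);
    let acc: i32 = w_codes
        .iter()
        .zip(a_q.iter())
        .map(|(&w, &x)| w as i32 * x as i32)
        .sum();
    acc as f32 * w_scale * a_scale
}
```

The only lossy step relative to the current path is `quantize_activations`; the weight side is exact, since every 4-bit code fits in int8. The inner loop is the part the existing int8 kernels would replace.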
Option B — f16-unpack (W4A16)
Unpack int4 → f16, use the existing f16 FMA kernels (fmla on NEON-fp16 / SVE).
The shared part
Either way, the only new code is ISA-agnostic: a Q4_0 → {int8 | f16} panel extractor (a variant
of extract_panel), plus — for A — on-the-fly activation quant + dual-scale combine. It reuses the
per-arch kernels already in tree; the one missing ISA piece is NEON smmla (I8MM), which would be a
bonus, not a prerequisite.
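As a rough illustration of the shared piece, the sketch below extracts a panel of `r` rows from Q4_0 storage into an int8 buffer, with the per-block scales kept alongside. Everything here is an assumption made for illustration: `Q4Block` and `extract_panel_i8` are hypothetical, this is not the signature of tract's extract_panel, and the K-major interleaved layout (`codes[k * r + i]`) is only one plausible choice that a kernel might expect.

```rust
/// Illustrative Q4_0 block: 32 weights along K, ggml nibble order.
/// Assumption: the scale is shown as f32; the stored format uses f16.
pub struct Q4Block {
    pub scale: f32,
    pub quants: [u8; 16],
}

/// Extract `r` consecutive rows starting at `first_row` into an int8 panel.
/// `rows[m]` holds the blocks of row m along K; all rows are assumed to
/// have the same number of blocks.
///
/// Returns (codes, scales) where codes[k * r + i] is the code of row
/// first_row + i at depth k, and scales[b * r + i] is the scale of
/// block b of that row. Rows past the end of the matrix are zero-padded.
pub fn extract_panel_i8(
    rows: &[Vec<Q4Block>],
    first_row: usize,
    r: usize,
) -> (Vec<i8>, Vec<f32>) {
    let blocks_per_row = rows[first_row].len();
    let k = blocks_per_row * 32;
    let mut codes = vec![0i8; k * r];
    let mut scales = vec![0.0f32; blocks_per_row * r];
    for i in 0..r {
        let Some(row) = rows.get(first_row + i) else { continue };
        for (b, block) in row.iter().enumerate() {
            scales[b * r + i] = block.scale;
            let base = b * 32;
            for (j, &byte) in block.quants.iter().enumerate() {
                codes[(base + j) * r + i] = (byte & 0x0F) as i8 - 8;
                codes[(base + j + 16) * r + i] = (byte >> 4) as i8 - 8;
            }
        }
    }
    (codes, scales)
}
```

An f16 variant for Option B would have the same shape, writing `code * scale` as f16 instead of keeping codes and scales separate. Nothing in this loop is ISA-specific, which is the point of the paragraph above.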
Question
For the CPU int4 path, which numerical contract do you want — A (W4A8, max speed) or
B (W4A16, preserve precision)? Or both — e.g. B as the safe default and A as an opt-in fast
path? Happy to build whichever you prefer.