Add MXFP4 and NVFP4 quantization support by SuperMarioYL · Pull Request #99 · Andyyyy64/whichllm

SuperMarioYL · 2026-06-09T19:23:50Z

Closes #27.

What

Adds first-class support for the two 4-bit microscaling float quantization formats that recent local-LLM and hardware paths have started shipping:

- MXFP4: OCP Microscaling FP4 (E2M1 element + one E8M0 8-bit scale per block of 32), used by e.g. the GPT-OSS releases.
- NVFP4: NVIDIA FP4 (E2M1 element + one E4M3 8-bit scale per block of 16, plus a negligible per-tensor FP32 scale), targeting Blackwell-class hardware.

Why

NVFP4 was already half-wired: it appeared in the family-normalization, benchmark pre-strip, and prequantized-repo regexes, but was missing from the quantization tables. The net effect was a latent bug — an …-NVFP4 repo fell through infer_non_gguf_quant_type to the FP16 default, so it was labeled FP16 in the output and its VRAM was overestimated ~3.5× (2.0 vs 0.5625 bytes/weight). MXFP4 was not recognized anywhere, so --quant MXFP4 returned no candidates and MXFP4 repos were orphaned from their base family during grouping.

This PR makes both formats consistent across every path that already knew about the older quant formats.
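
As a rough illustration of the overestimate, the sketch below compares the weight-only footprint of one model sized at the FP16 default and at the NVFP4 rate. The two bytes-per-weight constants come from this PR; the 20B parameter count and the `weight_gib` helper are hypothetical and are not part of whichllm.

```python
# Bytes per weight quoted in this PR.
FP16_BYTES_PER_WEIGHT = 2.0
NVFP4_BYTES_PER_WEIGHT = 0.5625


def weight_gib(params: float, bytes_per_weight: float) -> float:
    """Weight-only footprint in GiB (hypothetical helper; ignores KV cache and overhead)."""
    return params * bytes_per_weight / 2**30


params = 20e9  # hypothetical 20B-parameter model, chosen only for illustration
as_fp16 = weight_gib(params, FP16_BYTES_PER_WEIGHT)
as_nvfp4 = weight_gib(params, NVFP4_BYTES_PER_WEIGHT)
overestimate = as_fp16 / as_nvfp4  # 2.0 / 0.5625 = 32/9, i.e. roughly 3.5x

print(f"FP16 fallback: {as_fp16:.1f} GiB, NVFP4: {as_nvfp4:.1f} GiB, ratio {overestimate:.2f}x")
```

The ratio is independent of the parameter count, so the same factor applies to any NVFP4 repo that fell through to the FP16 default.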

Changes

| Area | File | Change |
| --- | --- | --- |
| Bytes-per-weight | `data/quantization.py`, `engine/quantization.py` | MXFP4 `0.53125` (4.25 bits), NVFP4 `0.5625` (4.5 bits) |
| Quality penalty | `data/quantization.py`, `engine/quantization.py` | NVFP4 `0.05` (par with Q4_K_M; finer per-16 E4M3 scale), MXFP4 `0.06` (coarser per-32 E8M0 power-of-two scale) |
| Speed efficiency | `engine/performance.py` | NVFP4 `0.56`, MXFP4 `0.55` (native FP4 tensor-core paths are weight-read-bound) |
| Preference order | `data/quantization.py` | inserted in the 4-bit tier |
| ID parsing | `engine/quantization.py` | anchored `(^\|[-_/])mxfp4($\|[-_/])` / `nvfp4` patterns |
| GGUF filename parsing | `models/fetcher.py` | `_extract_quant_type` recognizes `.MXFP4.gguf` / `.NVFP4.gguf` |
| Family grouping | `models/grouper.py` | strip `-mxfp4` / `-nvfp4` suffixes |
| Benchmark pre-strip | `models/benchmark.py` | add `mxfp4` to the suffix alternation (nvfp4 already present) |

Output labeling needs no change — display.py already derives the label from effective_quant_type, which now returns the correct format.
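
The anchored ID patterns and the GGUF filename check listed in the table can be sketched roughly as follows. This is a minimal illustration, not the code from `engine/quantization.py` or `models/fetcher.py`: the function names `infer_fp4_quant` and `extract_fp4_from_gguf`, the case-insensitive matching, and the example IDs are all assumptions.

```python
import re
from typing import Optional

# Anchored patterns from the table: the token must sit between a
# separator (-, _, /) or the start/end of the string.
# Case-insensitive matching is an assumption of this sketch.
_FP4_ID_PATTERNS = {
    "MXFP4": re.compile(r"(^|[-_/])mxfp4($|[-_/])", re.IGNORECASE),
    "NVFP4": re.compile(r"(^|[-_/])nvfp4($|[-_/])", re.IGNORECASE),
}

# Hypothetical filename pattern for `.MXFP4.gguf` / `.NVFP4.gguf`.
_FP4_GGUF_PATTERN = re.compile(r"\.(MXFP4|NVFP4)\.gguf$", re.IGNORECASE)


def infer_fp4_quant(model_id: str) -> Optional[str]:
    """Return 'MXFP4' or 'NVFP4' if the ID carries an anchored token, else None."""
    for name, pattern in _FP4_ID_PATTERNS.items():
        if pattern.search(model_id):
            return name
    return None


def extract_fp4_from_gguf(filename: str) -> Optional[str]:
    """Return the FP4 format named in a GGUF filename, else None."""
    match = _FP4_GGUF_PATTERN.search(filename)
    return match.group(1).upper() if match else None


# Made-up IDs and filenames, used only to exercise the patterns.
print(infer_fp4_quant("example-org/some-model-NVFP4"))
print(infer_fp4_quant("example-org/some-model"))
print(extract_fp4_from_gguf("some-model.MXFP4.gguf"))
```

Anchoring on separators is what keeps a plain ID that merely contains the letters (for example inside a longer word) from being classified as FP4.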

Bytes-per-weight derivation

MXFP4: (4·32 + 8) / 32 = 4.25 bits = 0.53125 B/w
NVFP4: (4·16 + 8) / 16 = 4.5 bits = 0.5625 B/w (per-tensor FP32 scale amortizes to ~0)
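
The same arithmetic can be checked in a few lines of Python. The helper name `bytes_per_weight` is hypothetical; the element width, scale width, and block sizes are the ones stated above, and the per-tensor FP32 scale for NVFP4 is ignored as negligible.

```python
def bytes_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage per weight when each block shares one scale value."""
    bits = (element_bits * block_size + scale_bits) / block_size
    return bits / 8


# E2M1 elements are 4 bits; both scale types (E8M0, E4M3) are 8 bits.
mxfp4 = bytes_per_weight(element_bits=4, scale_bits=8, block_size=32)
nvfp4 = bytes_per_weight(element_bits=4, scale_bits=8, block_size=16)

print(mxfp4, nvfp4)
```

Halving the block size doubles the per-weight scale overhead (0.25 bits to 0.5 bits), which is the whole difference between the two table entries.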

Tests

Covers the issue's "Done when" list:

- parsing from model IDs (test_infer_mxfp4, test_infer_nvfp4) and from GGUF filenames (test_extract_quant_type_parses_fp4_gguf_filenames)
- a negative case confirming the anchored patterns don't false-match plain IDs
- weight-byte and VRAM estimation for both formats
- --quant MXFP4 returning a candidate on a runnable (Linux+NVIDIA) host and being correctly filtered out on a GGUF-only backend
- family grouping collapsing …-MXFP4 / …-NVFP4 onto the base family
- an output-label assertion that uses the same effective_quant_type call as display.py
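
The family-grouping behavior these tests exercise can be pictured with a small sketch. It is an assumption-laden illustration rather than the logic in `models/grouper.py`: the `family_key` function, the lowercasing step, and the example IDs are hypothetical; only the `-mxfp4` / `-nvfp4` suffix stripping comes from this PR.

```python
import re

# Strip a trailing -mxfp4 / -nvfp4 suffix; case-insensitivity is assumed here.
_FP4_SUFFIX = re.compile(r"-(mxfp4|nvfp4)$", re.IGNORECASE)


def family_key(model_id: str) -> str:
    """Hypothetical grouping key: the model ID without its FP4 suffix, lowercased."""
    return _FP4_SUFFIX.sub("", model_id).lower()


# Made-up repo IDs: both quantized variants should collapse onto the base ID.
ids = [
    "example-org/some-model",
    "example-org/some-model-MXFP4",
    "example-org/some-model-NVFP4",
]
keys = {family_key(model_id) for model_id in ids}
print(keys)
```

Without the suffix rule, the MXFP4 variant would produce its own key and stay orphaned from the base family, which is the pre-PR behavior described under Why.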

Full suite: 301 passed; ruff check . and ruff format --check . clean.

Recognize the OCP MXFP4 and NVIDIA NVFP4 4-bit microscaling float formats across parsing, family grouping, size/VRAM estimation, speed estimation, ranking and the --quant filter.

NVFP4 was previously referenced only in the family/benchmark/prequantized-repo regexes but absent from the quantization tables, so NVFP4 repositories were mislabeled as FP16 and their VRAM was overestimated (~2.0 vs 0.5625 bytes/weight). MXFP4 was not recognized anywhere, so --quant MXFP4 returned no candidates and MXFP4 repos were orphaned from their base family. This makes both formats consistent end-to-end.

- bytes/weight: MXFP4 0.53125 (E2M1 + E8M0 scale / 32), NVFP4 0.5625 (E2M1 + E4M3 scale / 16)
- quality penalty: NVFP4 0.05 (par with Q4_K_M; finer per-16 scale), MXFP4 0.06 (coarser per-32 power-of-two scale)
- parsed from model ids and GGUF filenames; family grouping and output labels follow automatically

Closes Andyyyy64#27

Andyyyy64 · 2026-06-10T04:22:55Z

Merged, thank you. I reproduced the latent bug you found before merging: on main, both nvidia/...-NVFP4 and ...-MXFP4 repos infer as FP16, so VRAM was overestimated about 3.5x, exactly as described. Bytes-per-weight derivations check out, 301 tests pass locally. Nice catch on the half-wired NVFP4 state.

Andyyyy64 merged commit d02da1f into Andyyyy64:main Jun 10, 2026
4 checks passed

Andyyyy64 merged 1 commit into Andyyyy64:main from SuperMarioYL:feature/mxfp4-nvfp4-quant
