Add MXFP4 and NVFP4 quantization support #99

Merged: Andyyyy64 merged 1 commit into Andyyyy64:main from SuperMarioYL:feature/mxfp4-nvfp4-quant on Jun 10, 2026

@SuperMarioYL

Contributor

Closes #27.

What

Adds first-class support for the two 4-bit microscaling float quantization formats that recent local-LLM and hardware paths have started shipping:

  • MXFP4 — OCP Microscaling FP4 (E2M1 element + one E8M0 8-bit scale per block of 32), used by e.g. the GPT-OSS releases.
  • NVFP4 — NVIDIA FP4 (E2M1 element + one E4M3 8-bit scale per block of 16, plus a negligible per-tensor FP32 scale), targeting Blackwell-class hardware.
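To make the two layouts concrete, here is a minimal sketch of how one MXFP4 block could be decoded. It assumes the standard E2M1 value set (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and an E8M0 scale read as a bare biased exponent, 2^(e - 127). The function names are hypothetical and not taken from this repository, and the codes are assumed to be already unpacked to one integer per weight, so nibble packing is not covered. An NVFP4 block would use the same element decode with a block of 16 and an E4M3 scale.

```python
# Magnitudes for the low 3 bits of a 4-bit E2M1 code; bit 3 is the sign bit.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

MXFP4_BLOCK_SIZE = 32


def decode_e2m1(code: int) -> float:
    """Decode one unpacked 4-bit E2M1 code (0..15) to a float."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def decode_e8m0(scale_byte: int) -> float:
    """Decode an E8M0 scale: a biased power-of-two exponent, 0xFF meaning NaN."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def dequantize_mxfp4_block(codes: list[int], scale_byte: int) -> list[float]:
    """Hypothetical helper: expand one block of 32 codes sharing one scale."""
    if len(codes) != MXFP4_BLOCK_SIZE:
        raise ValueError(f"expected {MXFP4_BLOCK_SIZE} codes, got {len(codes)}")
    scale = decode_e8m0(scale_byte)
    return [decode_e2m1(code) * scale for code in codes]
```

Because the E8M0 scale can only be a power of two, every block is rescaled by a factor of two at a time, which is the coarseness the quality-penalty figures in this PR refer to.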

Why

NVFP4 was already half-wired: it appeared in the family-normalization, benchmark pre-strip, and prequantized-repo regexes, but was missing from the quantization tables. The net effect was a latent bug — an …-NVFP4 repo fell through infer_non_gguf_quant_type to the FP16 default, so it was labeled FP16 in the output and its VRAM was overestimated ~3.5× (2.0 vs 0.5625 bytes/weight). MXFP4 was not recognized anywhere, so --quant MXFP4 returned no candidates and MXFP4 repos were orphaned from their base family during grouping.

This PR makes both formats consistent across every path that already knew about the older quant formats.

Changes

| Area | File | Change |
|---|---|---|
| Bytes-per-weight | data/quantization.py, engine/quantization.py | MXFP4 0.53125 (4.25 bits), NVFP4 0.5625 (4.5 bits) |
| Quality penalty | data/quantization.py, engine/quantization.py | NVFP4 0.05 (par with Q4_K_M; finer per-16 E4M3 scale), MXFP4 0.06 (coarser per-32 E8M0 power-of-two scale) |
| Speed efficiency | engine/performance.py | NVFP4 0.56, MXFP4 0.55 (native FP4 tensor-core paths are weight-read-bound) |
| Preference order | data/quantization.py | inserted in the 4-bit tier |
| ID parsing | engine/quantization.py | anchored `(^\|[-_/])mxfp4($\|[-_/])` / nvfp4 patterns |
| GGUF filename parsing | models/fetcher.py | _extract_quant_type recognizes `*.MXFP4.gguf` / `*.NVFP4.gguf` |
| Family grouping | models/grouper.py | strip -mxfp4 / -nvfp4 suffixes |
| Benchmark pre-strip | models/benchmark.py | add mxfp4 to the suffix alternation (nvfp4 already present) |
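For reviewers who want to see the anchoring behaviour, the following is a rough sketch of the two parsing paths, not the code in engine/quantization.py or models/fetcher.py. The anchored pattern is the one quoted in this PR; the function names, the case-insensitive flag, the GGUF filename pattern, and the example IDs are all assumptions made for illustration.

```python
import re
from typing import Optional

# Anchored so the token must sit between separators (or string ends).
# re.IGNORECASE is an assumption; the project may lowercase IDs first instead.
_FP4_ID_PATTERNS = {
    "MXFP4": re.compile(r"(^|[-_/])mxfp4($|[-_/])", re.IGNORECASE),
    "NVFP4": re.compile(r"(^|[-_/])nvfp4($|[-_/])", re.IGNORECASE),
}

# Hypothetical pattern for names shaped like *.MXFP4.gguf / *.NVFP4.gguf.
_FP4_GGUF_PATTERN = re.compile(r"\.(MXFP4|NVFP4)\.gguf$", re.IGNORECASE)


def infer_fp4_from_model_id(model_id: str) -> Optional[str]:
    """Return "MXFP4" or "NVFP4" if the ID carries that token, else None."""
    for quant_name, pattern in _FP4_ID_PATTERNS.items():
        if pattern.search(model_id):
            return quant_name
    return None


def infer_fp4_from_gguf_filename(filename: str) -> Optional[str]:
    """Return the FP4 quant type encoded in a GGUF filename, else None."""
    match = _FP4_GGUF_PATTERN.search(filename)
    return match.group(1).upper() if match else None
```

With these definitions, a made-up ID such as `example-org/example-7b-NVFP4` resolves to NVFP4, while `example-org/example-7b` and an ID that merely contains the letters inside a longer word (for example `example-org/remxfp4x`) resolve to None, which is the negative case the anchors exist for.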

Output labeling needs no change — display.py already derives the label from effective_quant_type, which now returns the correct format.

Bytes-per-weight derivation

  • MXFP4: (4·32 + 8) / 32 = 4.25 bits = 0.53125 B/w
  • NVFP4: (4·16 + 8) / 16 = 4.5 bits = 0.5625 B/w (per-tensor FP32 scale amortizes to ~0)
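The same arithmetic as a small check, including the FP16 fall-through ratio from the Why section. This is a sketch only: `bytes_per_weight` is a hypothetical helper rather than a function in this repository, the per-tensor FP32 scale is ignored as negligible, and the figures cover weights only (no KV cache or runtime overhead).

```python
def bytes_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage per weight when one scale is shared by a block."""
    bits = (element_bits * block_size + scale_bits) / block_size
    return bits / 8


MXFP4_BPW = bytes_per_weight(element_bits=4, scale_bits=8, block_size=32)
NVFP4_BPW = bytes_per_weight(element_bits=4, scale_bits=8, block_size=16)
FP16_BPW = 2.0

# How far off the estimate is when an NVFP4 repo falls through to FP16.
nvfp4_overestimate = FP16_BPW / NVFP4_BPW

print(MXFP4_BPW)           # 0.53125
print(NVFP4_BPW)           # 0.5625
print(nvfp4_overestimate)  # roughly 3.56
```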

Tests

Covers the issue's "Done when" list:

  • parsing from model IDs (test_infer_mxfp4, test_infer_nvfp4) and from GGUF filenames (test_extract_quant_type_parses_fp4_gguf_filenames)
  • a negative case confirming the anchored patterns don't false-match plain IDs
  • weight-byte and VRAM estimation for both formats
  • --quant MXFP4 returning a candidate on a runnable (Linux+NVIDIA) host and being correctly filtered out on a GGUF-only backend
  • family grouping collapsing …-MXFP4 / …-NVFP4 onto the base family
  • an output-label assertion that uses the same effective_quant_type call as display.py
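As an illustration of the grouping case in the list above, a test of this shape would exercise the suffix stripping. It is a sketch under assumptions: `strip_fp4_suffix` is a hypothetical stand-in for whatever models/grouper.py does internally, the case-insensitive match is assumed, and the model IDs are made up.

```python
import re

# Hypothetical stand-in for the grouper's suffix handling: drop a trailing
# -mxfp4 / -nvfp4 so the variant lands on its base family.
_FP4_SUFFIX = re.compile(r"-(mxfp4|nvfp4)$", re.IGNORECASE)


def strip_fp4_suffix(model_id: str) -> str:
    return _FP4_SUFFIX.sub("", model_id)


def test_fp4_variants_collapse_onto_base_family():
    base = "example-org/example-7b"
    assert strip_fp4_suffix(base + "-MXFP4") == base
    assert strip_fp4_suffix(base + "-NVFP4") == base


def test_plain_id_is_left_unchanged():
    base = "example-org/example-7b"
    assert strip_fp4_suffix(base) == base
```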

Full suite: 301 passed; ruff check . and ruff format --check . clean.

Recognize the OCP MXFP4 and NVIDIA NVFP4 4-bit microscaling float formats
across parsing, family grouping, size/VRAM estimation, speed estimation,
ranking and the --quant filter.

NVFP4 was previously referenced only in the family/benchmark/prequantized-repo
regexes but absent from the quantization tables, so NVFP4 repositories were
mislabeled as FP16 and their VRAM was overestimated (~2.0 vs 0.5625
bytes/weight). MXFP4 was not recognized anywhere, so --quant MXFP4 returned no
candidates and MXFP4 repos were orphaned from their base family. This makes
both formats consistent end-to-end.

- bytes/weight: MXFP4 0.53125 (E2M1 + E8M0 scale / 32), NVFP4 0.5625
  (E2M1 + E4M3 scale / 16)
- quality penalty: NVFP4 0.05 (par with Q4_K_M; finer per-16 scale),
  MXFP4 0.06 (coarser per-32 power-of-two scale)
- parsed from model ids and GGUF filenames; family grouping and output
  labels follow automatically

Closes Andyyyy64#27
Andyyyy64 merged commit d02da1f into Andyyyy64:main on Jun 10, 2026
4 checks passed
@Andyyyy64

Owner

Merged, thank you. I reproduced the latent bug you found before merging: on main, both nvidia/...-NVFP4 and ...-MXFP4 repos infer as FP16, so VRAM was overestimated about 3.5x, exactly as described. Bytes-per-weight derivations check out, 301 tests pass locally. Nice catch on the half-wired NVFP4 state.
