MLX nvfp4 uses signed E4M3 scales instead of unsigned UE4M3, yielding 137x less dynamic range than NVIDIA Blackwell and accuracy inferior to Marlin #2962

@Nottlespike

Description

Problem

MLX's nvfp4 implementation uses signed E4M3 (1 sign bit, 4-bit exponent with bias 7, 3-bit mantissa) for block scaling factors, but NVIDIA Blackwell GPUs use unsigned E4M3 (UE4M3) with no sign bit and no exponent bias, as specified in the PTX documentation.
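
To make the gap concrete, here is a minimal host-side C++ sketch (not MLX code) that decodes a scale byte under both interpretations. The signed path follows the standard E4M3 layout; the unsigned path assumes the unbiased exponent described above with the top bit unused, since the exact container layout is an assumption here.

#include <cmath>
#include <cstdint>
#include <cstdio>

// Signed E4M3 (what fp8_e4m3 implements): 1 sign bit, 4 exponent bits with
// bias 7, 3 mantissa bits. Max finite value is 0x7E -> 1.75 * 2^8 = 448;
// exponent 15 with mantissa 7 is reserved for NaN.
float decode_e4m3(uint8_t s) {
  int sign = (s >> 7) & 1;
  int exp = (s >> 3) & 0xF;
  int man = s & 0x7;
  float v;
  if (exp == 0) {
    v = std::ldexp(man / 8.0f, -6);  // subnormal: (man/8) * 2^(1-7)
  } else if (exp == 0xF && man == 0x7) {
    v = NAN;
  } else {
    v = std::ldexp(1.0f + man / 8.0f, exp - 7);  // normal
  }
  return sign ? -v : v;
}

// Unsigned E4M3 (UE4M3) as described above: no sign bit, no exponent bias.
// Assumed layout: bit 7 unused, bits 6:3 exponent, bits 2:0 mantissa.
// Zero/subnormal handling is omitted; only the range matters for this demo.
float decode_ue4m3(uint8_t s) {
  int exp = (s >> 3) & 0xF;
  int man = s & 0x7;
  return std::ldexp(1.0f + man / 8.0f, exp);
}

int main() {
  float e4m3_max = decode_e4m3(0x7E);    // 448
  float ue4m3_max = decode_ue4m3(0x7F);  // 61440
  std::printf("E4M3 max = %g, UE4M3 max = %g, ratio = %.1fx\n",
              e4m3_max, ue4m3_max, ue4m3_max / e4m3_max);  // ~137.1x
}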

Dynamic Range Comparison

  • MLX nvfp4 scales: Max value 448 ((1 + 3/4) × 2^8)
  • NVIDIA UE4M3 scales: Max value ~61,440 ((1+7/8) × 2^15)
  • Difference: NVIDIA has 137x more dynamic range for scale factors

This means MLX's nvfp4 will saturate/clip on large activations and weights that NVIDIA's implementation handles without issue. For example, a block whose absolute maximum is 3,000 needs a scale of roughly 3,000 / 6 = 500 (FP4 E2M1 tops out at ±6), which already exceeds E4M3's 448 ceiling. Combined with the fact that Marlin's MR-GPTQ quantization for FP4 achieves superior accuracy (recovering 96.1% of FP16 accuracy), MLX's current approach is both format-incompatible and suboptimal.

Code References

Metal backend scale dequantization (mlx/backend/metal/kernels/fp_quantized.h:30-38):

template <typename T, int group_size>
static inline T dequantize_scale(uint8_t s) {
  if constexpr (group_size == 16) {
    // Use nv scale - BUT THIS IS SIGNED E4M3, NOT UE4M3! 
    return T(*(thread fp8_e4m3*)(&s));  
  } else {
    return T(*(thread fp8_e8m0*)(&s));
  }
}

Metal quantize kernel (mlx/backend/metal/kernels/fp_quantized.h:1763):

using ScaleType = metal::conditional_t<use_mx_scale, fp8_e8m0, fp8_e4m3>;
auto s = ScaleType(scale);  // Wrong: should be UE4M3

FP8 E4M3 struct (mlx/backend/metal/kernels/fp8.h:1-48) implements signed E4M3, not unsigned.
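
For reference, here is a minimal sketch of what the unsigned counterpart could look like, mirroring the conversion-operator style of the existing struct. fp8_ue4m3 is a hypothetical name, and the bit layout (no sign, unbiased exponent, unused top bit) is an assumption, not a settled design:

struct fp8_ue4m3 {
  uint8_t bits;
  operator float() const {
    // Unsigned, no bias: value = (1 + mantissa/8) * 2^exponent.
    int e = (bits >> 3) & 0xF;
    int m = bits & 0x7;
    return (1.0f + m / 8.0f) * metal::exp2(float(e));
  }
};

With a type like this in place, the group_size == 16 branch of dequantize_scale above could cast to fp8_ue4m3 instead of fp8_e4m3.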

Recommendation

  1. Implement a UE4M3 format for nvfp4 scale factors (unsigned, 4-bit exponent without bias, 3-bit mantissa); see the encode sketch after this list
  2. Consider adding MR-GPTQ support as Marlin's approach substantially improves FP4 quantization quality
  3. Update both Metal and CUDA backends to match NVIDIA's Blackwell specification
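
As a starting point for recommendation 1, here is a hedged host-side C++ sketch of the encode direction (float scale to UE4M3 byte) under the same layout assumption. The underflow and rounding behavior are placeholders; the real semantics would need to come from the PTX documentation.

#include <cmath>
#include <cstdint>

// Encode a positive float scale as a UE4M3 byte (illustrative only, not an
// MLX API). Assumed layout: bits 6:3 = exponent (no bias), bits 2:0 = mantissa.
uint8_t encode_ue4m3(float scale) {
  if (!(scale > 0.0f)) return 0;         // zero/negative/NaN: placeholder
  int exp;
  float frac = std::frexp(scale, &exp);  // scale = frac * 2^exp, frac in [0.5, 1)
  exp -= 1;                              // rewrite as (frac*2) * 2^exp, frac*2 in [1, 2)
  if (exp < 0) return 0;                 // below 1.0, the unbiased format's minimum
  if (exp > 15) return 0x7F;             // saturate at (1 + 7/8) * 2^15 = 61440
  int man = (int)std::lround((frac * 2.0f - 1.0f) * 8.0f);  // 3-bit mantissa, round-to-nearest
  if (man == 8) {                        // rounded past 1.875: carry into the exponent
    man = 0;
    if (++exp > 15) return 0x7F;
  }
  return (uint8_t)((exp << 3) | man);
}

A quick sanity check: encode_ue4m3(448.0f) yields 0x46 (1.75 × 2^8), and encode_ue4m3(61440.0f) yields 0x7F, the format's ceiling, whereas signed E4M3 would already have saturated at 448.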

This will:

  • Achieve full dynamic range parity with NVIDIA hardware (137x improvement)
  • Enable quantization of models with large magnitude weights/activations
  • Improve compatibility with NVIDIA-quantized models
  • Open path to superior MR-GPTQ quantization methods
