MLX nvfp4 uses signed E4M3 scales instead of unsigned UE4M3, yielding 137x less dynamic range than NVIDIA Blackwell and accuracy inferior to Marlin #2962

@Nottlespike

Description

Problem

MLX's nvfp4 implementation uses signed E4M3 (1 sign bit, 4-bit exponent with bias 7, 3-bit mantissa) for block scaling factors, but NVIDIA Blackwell GPUs use unsigned E4M3 (UE4M3) with no sign bit and no exponent bias, as specified in the PTX documentation.
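
To make the gap concrete, here is a minimal host-side C++ sketch (not MLX code) that decodes a scale byte under both interpretations. The signed path follows the standard E4M3 layout; the unsigned path assumes the unbiased exponent described above with the top bit unused, since the exact container layout is an assumption here.

#include <cmath>
#include <cstdint>
#include <cstdio>

// Signed E4M3 (what fp8_e4m3 implements): 1 sign bit, 4 exponent bits with
// bias 7, 3 mantissa bits. Max finite value is 0x7E -> 1.75 * 2^8 = 448;
// exponent 15 with mantissa 7 is reserved for NaN.
float decode_e4m3(uint8_t s) {
  int sign = (s >> 7) & 1;
  int exp = (s >> 3) & 0xF;
  int man = s & 0x7;
  float v;
  if (exp == 0) {
    v = std::ldexp(man / 8.0f, -6);  // subnormal: (man/8) * 2^(1-7)
  } else if (exp == 0xF && man == 0x7) {
    v = NAN;
  } else {
    v = std::ldexp(1.0f + man / 8.0f, exp - 7);  // normal
  }
  return sign ? -v : v;
}

// Unsigned E4M3 (UE4M3) as described above: no sign bit, no exponent bias.
// Assumed layout: bit 7 unused, bits 6:3 exponent, bits 2:0 mantissa.
// Zero/subnormal handling is omitted; only the range matters for this demo.
float decode_ue4m3(uint8_t s) {
  int exp = (s >> 3) & 0xF;
  int man = s & 0x7;
  return std::ldexp(1.0f + man / 8.0f, exp);
}

int main() {
  float e4m3_max = decode_e4m3(0x7E);    // 448
  float ue4m3_max = decode_ue4m3(0x7F);  // 61440
  std::printf("E4M3 max = %g, UE4M3 max = %g, ratio = %.1fx\n",
              e4m3_max, ue4m3_max, ue4m3_max / e4m3_max);  // ~137.1x
}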

Dynamic Range Comparison

  • MLX nvfp4 scales: Max value 448 ((1 + 3/4) × 2^8)
  • NVIDIA UE4M3 scales: Max value ~61,440 ((1+7/8) × 2^15)
  • Difference: NVIDIA has 137x more dynamic range for scale factors

This means MLX's nvfp4 will saturate/clip on large activations and weights that NVIDIA's implementation handles without issue. For example, a block whose absolute maximum is 3,000 needs a scale of roughly 3,000 / 6 = 500 (FP4 E2M1 tops out at ±6), which already exceeds E4M3's 448 ceiling. Combined with the fact that Marlin's MR-GPTQ quantization for FP4 achieves superior accuracy (recovering 96.1% of FP16 accuracy), MLX's current approach is both format-incompatible and suboptimal.

Code References

Metal backend scale dequantization (mlx/backend/metal/kernels/fp_quantized.h:30-38):

template <typename T, int group_size>
static inline T dequantize_scale(uint8_t s) {
  if constexpr (group_size == 16) {
    // Use nv scale - BUT THIS IS SIGNED E4M3, NOT UE4M3! 
    return T(*(thread fp8_e4m3*)(&s));  
  } else {
    return T(*(thread fp8_e8m0*)(&s));
  }
}

Metal quantize kernel (mlx/backend/metal/kernels/fp_quantized.h:1763):

using ScaleType = metal::conditional_t<use_mx_scale, fp8_e8m0, fp8_e4m3>;
auto s = ScaleType(scale);  // Wrong: should be UE4M3

FP8 E4M3 struct (mlx/backend/metal/kernels/fp8.h:1-48) implements signed E4M3, not unsigned.
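
For reference, here is a minimal sketch of what the unsigned counterpart could look like, mirroring the conversion-operator style of the existing struct. fp8_ue4m3 is a hypothetical name, and the bit layout (no sign, unbiased exponent, unused top bit) is an assumption, not a settled design:

struct fp8_ue4m3 {
  uint8_t bits;
  operator float() const {
    // Unsigned, no bias: value = (1 + mantissa/8) * 2^exponent.
    int e = (bits >> 3) & 0xF;
    int m = bits & 0x7;
    return (1.0f + m / 8.0f) * metal::exp2(float(e));
  }
};

With a type like this in place, the group_size == 16 branch of dequantize_scale above could cast to fp8_ue4m3 instead of fp8_e4m3.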

Recommendation

  1. Implement a UE4M3 format for nvfp4 scale factors (unsigned, 4-bit exponent without bias, 3-bit mantissa); see the encode sketch after this list
  2. Consider adding MR-GPTQ support as Marlin's approach substantially improves FP4 quantization quality
  3. Update both Metal and CUDA backends to match NVIDIA's Blackwell specification
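
As a starting point for recommendation 1, here is a hedged host-side C++ sketch of the encode direction (float scale to UE4M3 byte) under the same layout assumption. The underflow and rounding behavior are placeholders; the real semantics would need to come from the PTX documentation.

#include <cmath>
#include <cstdint>

// Encode a positive float scale as a UE4M3 byte (illustrative only, not an
// MLX API). Assumed layout: bits 6:3 = exponent (no bias), bits 2:0 = mantissa.
uint8_t encode_ue4m3(float scale) {
  if (!(scale > 0.0f)) return 0;         // zero/negative/NaN: placeholder
  int exp;
  float frac = std::frexp(scale, &exp);  // scale = frac * 2^exp, frac in [0.5, 1)
  exp -= 1;                              // rewrite as (frac*2) * 2^exp, frac*2 in [1, 2)
  if (exp < 0) return 0;                 // below 1.0, the unbiased format's minimum
  if (exp > 15) return 0x7F;             // saturate at (1 + 7/8) * 2^15 = 61440
  int man = (int)std::lround((frac * 2.0f - 1.0f) * 8.0f);  // 3-bit mantissa, round-to-nearest
  if (man == 8) {                        // rounded past 1.875: carry into the exponent
    man = 0;
    if (++exp > 15) return 0x7F;
  }
  return (uint8_t)((exp << 3) | man);
}

A quick sanity check: encode_ue4m3(448.0f) yields 0x46 (1.75 × 2^8), and encode_ue4m3(61440.0f) yields 0x7F, the format's ceiling, whereas signed E4M3 would already have saturated at 448.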

This will:

  • Achieve full dynamic range parity with NVIDIA hardware (137x improvement)
  • Enable quantization of models with large magnitude weights/activations
  • Improve compatibility with NVIDIA-quantized models
  • Open path to superior MR-GPTQ quantization methods
