GPTQ-Pro is an experimental, performance-focused fork of
ModelCloud/GPTQModel, built around a single
quantization path: a custom GPTQ INT4 CUDA kernel tuned for Ampere-class NVIDIA GPUs
(RTX 3090, RTX 3060, A100 — sm_80/sm_86/sm_87).
All other backends (AWQ, Marlin, ExllamaV2/V3, BitBLAS, Machete, QQQ, BitsAndBytes, GGUF, FP8, RTN, VLLM, SGLang, MLX) have been removed. GPTQ-Pro is the only inference and quantization path.
This is not the official ModelCloud release.
For stable upstream usage, use ModelCloud/GPTQModel.
📋 See docs/ASSESSMENT_AND_ROADMAP.md for a fact-checked assessment of current quantization quality and a prioritized, Ampere-focused improvement roadmap.
A stripped-down, single-backend research build with three goals:
- Clean kernel development surface — one CUDA kernel to optimize, no multi-backend branching in hot paths.
- Maximum quantization quality — the full GPTQ quality toolbox is available and documented: act-order (GAR), activation-weighted MSE scale search, GPTAQ error feedback, FOEM 1st-order error minimization, Hadamard incoherence rotation, and adaptive Cholesky damping.
- Ampere-first — build flags, kernel design, and validation are all centered on sm_80/sm_86/sm_87 consumer and datacenter cards.
| Dimension | Values |
|---|---|
| FORMAT | GPTQ, GPTQ_V2 |
| METHOD | GPTQ |
| BACKEND | AUTO, AUTO_TRAINABLE, GPTQ_PRO |
gptqmodel_ext/gptq_pro/ — custom Ampere INT4 dequant GEMM:
- mma.sync FP32 accumulation on Tensor Cores
- Symmetric INT4 weight packing (4-bit nibble, group-based scales)
- Priority 120 — always selected over any fallback
Current state is a performance scaffold (one warp/CTA, no cp.async pipeline, no GEMV
decode path). See the roadmap doc for the planned improvements.
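The symmetric INT4 packing with group-based scales described above can be illustrated with a minimal pure-Python sketch. It is not the kernel's actual layout: the function names, the two's-complement nibble encoding, and the lowest-nibble-first word order are all assumptions made for illustration, and the real kernel may store codes differently (for example, as unsigned values with a fixed zero point).

```python
from typing import List, Tuple


def quantize_group_symmetric(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one group to signed 4-bit codes in [-8, 7] sharing one scale."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric: no zero point, the largest magnitude maps to +/-7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def pack_nibbles(codes: List[int]) -> List[int]:
    """Pack eight signed 4-bit codes per 32-bit word (assumed: low nibble first)."""
    assert len(codes) % 8 == 0, "code count must be a multiple of 8"
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            word |= (code & 0xF) << (4 * j)  # two's-complement nibble
        words.append(word)
    return words


def unpack_nibbles(words: List[int]) -> List[int]:
    """Inverse of pack_nibbles: recover signed codes from packed words."""
    codes = []
    for word in words:
        for j in range(8):
            nibble = (word >> (4 * j)) & 0xF
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes


# Round trip for a single group of eight weights.
group = [0.7, -0.35, 0.0, 0.1, -0.7, 0.2, 0.5, -0.1]
codes, scale = quantize_group_symmetric(group)
packed = pack_nibbles(codes)
dequantized = [c * scale for c in unpack_nibbles(packed)]
```

A dequant GEMM performs the unpack-and-scale step on the fly inside the kernel, so the weights stay in their packed 4-bit form in memory.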
All levers are optional and composable on top of plain GPTQ:
| Feature | Config field | What it does |
|---|---|---|
| Act-order / GAR | act_group_aware | Reorder columns by activation magnitude before grouping |
| MSE scale search | mse | Search for optimal per-group scale (0 = off, ~2.0 = recommended) |
| Activation-weighted MSE | activation_weighted_mse | Weight scale search by calibration activations |
| GPTAQ | gptaq | Activation-error feedback after each layer |
| FOEM | foem | 1st-order error minimization pass |
| Hadamard rotation | rotation="hadamard" | Incoherence processing for ≤3-bit |
| Adaptive damping | damp_percent | Cholesky regularization |
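To make the per-group scale search concrete, here is a minimal pure-Python sketch of a grid search that shrinks the clipping range and keeps the scale with the lowest reconstruction error. It follows the general shape of the search in the original GPTQ reference code, not this repository's implementation: the function names and the `grid` and `max_shrink` parameters are hypothetical, and treating the `mse` value as the error-norm exponent (`norm` below) is an assumption.

```python
from typing import List, Tuple


def quantize_with_scale(weights: List[float], scale: float, qmax: int = 7) -> List[float]:
    """Round to signed codes in [-qmax - 1, qmax], then dequantize."""
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def search_scale(
    weights: List[float],
    norm: float = 2.0,        # assumed to correspond to the `mse` setting
    grid: int = 100,          # hypothetical: number of grid steps
    max_shrink: float = 0.8,  # hypothetical: largest fraction of range to cut
    qmax: int = 7,
) -> Tuple[float, float]:
    """Return (best_scale, best_error) over a grid of shrunken clipping ranges."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return 1.0, 0.0
    best_scale, best_err = max_abs / qmax, float("inf")
    for step in range(int(max_shrink * grid)):
        shrink = 1.0 - step / grid  # step 0 is the plain max-abs scale
        scale = shrink * max_abs / qmax
        dequantized = quantize_with_scale(weights, scale, qmax)
        err = sum(abs(w - d) ** norm for w, d in zip(weights, dequantized))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


# A group with one outlier: a smaller scale may clip the outlier
# but represent the small weights more accurately.
group = [0.05, -0.04, 0.03, 0.02, -0.05, 0.01, 0.04, 1.0]
best_scale, best_err = search_scale(group, norm=2.0)
```

Because the first grid step is the unshrunk max-abs scale, the search can never do worse than plain rounding under the chosen error norm.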
A max_quality preset that enables all of the above is available:
```python
from gptqmodel import GPTQModel, QuantizeConfig

# Build a config with every quality lever enabled.
qcfg = QuantizeConfig.max_quality(bits=4, group_size=128)
model = GPTQModel.load("meta-llama/Llama-3.1-8B", quantize_config=qcfg)

# calibration_dataset must be prepared beforehand (for example, a list of text samples).
model.quantize(calibration_dataset)
model.save("Llama-3.1-8B-GPTQ-Pro-4bit")
```

All model families from the GPTQModel foundation are supported, including:
- Qwen3 / Qwen3.5 / Qwen3.5-MoE (including multimodal vision)
- LLaMA 3.x / LLaMA 4
- Gemma 2 / Gemma 3 / Gemma 4
- Mistral / Mixtral
- Phi-3 / Phi-4
- DeepSeek-V2 / DeepSeek-V3
- OLMo / OLMoE
- And many others — see gptqmodel/models/definitions/
Primary development and validation targets:
- RTX 3090 (sm_86)
- RTX 3060 (sm_86)
- A100 (sm_80)
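A quick way to check whether a GPU falls inside the targeted architectures is to compare its compute capability against the sm_80/sm_86/sm_87 set named in this README. The helper names below are hypothetical and not part of GPTQ-Pro; this is only a sketch of such a check.

```python
from typing import Tuple

# Compute capabilities this README names as build and validation targets.
AMPERE_TARGETS = {(8, 0), (8, 6), (8, 7)}


def arch_name(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_ampere_target(capability: Tuple[int, int]) -> bool:
    """True if the capability is one of the targeted Ampere architectures."""
    return tuple(capability) in AMPERE_TARGETS


# Example: an RTX 3090 reports compute capability (8, 6).
rtx_3090 = (8, 6)
supported = is_ampere_target(rtx_3090)
label = arch_name(rtx_3090)
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed directly to these helpers.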
```bash
git clone https://github.com/groxaxo/GPTQ-Pro.git
cd GPTQ-Pro
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip wheel setuptools
pip install -e .
```

Massive credit to Qubitium and the ModelCloud team for building and maintaining GPT-QModel, and to the original GPTQ authors:
- Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh — GPTQ
- PanQiWei — AutoGPTQ (GPT-QModel's historical foundation)
- FXMarty — AutoGPTQ maintenance
- Qwopqwop200 — GPTQ-for-LLaMa
GPTQ-Pro is a fork, not a reinvention. The upstream authors did the hard foundational engineering. This branch strips the build down to a single kernel research track.