MatGPTQ is a one-shot quantization technique that quantizes a model to multiple bit-widths, which can be served in different environments by leveraging custom kernel support.


MatGPTQ


MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization

Official implementation of MatGPTQ (Matryoshka GPTQ), a new post-training quantization (PTQ) pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, using only a small calibration set.

Abstract

Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served at multiple precisions by slicing the most significant bits (MSBs) at inference time. This enables a single checkpoint to cover a wide range of memory and latency budgets, but renders quantization much more challenging. In particular, the original MatQuant relies on expensive quantization-aware training (QAT) variants, rather than fast one-shot post-training quantization (PTQ), and lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and cross-bit error compensation, resulting in an algorithm that produces a multi-bit-width, "sliceable" model in a single pass. We also incorporate a new budget-aware search for heterogeneous per-layer bit-widths and provide efficient kernels that implement slicing and mixed-precision execution. Across standard LLMs and benchmarks, MatGPTQ preserves high-bit accuracy while substantially improving performance at low bit-widths. Overall, we establish a new state of the art for Matryoshka-style post-training quantization and make single-checkpoint, multi-precision deployment open and practical.
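The bit-slicing idea above can be illustrated with a small toy sketch (this is a conceptual example, not the repository's quantization code): a weight vector is quantized to 8-bit integer codes, and the same codes are served at 4 bits simply by dropping the four least significant bits, at the cost of a coarser effective step size.

```python
import numpy as np

def slice_to_bits(q, src_bits, dst_bits):
    """Keep only the most significant dst_bits of src_bits-wide codes."""
    return q >> (src_bits - dst_bits)

def dequantize(q, scale, bits):
    # symmetric toy scheme: codes are centered around the mid-point
    zero = (1 << bits) // 2
    return scale * (q.astype(np.float32) - zero)

rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32)

# 8-bit quantization of w
bits = 8
scale = np.abs(w).max() / (1 << (bits - 1))
q8 = np.clip(np.round(w / scale) + (1 << (bits - 1)), 0, 255).astype(np.uint8)

# the same codes served at 4 bits: drop the 4 LSBs, step size grows 16x
q4 = slice_to_bits(q8, 8, 4)
w8 = dequantize(q8, scale, 8)
w4 = dequantize(q4, scale * 16, 4)

# the sliced 4-bit view is coarser, so its reconstruction error is larger
print(np.abs(w - w8).max() <= np.abs(w - w4).max())
```

The point of MatGPTQ is to optimize the parent 8-bit codes so that such sliced low-bit views stay accurate, rather than quantizing each precision independently.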

Repository structure

  • scripts/ — contains bash scripts with the required arguments to run the method
  • src/ — directory for helper methods and utility functions
  • evo_quant_search.py — evolutionary quantization bitwidth allocation
  • quant.py — MatGPTQ/GPTQ quantization
  • lmeval.py — LM Eval Harness evaluation script
  • eval_ppl.py — perplexity evaluation script

Installation

Create a virtual environment and install dependencies (we recommend Python 3.12):

uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt

Note: The code has been tested with CUDA 12.4 and PyTorch 2.7.1.

Quantization

We provide quant.py for producing the MatGPTQ/GPTQ models. See scripts/run_gptq.sh or scripts/run_matgptq.sh for examples of how to run quantization:

bash scripts/run_matgptq.sh

Mix'n'Match

We provide evo_quant_search.py for producing the Mix'n'Match MatGPTQ models. See scripts/run_quant_search.sh for an example of how to run EvoPress for MatGPTQ:

bash scripts/run_quant_search.sh
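Conceptually, the search assigns each layer a bit-width from a discrete set so that a proxy error is minimized subject to a memory budget. A minimal (1+1)-style evolutionary sketch, where a synthetic per-layer sensitivity stands in for the real calibration loss (all names and the error model here are illustrative, not the repository's implementation):

```python
import random

BITS = [2, 4, 8]     # candidate per-layer bit-widths
N_LAYERS = 12
BUDGET = 5.0         # maximum average bits per layer

# synthetic sensitivity: a larger value means the layer suffers more at low bits
random.seed(0)
sensitivity = [random.uniform(0.5, 2.0) for _ in range(N_LAYERS)]

def error(cfg):
    # toy proxy loss: sensitivity shrinks exponentially with bit-width
    return sum(s / (2 ** b) for s, b in zip(sensitivity, cfg))

def mean_bits(cfg):
    return sum(cfg) / len(cfg)

def mutate(cfg):
    # flip one layer to a random candidate bit-width
    child = cfg[:]
    child[random.randrange(len(child))] = random.choice(BITS)
    return child

# start from a feasible uniform configuration
best = [4] * N_LAYERS
best_err = error(best)

for _ in range(2000):
    child = mutate(best)
    if mean_bits(child) <= BUDGET and error(child) < best_err:
        best, best_err = child, error(child)

print(mean_bits(best) <= BUDGET)          # stays within budget
print(best_err <= error([4] * N_LAYERS))  # never worse than uniform 4-bit
```

The real search operates on calibration-set loss and richer mutation/selection operators, but the structure is the same: mutate per-layer bit-widths, reject infeasible candidates, keep improvements.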

Evaluations

We provide the lmeval.py and eval_ppl.py scripts for evaluation on the Language Model Evaluation Harness benchmarks and for perplexity measurements. The interface of lmeval.py mostly follows the original harness. In addition, one should specify the path to the quantized weights via --quant_weights_path, and either the default uniform quantization bit-width --quant_uniform_bitwidth together with the master bit-width --quant_master_bitwidth, or a path to a .txt file with the chosen compression levels via --quant_non_uniform_config_path. Finally, --method selects whether to evaluate MatGPTQ or GPTQ.
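An evaluation invocation might look like the following sketch (illustrative only: the model name, task list, file paths, and method value are placeholders, and the --model/--model_args/--tasks flags are assumed to match the upstream LM Eval Harness interface):

```shell
python lmeval.py \
    --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B \
    --tasks arc_easy,hellaswag \
    --quant_weights_path ./quantized_weights \
    --quant_non_uniform_config_path ./configs/mixnmatch.txt \
    --method matgptq
```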

Deployment

Work In Progress

Citation

If you use MatGPTQ in your research, please cite:

@misc{kleinegger2026matgptqaccurateefficientposttraining,
      title={MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization}, 
      author={Maximilian Kleinegger and Elvir Crnčević and Dan Alistarh},
      year={2026},
      eprint={2602.03537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.03537}, 
}
