
Reka Quant

Reka Quant is a model quantization library. It supports:

  • NF4 and GGML (llama.cpp) quantization primitives. GGML primitives are bound directly from the GGML source code through Python cffi bindings, making it easy to incorporate new ones.
  • Export of GGML-quantized models to the native GGUF format, for easy integration with the existing ecosystem.
  • Activation-aware quantization that leverages precomputed activation statistics from a text sample, using the LDLQ method from QuIP (see the sketch after this list).
  • Further error reduction through self-distillation from the BF16 model while the network is quantized gradually.
  • Fast multi-node training through full or hybrid FSDP, as well as fast parallel proxy-Hessian computation for LDLQ.
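
At a high level, LDLQ rounds the columns of a weight matrix one at a time and feeds the rounding error of earlier columns back into later ones, using a factorization of the proxy Hessian. The NumPy sketch below illustrates the idea under simplifying assumptions: nearest rounding stands in for the NF4/GGML quantizers, and the Hessian is assumed positive definite (the --hessian_corr damping serves that purpose); the library's actual implementation differs in its details.

import numpy as np

def ldlq_quantize(W, H, quant=np.round):
    """Round W (m x n) column by column with LDLQ-style error feedback.

    H is the n x n proxy Hessian, e.g. E[x x^T] estimated from activation
    statistics. `quant` is the rounding primitive; np.round is a stand-in
    for the NF4/GGML quantizers used by the library.
    """
    n = H.shape[0]
    # Factor H = U D U^T with U unit upper triangular, via the Cholesky
    # factor of the row/column-reversed Hessian flipped back.
    L = np.linalg.cholesky(H[::-1, ::-1])
    U = L[::-1, ::-1]            # upper triangular, H = U @ U.T
    U = U / np.diag(U)           # normalize columns -> unit upper triangular
    A = U - np.eye(n)            # strictly upper triangular feedback matrix

    W_hat = np.zeros_like(W)
    for k in range(n):
        # Correct column k by the accumulated error of already-rounded columns.
        corr = (W[:, :k] - W_hat[:, :k]) @ A[:k, k]
        W_hat[:, k] = quant(W[:, k] + corr)
    return W_hat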

Installation

Clone the library with submodules:

git clone --recurse-submodules [email protected]:reka-ai/quantization.git

Install requirements:

poetry install

Build the shared library in csrc, which is needed for the Python bindings:

cd csrc
gcc -shared -o quantize.so -fPIC quantize.c
cd ..
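
The resulting quantize.so is then loaded from Python through cffi, roughly as in the sketch below. The exported symbol, its signature, and the block size are assumptions for illustration only; see csrc/quantize.c for the actual interface.

import numpy as np
from cffi import FFI

ffi = FFI()
# Hypothetical GGML-style row quantizer: packs `k` floats from `src` into
# the quantized block format at `dst`.
ffi.cdef("void quantize_row_q4_K(const float *src, void *dst, int64_t k);")
lib = ffi.dlopen("csrc/quantize.so")

row = np.random.randn(256).astype(np.float32)
dst = ffi.new("char[]", 144)  # one q4_K block (144 bytes per 256 weights in ggml)
lib.quantize_row_q4_K(ffi.cast("const float *", ffi.from_buffer(row)), dst, len(row))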

Exporting to GGUF format requires a patch to the llama.cpp library; apply it and build the library:

cd third_party/llama.cpp
git apply ../../patches/RekaQuant.patch
cmake -B build
cmake --build build --config Release
cd ../..

Usage

The main script is src/train.py. The training data should be a JSONL file with one document per line, stored in the "text" field.
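
For example, each line of train.jsonl is a JSON object of the following form (the contents here are placeholders):

{"text": "First training document ..."}
{"text": "Second training document ..."}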

torchrun \
    ...distributed flags... \
    src/train.py \
    --model_path $model_path \
    --ref_model $ref_model \
    --out_path $out_path \
    --train_data $train_data \
    --hessian_corr 1e-1 \
    --hessian_train_seq 4096   \
    --total_train_steps 1800 \
    --lr 1e-5 \
    --global_batch_size 512  \
    --seq_len 8192 \
    --micro_batch_size 1 \
    --checkpoint_iters 100 \
    --valid_seq 64 \
    --quant_strategy typewise_Q3_K_S \
    --use_checkpointing

An example slurm script can be found in run_train.slurm:

export REF_MODEL_PATH=/path/to/model
export OUT_PATH=/path/to/output
export TRAIN_DATA=/path/to/train.jsonl

sbatch run_train.slurm

When training smaller models, you can enable the --use_hybrid flag to use hybrid FSDP (shard within each node, replicate across nodes), which reduces inter-node communication and improves efficiency. You can also remove the --use_checkpointing flag to disable activation checkpointing.
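
For reference, the distinction maps onto PyTorch FSDP's sharding strategies roughly as in the sketch below; the actual wrapping in train.py (auto-wrap policies, mixed precision, and so on) is not shown.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_model(model, use_hybrid: bool):
    # HYBRID_SHARD shards parameters within each node and replicates them
    # across nodes, keeping parameter all-gathers node-local; FULL_SHARD
    # shards across all ranks for maximum memory savings at the cost of
    # more inter-node traffic.
    strategy = ShardingStrategy.HYBRID_SHARD if use_hybrid else ShardingStrategy.FULL_SHARD
    return FSDP(model, sharding_strategy=strategy, device_id=torch.cuda.current_device())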

Once the model is trained, if you used GGUF quants you will need to export it to a native GGUF file; see the scripts/prepare_ckpt.sh script for an example of how to do this.

cd scripts
bash prepare_ckpt.sh $OUT_PATH/iter_001800/ #GGUF ckpt saved under $OUT_PATH/iter_001800/hf_model/Q3_K_S_RekaQuant_hf

NOTE: GGML K-quants require tensors whose number of columns is divisible by 256. If needed, you can use the helper script scripts/pad_intermediate.py to preprocess models; the sketch below illustrates the basic padding operation.
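
The sketch zero-pads a weight's column dimension up to the next multiple of 256; the real script also has to keep adjacent layers consistent, and its interface may differ.

import torch
import torch.nn.functional as F

def pad_cols_to_multiple(weight: torch.Tensor, multiple: int = 256) -> torch.Tensor:
    # K-quant blocks span 256 input columns, so the last dimension is padded
    # with zeros up to the next multiple of 256.
    pad = (-weight.shape[-1]) % multiple
    return F.pad(weight, (0, pad)) if pad else weight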
