P3-LLM: Efficient Mixed-Precision & Mixed-Format W4A8KV4 LLM Quantization

arXiv

Implementation of the mixed-precision and mixed-format W4A8KV4 quantization in P3-LLM. Our W4A8KV4 scheme outperforms state-of-the-art W4A8KV4 quantization algorithms such as QuaRot and QoQ. Furthermore, we quantize queries and attention scores to 8 bits using the FP8-E4M3 format and a custom unsigned FP8-E4M4 format, respectively. This allows both linear layers and attention modules to be accelerated by low-precision arithmetic units.
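As a rough illustration of the FP8-E4M3 query quantization, here is a minimal sketch (not the repository's actual code), assuming PyTorch >= 2.1 for the float8_e4m3fn dtype; the per-token scaling to the E4M3 maximum of 448 is an illustrative choice, and the custom unsigned FP8-E4M4 format for attention scores has no standard dtype and is not shown.

import torch

E4M3_MAX = 448.0  # largest representable magnitude in OCP FP8-E4M3

def fake_quant_query_e4m3(q: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize a query tensor (..., head_dim) to FP8-E4M3 with per-token scales."""
    scale = q.float().abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / E4M3_MAX
    q_fp8 = (q.float() / scale).to(torch.float8_e4m3fn)   # round to E4M3
    return (q_fp8.to(torch.float32) * scale).to(q.dtype)  # dequantize back ("fake" quantization)

# Example: batch of 2, 8 heads, 16 tokens, head_dim 64
q = torch.randn(2, 8, 16, 64, dtype=torch.float16)
q_deq = fake_quant_query_e4m3(q)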

1. Getting Started

Clone the repository and its 3rd-party submodules, including AWQ and LM-Evaluation-Harness.

git clone --recurse-submodules https://github.com/yc2367/P3-LLM.git

The quantization code base is inside the wkvaq_quant directory.

2. Obtaining an AWQ-quantized LLM

First, set up AWQ by following the instructions here.

Refer to wkvaq_quant/scripts/awq/run_awq.sh and perform AWQ weight quantization. At the top, change HOME_DIR to your AWQ directory. The string variable wq_dtype can be "int" or "bitmod", where the latter is a state-of-the-art 4-bit data type described in the BitMoD paper. The list variables w_bit_list and group_size_list contain the weight precisions and group sizes that you want to use; they are 4-bit and 128 by default.
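To make the w_bit and group_size parameters concrete, the sketch below shows plain group-wise asymmetric integer fake quantization of a weight matrix with 4-bit precision and group size 128. It is a hypothetical helper, not the repository's code: AWQ additionally searches for per-channel scaling factors before this step, and the "bitmod" data type is not shown.

import torch

def fake_quant_weight_int(w: torch.Tensor, w_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    """Asymmetric group-wise quantize-dequantize of a (out_features, in_features) weight."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"
    wg = w.reshape(out_f, in_f // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    qmax = 2 ** w_bit - 1                                   # 15 for 4-bit
    scale = (w_max - w_min).clamp(min=1e-6) / qmax
    zero = (-w_min / scale).round()
    q = ((wg / scale) + zero).round().clamp(0, qmax)        # integer codes
    return ((q - zero) * scale).reshape(out_f, in_f)        # dequantized weights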

After performing AWQ, you can run run_awq_save_4b_model.sh to evaluate the perplexity of the quantized 4-bit model and save it. At the top, change AWQ_DIR to the directory where you want to save the fake-quantized model.

3. Evaluate the mixed-precision LLM

Go to wkvaq_quant/scripts/test_ppl_template.sh to run the Wikitext-2 and C4 perplexity evaluation. Currently, only Llama and Mistral models are supported.
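For reference, the script evaluates perplexity in the usual sliding-chunk fashion for causal LMs. The sketch below is a generic illustration, not the repository's evaluation code; the model name and window length are placeholders, and the quantization hooks applied by the script are omitted.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Concatenate the Wikitext-2 test split and score it in fixed-length chunks
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seq_len, nlls = 2048, []
for i in range(0, ids.shape[1] - seq_len, seq_len):
    chunk = ids[:, i : i + seq_len].to(model.device)
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss * seq_len)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))
print(f"Wikitext-2 perplexity: {ppl.item():.2f}")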

Change the quantization parameters by referring to their definitions in wkvaq_quant/utils.py. Specifically:

  • --kv_quant_method: "KTVT" by default, which adopts per-token, per-head KV-cache quantization. It can also take "KCVT", which uses per-channel key quantization as described here; see the sketch after this list.
  • --kv_residual_len: Number of most recent tokens kept in FP16 during KV-cache quantization. By default, this is 1, i.e., essentially the entire KV-cache is quantized (only the most recent token stays in FP16). Setting a higher value results in better accuracy.
  • --apply_k_scale: If set, use our proposed dynamic per-channel key-cache smoothing.
  • --k_quant_post_rope: If set, quantize the key cache after RoPE; otherwise, quantize it before RoPE.
  • --p_bits: the precision of the attention scores.
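The sketch below contrasts the two key-cache quantization axes selected by --kv_quant_method, using a hypothetical helper rather than the repository's API: "KTVT" gives each token its own scale, while "KCVT" gives each channel its own scale. The dynamic key-cache smoothing enabled by --apply_k_scale and the FP16 residual tokens are not shown.

import torch

def fake_quant_int4(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Symmetric 4-bit quantize-dequantize along `dim` of a (heads, tokens, head_dim) cache."""
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-6) / 7.0  # int4 range [-8, 7]
    return (x / scale).round().clamp(-8, 7) * scale

k_cache = torch.randn(8, 512, 64)             # (heads, tokens, head_dim)
k_ktvt = fake_quant_int4(k_cache, dim=-1)     # per-token key quantization ("KTVT")
k_kcvt = fake_quant_int4(k_cache, dim=-2)     # per-channel key quantization ("KCVT")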
