Implementation of mixed-precision W4A8KV4 quantization in P3-LLM. Our W4A8KV4 scheme outperforms state-of-the-art W4A8KV4 quantization algorithms such as QuaRot and QoQ. Furthermore, we perform 8-bit query and attention-score quantization using the FP8-E4M3 format and a custom unsigned FP8-E4M4 format, respectively. This allows both the linear layers and the attention modules to be accelerated by low-precision arithmetic units.
Clone the repository and its third-party submodules, including AWQ and LM-Evaluation-Harness.
git clone --recurse-submodules https://github.com/yc2367/P3-LLM.git
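If you already cloned the repository without the --recurse-submodules flag, you can fetch the submodules afterwards:
git submodule update --init --recursive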
The quantization code base is inside the wkvaq_quant directory.
First, set up AWQ by following the instructions here.
Refer to wkvaq_quant/scripts/awq/run_awq.sh and perform AWQ weight quantization. At the top of the script, change HOME_DIR to your AWQ directory. The string variable wq_dtype can be "int" or "bitmod", where the latter is a state-of-the-art 4-bit data type described in the BitMoD paper. The list variables w_bit_list and group_size_list contain the weight precisions and group sizes you want to use; they default to 4-bit and 128, respectively.
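For reference, a minimal sketch of what these variables might look like at the top of run_awq.sh; the path is a placeholder and the exact syntax inside the script may differ:
# Hypothetical values -- edit to match your setup
HOME_DIR=/path/to/awq            # your AWQ directory
wq_dtype="bitmod"                # weight data type: "int" or "bitmod"
w_bit_list=(4)                   # weight precision(s) to quantize with
group_size_list=(128)            # quantization group size(s)
After editing, launch the script (e.g., bash wkvaq_quant/scripts/awq/run_awq.sh).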
After performing AWQ, you can run run_awq_save_4b_model.sh to evaluate the perplexity of the quantized 4-bit model and save it. At the top of the script, change AWQ_DIR to the directory where you want to save the fake-quantized model.
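Similarly, a hypothetical sketch of the save step; the destination path is a placeholder and the script location is assumed to be under wkvaq_quant/scripts/awq:
# AWQ_DIR is set at the top of run_awq_save_4b_model.sh, e.g.:
AWQ_DIR=/path/to/fake_quant_models   # destination for the fake-quantized 4-bit model
# then launch the script:
bash wkvaq_quant/scripts/awq/run_awq_save_4b_model.sh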
Go to wkvaq_quant/scripts/test_ppl_template.sh and run the Wikitext-2 and C4 perplexity evaluations. Currently, we only support Llama and Mistral models.
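For example, you might copy the template and adapt it to your model; this copy-and-edit workflow is an assumption based on the file being a template, so check the script itself:
cp wkvaq_quant/scripts/test_ppl_template.sh wkvaq_quant/scripts/test_ppl_llama2.sh
# edit the copy to point at your model and quantization settings, then:
bash wkvaq_quant/scripts/test_ppl_llama2.sh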
Change the quantization parameters by referring to their definitions in wkvaq_quant/utils.py. Specifically:
--kv_quant_method: "KTVT" by default, which adopts per-token head KV-cache quantization. It can also take "KCVT", which uses per-channel key quantization as described here.
--kv_residual_len: Number of most recent tokens that are kept in FP16 during KV-cache quantization. By default, this is 1, i.e., all of the KV cache is quantized. Setting a higher value results in better accuracy.
--apply_k_scale: If set, use our proposed dynamic per-channel key-cache smoothing.
--k_quant_post_rope: If set, quantize the key cache after RoPE; otherwise, quantize the key cache before RoPE.
--p_bits: The precision of the attention scores.
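As an illustration, the flags above might be combined as follows in the evaluation command of test_ppl_template.sh. The variable name EXTRA_ARGS is only illustrative, and the values simply mirror the defaults described above (with --p_bits 8 matching the 8-bit attention-score quantization), not the script's actual contents:
# Hypothetical flag combination for the perplexity evaluation command
EXTRA_ARGS="--kv_quant_method KTVT \
            --kv_residual_len 1 \
            --apply_k_scale \
            --k_quant_post_rope \
            --p_bits 8"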