Most of the code is borrowed from the PEFT docs; the bitsandbytes docs cover the basic background. I only changed hyper-parameters such as the batch size.
- Log in to the Hugging Face CLI (you need to generate an access token; just follow the prompts):
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
- Install the required libraries:
```bash
pip install -U bitsandbytes accelerate transformers peft trl
```
  These are the library versions I used:
```
accelerate-1.6.0
datasets-3.5.0
peft-0.15.2
pyarrow-19.0.1
requests-2.32.3
tokenizers-0.21.1
transformers-4.51.3
trl-0.17.0
```
- If you use conda, run `conda env create -f environment.yml -n peft` instead.
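To double-check the environment, here is a quick sanity check (a minimal sketch; it only assumes the packages above are importable):

```python
# Print the installed versions of the main dependencies.
import accelerate, datasets, peft, transformers, trl

for pkg in (accelerate, datasets, peft, transformers, trl):
    print(f"{pkg.__name__}: {pkg.__version__}")
```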
- Run the scripts. Change the CUDA device list and the model parameter size (in billions) as needed:
```bash
for param in 7 13; do bash script/single.sh 0, $param; done
for param in 7 13; do bash script/ddp_qlora.sh 0,1 $param; done
for param in 7 13 30 65; do bash script/fsdp_qlora.sh 0,1 $param; done
```
- Summarize the training latency like the examples below (one way to time a run is sketched after this list):
- 7b: 10 sec
- 13b: 20 sec
- 33b: 30 sec
- 65b: 60 sec
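One way to get these numbers (a sketch, not the repo's actual logging): wrap the training call and report its wall-clock time. If `train.py` uses a Hugging Face `Trainer`/`SFTTrainer`, the metrics returned by `trainer.train()` should also already include `train_runtime`.

```python
import time

def timed(fn):
    """Run fn() and report its wall-clock latency in seconds."""
    start = time.perf_counter()
    out = fn()
    print(f"latency: {time.perf_counter() - start:.1f} sec")
    return out

# In train.py this would wrap the training call, e.g. timed(trainer.train),
# where `trainer` is the (assumed) Trainer / SFTTrainer instance.
timed(lambda: sum(range(10_000_000)))  # placeholder workload so the sketch runs standalone
```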
- Uncomment L#171 of `train.py`.
- Run:
```bash
./ddp_fsdp_qlora.sh 0,1,2,3 7
```
Note:
- It will run two main processes: one on GPU group 1 (GPUs 0,1) and one on GPU group 2 (GPUs 2,3).
- It trains a LoRA adapter on top of the frozen quantized backbone (see the sketch after this list). You can look at the `script/` folder and easily switch the backbone from quantized to FP16.
- The current DDP+FSDP implementation is not perfect: logging and checkpoint saving are performed multiple times, once per process (a rank-0 guard like the one sketched below would avoid this).
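For reference, here is a minimal sketch of the QLoRA setup described above. This is not the repo's `train.py`; the model name and LoRA hyper-parameters are illustrative assumptions. The backbone is loaded in 4-bit and frozen, and only the LoRA adapter weights are trained.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "huggyllama/llama-7b"  # assumed 7B backbone; the scripts select the size

# 4-bit NF4 quantization config for the frozen backbone (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
# FP16 backbone instead: drop quantization_config and pass torch_dtype=torch.float16.

model = prepare_model_for_kbit_training(model)  # enable gradient checkpointing / input grads
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # illustrative LoRA hyper-parameters
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)       # only the LoRA weights are trainable
model.print_trainable_parameters()
```

And a sketch of a rank-0 guard that would avoid the duplicated logging / checkpoint saving (an illustration of the fix, not something the scripts currently do):

```python
import torch.distributed as dist

def is_main_process() -> bool:
    """True when not running distributed, or when this process is rank 0."""
    return (not dist.is_initialized()) or dist.get_rank() == 0

if is_main_process():
    print("log and save checkpoints from rank 0 only")
    # model.save_pretrained("outputs/adapter")  # e.g. save the LoRA adapter once
```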