Preprocessing datasets for SFT or DPO.
- Linux
- Python 3.10
a. Create a conda virtual environment and activate it.
conda create -n LLM_data_processing python=3.10
conda activate LLM_data_processing
b. Clone this repository.
git clone git@bitbucket.org:ibricks-rnd/llm_data_preprocess.git
cd llm_data_preprocess
c. Install requirements.
pip install -r requirements.txt
To preprocess instruction data, run:
python main.py instruction=version_4.1 main.version=ver_4.1 main.process_type=instruction
The output data format is as follows:
{
    "chat_template": List[
        {
            "content": str,
            "role": str  // One of 'system', 'user', 'assistant'
        }
        ...
    ],
    "source": str,
}
In the example:
- "chat_template" is a list containing dictionaries.
- "content" is a string.
- "role" is a string and can be one of 'system', 'user', or 'assistant'.
- "source" is a string that indicates where the data comes from.
To generate data for DPO, set save_data_for_dpo to true. Note that this DPO dataset contains only the prompt and the chosen response (label); you must build the rejected responses yourself.
python main.py dpo=version_2.0_wo_chat main.version=ver_2.0_wo_chat main.process_type=dpo
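Because the DPO output has no rejected responses, they have to be attached in a separate step. The following is a rough sketch of one way to do that, not the repository's method. The key names "prompt", "chosen", and "rejected", the function names add_rejected and placeholder_rejected, and the sample record are all assumptions made for illustration; check the actual output file for the real key names and value types.

```python
from typing import Any, Callable


def add_rejected(
    records: list[dict[str, Any]],
    make_rejected: Callable[[dict[str, Any]], str],
) -> list[dict[str, Any]]:
    """Return copies of the records with a "rejected" field filled in."""
    completed = []
    for record in records:
        new_record = dict(record)  # shallow copy; the input record is not modified
        new_record["rejected"] = make_rejected(record)
        completed.append(new_record)
    return completed


def placeholder_rejected(record: dict[str, Any]) -> str:
    """Stand-in generator; replace with a real source of rejected responses."""
    return "I don't know."


# Hypothetical record; the key names are assumed, not taken from the repository.
records = [{"prompt": "What is SFT?", "chosen": "SFT stands for supervised fine-tuning."}]
dpo_ready = add_rejected(records, placeholder_rejected)
print(dpo_ready[0]["rejected"])  # prints I don't know.
```

In practice, make_rejected could wrap whatever produces a less preferred answer for the same prompt; the callable is passed in so that choice stays separate from the bookkeeping.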