Process dataset

Preprocessing datasets for SFT or DPO.

Quickstart

Experiment Setting

Linux
Python 3.10

a. Create a conda virtual environment and activate it.

conda create -n LLM_data_processing python=3.10
conda activate LLM_data_processing

b. Clone this repository.

git clone git@bitbucket.org:ibricks-rnd/llm_data_preprocess.git
cd llm_data_preprocess

c. Install requirments.

pip install -r requirements.txt

Process instruction dataset

python main.py instruction=version_4.1 main.version=ver_4.1 main.process_type=instruction

Output data format is as follow

{
  "chat_template": List[
    {
      "content": str,
      "role": str  // One of 'system', 'user', 'assistant'
    }
    ...
  ],
  "source": str,
}

In the example:

"prompt" is a list containing dictionaries.
- "content" is a string.
- "role" is a string and can be one of 'system', 'user', or 'assistant'.
"source" is a string that indicates where the data comes from."

If you want to make data for dpo, set save_data_for_dpo as true Note that, this dpo dataset has only prompt and chosen(label). you should build reject yourself.

Process DPO dataset

python main.py dpo=version_2.0_wo_chat main.version=ver_2.0_wo_chat main.process_type=dpo

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
configs		configs
inst		inst
preference		preference
tools		tools
.gitignore		.gitignore
basemodel.py		basemodel.py
exp.py		exp.py
log.txt		log.txt
main.py		main.py
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Process dataset

Quickstart

Experiment Setting

Process instruction dataset

Process DPO dataset

About

Uh oh!

Releases

Packages

Uh oh!

Languages

yw0nam/process_llm_data

Folders and files

Latest commit

History

Repository files navigation

Process dataset

Quickstart

Experiment Setting

Process instruction dataset

Process DPO dataset

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages