Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 18 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,19 @@ We will go through the example based on `Qwen/Qwen2.5-14B-Instruct-1M`:

### Rejection Sampling

This step aims to generate SFT data for later use.
Note that we already have pre-generated datasets:

* [`Qwen2.5-14B-Instruct-1M`](https://huggingface.co/datasets/purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k) via best of 8
* [`Qwen2.5-32B-Instruct`](https://huggingface.co/datasets/purpcode/ctxdistill-verified-Qwen2.5-32B-Instruct-55k) via best of 4
Comment on lines +68 to +69
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For clarity, it would be better if the link text for the pre-generated datasets matched the actual dataset names. The current text refers to the base model names, which might be confusing as the dataset names in the URLs are different.

Suggested change
* [`Qwen2.5-14B-Instruct-1M`](https://huggingface.co/datasets/purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k) via best of 8
* [`Qwen2.5-32B-Instruct`](https://huggingface.co/datasets/purpcode/ctxdistill-verified-Qwen2.5-32B-Instruct-55k) via best of 4
* [`purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k`](https://huggingface.co/datasets/purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k) via best of 8
* [`purpcode/ctxdistill-verified-Qwen2.5-32B-Instruct-55k`](https://huggingface.co/datasets/purpcode/ctxdistill-verified-Qwen2.5-32B-Instruct-55k) via best of 4


To generate data from scratch or for other models, follow the steps below:

<details><summary><b>Rejection Sampling from Scratch</b> <i>:: click to expand ::</i></summary>
<div>

The instructions are exemplified for `Qwen/Qwen2.5-14B-Instruct-1M`. Please change the model names and the later SFT script accordingly for other models.

```bash
# --- TMUX SESSION "sgl" ---
conda create -n sgl python=3.12 -y
Expand Down Expand Up @@ -102,6 +115,10 @@ python datagen/ctxdistill/post.py --generation-path Qwen2.5-14B-Instruct-1M.dist
# ----------------------------
```

</div>
</details>


### Running SFT

```bash
Expand All @@ -116,7 +133,7 @@ pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'

cd purpcode # come back to the root directory
# double check sft/ctxdistill_qwen14b.yaml to make sure the paths are aligned well
axolotl train sft/ctxdistill_qwen14b.yaml --deepspeed deepspeed_configs/zero3.json
axolotl train sft/ctxdistill_qwen14b.yaml --deepspeed deepspeed_configs/zero3.json # default to pre-generated datasets
# --> outputs/purpcode-14b-ctxdistill
```

Expand Down
4 changes: 2 additions & 2 deletions eval/compile_xscode/cwe2ovrf.py
Original file line number Diff line number Diff line change
Expand Up @@ -224,7 +224,7 @@ def load_codeguru_vulnerabilities(file_path):
return vulnerabilities


def create_codeguru_information(dataset_path: str = "purpcorn/codeguru-rules"):
def create_codeguru_information(dataset_path: str = "purpcode/codeguru-rules"):
collection = {}
ds = load_dataset(dataset_path, split="scraped")

Expand Down Expand Up @@ -574,7 +574,7 @@ def cwe2ovrf_main(

collection = create_cwe_information()
if vuln_rules_type == "codeguru":
collection = create_codeguru_information("purpcorn/codeguru-rules")
collection = create_codeguru_information("purpcode/codeguru-rules")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The argument "purpcode/codeguru-rules" passed to create_codeguru_information is now identical to the function's default dataset_path value. You can simplify this call by removing the redundant argument.

Suggested change
collection = create_codeguru_information("purpcode/codeguru-rules")
collection = create_codeguru_information()


if os.path.exists(init_filepath):
cprint(f"Found existing init messages at {init_filepath}", "yellow")
Expand Down
2 changes: 1 addition & 1 deletion sft/ctxdistill_qwen14b.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ bf16: true

# dataset
datasets:
- path: purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M.jsonl
- path: purpcode/ctxdistill-verified-Qwen2.5-14B-Instruct-1M-57k
type: chat_template
field_messages: messages
message_field_training: train
Expand Down
2 changes: 1 addition & 1 deletion sft/ctxdistill_qwen32b.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ bf16: true

# dataset
datasets:
- path: purpcode/ctxdistill-verified-Qwen-2.5-32B-Instruct.jsonl
- path: purpcode/ctxdistill-verified-Qwen2.5-32B-Instruct-55k
type: chat_template
field_messages: messages
message_field_training: train
Expand Down