Merged
35 changes: 35 additions & 0 deletions .github/workflows/python.yml
@@ -0,0 +1,35 @@
name: Python

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ${{ matrix.os }}

    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        python-version: [3.11, 3.12]

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run tests
        run: pytest
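A note on the matrix: GitHub Actions runs one job per combination of the matrix axes, so this workflow produces 3 × 2 = 6 jobs. The sketch below enumerates them; the `expand_matrix` helper and the printed labels are illustrative and not part of this change. The versions are written as strings here because an unquoted YAML value such as `3.10` would be parsed as the number `3.1`, which does not affect `3.11` or `3.12` but is worth keeping in mind when extending the list.

```python
from itertools import product

# Values copied from the workflow's strategy.matrix block.
matrix = {
    "os": ["ubuntu-latest", "windows-latest", "macos-latest"],
    "python-version": ["3.11", "3.12"],
}


def expand_matrix(matrix):
    """Return one dict per job: the Cartesian product of all matrix axes."""
    keys = list(matrix)
    combos = product(*(matrix[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


jobs = expand_matrix(matrix)
for job in jobs:
    # Illustrative label only; the real job names are generated by GitHub.
    print(f"test ({job['os']}, {job['python-version']})")
```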
1 change: 1 addition & 0 deletions README.md
@@ -2,6 +2,7 @@

[![GitBook](https://img.shields.io/badge/GitBook-从零入门大模型-blue)](https://jingedawang.gitbook.io/tutorialllm)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jingedawang/TutorialLLM/blob/main/TutorialLLM.ipynb)
[![Python](https://github.com/jingedawang/TutorialLLM/actions/workflows/python.yml/badge.svg)](https://github.com/jingedawang/TutorialLLM/actions/workflows/python.yml)

This tutorial will guide you through the process of training a large language model (LLM) step by step. For educational purposes, we chose a small dataset and a small model, but the basic principles we want to convey are the same as for larger models.
1 change: 1 addition & 0 deletions requirements.txt
@@ -1,2 +1,3 @@
numpy
pytest
torch
175 changes: 175 additions & 0 deletions tests/test_data.json
@@ -0,0 +1,175 @@
[
{
"title": "帝京篇十首 一",
"author": "太宗皇帝",
"paragraphs": [
"秦川雄帝宅,函谷壯皇居。",
"綺殿千尋起,離宮百雉餘。",
"連甍遙接漢,飛觀迥凌虛。",
"雲日隱層闕,風煙出綺疎。"
]
},
{
"title": "帝京篇十首 二",
"author": "太宗皇帝",
"paragraphs": [
"巖廊罷機務,崇文聊駐輦。",
"玉匣啓龍圖,金繩披鳳篆。",
"韋編斷仍續,縹帙舒還卷。",
"對此乃淹留,欹案觀墳典。"
]
},
{
"title": "帝京篇十首 三",
"author": "太宗皇帝",
"paragraphs": [
"移步出詞林,停輿欣武宴。",
"琱弓寫明月,駿馬疑流電。",
"驚雁落虛弦,啼猿悲急箭。",
"閱賞誠多美,於茲乃忘倦。"
]
},
{
"title": "寒雨朝行視園樹",
"author": "杜甫",
"paragraphs": [
"柴門雜樹向千株,丹橘黃甘此地無。",
"江上今朝寒雨歇,籬中秀色畫屏紆。",
"桃蹊李徑年雖故,梔子紅椒豔復殊。",
"鏁石藤稍元自落,倚天松骨見來枯。",
"林香出實垂將盡,葉蔕辭枝不重蘇。",
"愛日恩光蒙借貸,清霜殺氣得憂虞。",
"衰顏更覓藜床坐,緩步仍須竹杖扶。",
"散騎未知雲閣處,啼猨僻在楚山隅。"
]
},
{
"title": "帝京篇十首 五",
"author": "太宗皇帝",
"paragraphs": [
"芳辰追逸趣,禁苑信多奇。",
"橋形通漢上,峰勢接雲危。",
"煙霞交隱映,花鳥自參差。",
"何如肆轍跡?萬里賞瑤池。"
]
},
{
"title": "帝京篇十首 六",
"author": "太宗皇帝",
"paragraphs": [
"飛蓋去芳園,蘭橈遊翠渚。",
"萍間日彩亂,荷處香風舉。",
"桂楫滿中川,弦歌振長嶼。",
"豈必汾河曲,方爲歡宴所。"
]
},
{
"title": "帝京篇十首 七",
"author": "太宗皇帝",
"paragraphs": [
"落日雙闕昏,回輿九重暮。",
"長煙散初碧,皎月澄輕素。",
"搴幌翫琴書,開軒引雲霧。",
"斜漢耿層閣,清風搖玉樹。"
]
},
{
"title": "帝京篇十首 八",
"author": "太宗皇帝",
"paragraphs": [
"歡樂難再逢,芳辰良可惜。",
"玉酒泛雲罍,蘭殽陳綺席。",
"千鍾合堯禹,百獸諧金石。",
"得志重寸陰,忘懷輕尺璧。"
]
},
{
"title": "帝京篇十首 九",
"author": "太宗皇帝",
"paragraphs": [
"建章歡賞夕,二八盡妖妍。",
"羅綺昭陽殿,芬芳玳瑁筵。",
"珮移星正動,扇掩月初圓。",
"無勞上懸圃,即此對神仙。"
]
},
{
"title": "相和歌辭 從軍行",
"author": "張祜",
"paragraphs": [
"少年金紫就光輝,直指邊城虎翼飛。",
"一卷旌收千騎虜,萬全身出百重圍。",
"黃雲斷塞尋鷹去,白草連天射雁歸。",
"白首漢廷刀筆吏,丈夫功業本相依。"
]
},
{
"title": "贈別上元主簿張著",
"author": "韓翃",
"paragraphs": [
"上書一見平津侯,劒笏斜齊秣陵尉。",
"朝垂綬帶迎遠客,暮鎖印囊飛上吏。",
"長樂花深萬井時,同官無事有歸期。",
"回船對酒三生渚,繫馬焚香五願祠。",
"日日澄江帶山翠,綠芳都在經過地。",
"行人看射領軍堂,遊女題詩光宅寺。",
"風流才調愛君偏,此別相逢定幾年。",
"惆悵浮雲迷遠道,張侯樓上月娟娟。"
]
},
{
"title": "別氾水縣尉",
"author": "韓翃",
"paragraphs": [
"未央宮殿金開鑰,詔引賢良卷珠箔。",
"花間賜食近丹墀,烟裏揮毫對青閣。",
"萬年枝影轉斜光,三道先成君激昂。",
"谷水直言身不顧,郄詵高第轉名香。",
"綠槐陰陰出關道,上有蟬聲下秋草。",
"奴子平頭駿馬肥,少年白皙登王畿。",
"五侯客舍偏留宿,一縣人家爭看歸。",
"南向千峯北臨水,佳期賞地應窮此。",
"賦詩或送鄭行人,舉酒常陪魏公子。",
"自憐寂寞會君稀,猶著前時博士衣。",
"我欲低眉問知己,若將無用廢東歸。"
]
},
{
"title": "送從翁赴任長子縣令",
"author": "權德輿",
"paragraphs": [
"家風本鉅儒,吏職化雙鳧。",
"啓事才方愜,臨人政自殊。",
"地雄韓上黨,秩比魯中都。",
"拜首春郊夕,離杯莫向隅。"
]
},
{
"title": "送從弟廣東歸絕句",
"author": "權德輿",
"paragraphs": [
"夏雲如火鑠晨輝,欵段羸車整素衣。",
"知爾業成還出谷,今朝莫愴斷行飛。"
]
},
{
"title": "送王鍊師赴王屋洞",
"author": "權德輿",
"paragraphs": [
"稔歲在芝田,歸程入洞天。",
"白雲辭上國,青鳥會羣仙。",
"自以碁銷日,寧資藥駐年。",
"相看話離合,風馭忽泠然。"
]
},
{
"title": "送薛溫州",
"author": "權德輿",
"paragraphs": [
"昨日饋連營,今來刺列城。",
"方期建禮直,忽訪永嘉程。",
"郡內裁詩暇,樓中遲客情。",
"憑君減千騎,莫遣海鷗驚。"
]
}
]
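Each record in this fixture has the same three fields: `title`, `author`, and a `paragraphs` list of strings. A minimal sketch of a structural check for data of this shape is given below; `validate_poems` and `REQUIRED_KEYS` are hypothetical names, and nothing in this change indicates that the project's `Dataset` class performs such validation.

```python
import json

# Field names taken from the records in tests/test_data.json.
REQUIRED_KEYS = {"title", "author", "paragraphs"}


def validate_poems(records):
    """Raise ValueError on the first malformed record; return the record count."""
    if not isinstance(records, list):
        raise ValueError("top-level JSON value must be a list")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not an object")
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {index} is missing keys: {sorted(missing)}")
        paragraphs = record["paragraphs"]
        if not isinstance(paragraphs, list) or not paragraphs:
            raise ValueError(f"record {index} has no paragraphs")
        if not all(isinstance(p, str) and p for p in paragraphs):
            raise ValueError(f"record {index} has empty or non-string paragraphs")
    return len(records)


# A one-record sample in the same shape as the fixture.
sample = json.loads("""
[
  {
    "title": "送薛溫州",
    "author": "權德輿",
    "paragraphs": ["昨日饋連營,今來刺列城。", "方期建禮直,忽訪永嘉程。"]
  }
]
""")
count = validate_poems(sample)
print(count)
```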
46 changes: 46 additions & 0 deletions tests/test_run.py
@@ -0,0 +1,46 @@
import os
import sys
import torch

# Make the repository root importable so the project modules can be found.
sys.path.append(os.path.join(os.path.dirname(__file__), '..'))

from dataset import Dataset
from evaluator import Evaluator
from model import TutorialLLM
from trainer import Trainer

def test_run():
    """
    Test that the overall pipeline runs without error.
    """
    batch_size = 1
    max_length = 32
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    torch.manual_seed(2024)
    dataset = Dataset('data.json', batch_size, max_length, device)
    # Keep only the first 10 items of each split so the test finishes quickly.
    dataset.finetune_train_data = dataset.finetune_train_data[:10]
    dataset.finetune_evaluate_data = dataset.finetune_evaluate_data[:10]
    dataset.alignment_train_data = dataset.alignment_train_data[:10]
    dataset.alignment_evaluate_data = dataset.alignment_evaluate_data[:10]

    # Use a tiny model: the test checks that the pipeline runs, not its quality.
    dim_embedding = 8
    num_head = 1
    num_layer = 2
    model = TutorialLLM(dataset.vocabulary_size, dim_embedding, max_length, num_head, num_layer, device)
    model.train()
    model.to(device)
    iterations_to_evaluate_pretrain = 10
    interval_to_evaluate_pretrain = 10
    interval_to_evaluate_finetune = 10
    interval_to_evaluate_alignment = 10
    evaluator = Evaluator(dataset, device, iterations_to_evaluate_pretrain, interval_to_evaluate_pretrain, interval_to_evaluate_finetune, interval_to_evaluate_alignment)
    trainer = Trainer(model, dataset, evaluator, device)

    iterations_for_pretrain = 10
    trainer.pretrain(iterations_for_pretrain)

    epochs_for_finetune = 1
    trainer.finetune(epochs_for_finetune)

    epochs_for_alignment = 1
    trainer.align(epochs_for_alignment)
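The test above loads `'data.json'` by a relative path, so it depends on the directory from which `pytest` is started. This change does not show how `tests/test_data.json` is consumed. If a test were to load that smaller fixture, one common approach is to resolve the path relative to the test file rather than the working directory. The sketch below illustrates the idea; `fixture_path` is a hypothetical helper, not part of this change.

```python
from pathlib import Path


def fixture_path(name, base=None):
    """Resolve a fixture file name against a base directory.

    Hypothetical helper: when no base is given, the directory containing
    this file is used, so the result does not depend on the working directory.
    """
    base_dir = Path(base) if base is not None else Path(__file__).resolve().parent
    return base_dir / name


# With an explicit base, the result is simply base / name.
path = fixture_path("test_data.json", base="tests")
print(path)
```

Inside a test module, calling `fixture_path("test_data.json")` with no `base` would point at the file next to the test itself.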