kindergartener language models

logo

Nano Banana prompt: "Logo for a repo with this README: ..."

this repo contains two packages:

  • pea -- a scalar-valued autograd engine with no dependencies
  • pealm -- kindergartener language models, from bigrams to gpt2, trained from scratch

pea

a scalar-valued autograd engine. no dependencies beyond the python standard library.

implements a Value class that tracks computation graphs and supports reverse-mode autodiff via topological sort. supports +, -, *, /, **, relu, exp, and log.
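
the core of reverse-mode autodiff is a topological sort of the computation graph followed by chain-rule accumulation in reverse order. a minimal sketch of that idea (not necessarily pea's exact code; the _prev and _backward attribute names follow the micrograd convention and are an assumption here):

def backward(output):
    # topologically order the graph so a node is appended only after
    # every node it depends on has been appended
    topo, visited = [], set()

    def build(node):
        if node not in visited:
            visited.add(node)
            for child in node._prev:   # assumed: the inputs of this node
                build(child)
            topo.append(node)

    build(output)
    output.grad = 1.0                  # seed: d(output)/d(output) = 1
    for node in reversed(topo):        # visit in reverse topological order
        node._backward()               # assumed: local chain-rule step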

also includes an nn module with Neuron, Layer, and MLP abstractions (a usage sketch follows the example below), plus softmax, cross_entropy_loss, and one_hot utilities, and a viz module for drawing computation graphs with graphviz.

from pea import Value

a = Value(2.0)
b = Value(3.0)
c = a * b + b
c.backward()
print(a.grad)  # dc/da = 3.0
print(b.grad)  # dc/db = 2.0 + 1.0 = 3.0
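
and a sketch of the nn module in a tiny training loop. the constructor and method names here (MLP(nin, [sizes]), model.parameters(), the .data/.grad attributes, the pea.nn import path) are assumptions in the micrograd style, so check the source for the real API:

from pea import Value
from pea.nn import MLP   # assumed import path

model = MLP(2, [4, 1])   # assumed signature: 2 inputs -> 4 hidden -> 1 output

xs = [[2.0, 3.0], [-1.0, 0.5]]
ys = [1.0, -1.0]

for step in range(20):
    # forward pass: sum of squared errors built from Value ops
    preds = [model(x) for x in xs]
    loss = Value(0.0)
    for pred, y in zip(preds, ys):
        loss = loss + (pred - Value(y)) ** 2

    # backward pass and a plain SGD update
    for p in model.parameters():
        p.grad = 0.0
    loss.backward()
    for p in model.parameters():
        p.data -= 0.05 * p.grad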

pealm

language models and a cli for training, sampling, evaluating, and managing checkpoints. built with pytorch.

models:

  • character-level bigram -- a single (vocab_size, vocab_size) embedding matrix (see the sketch after this list)
  • character-level transformer -- multi-head self-attention with feedforward layers
  • gpt2 (124M) -- full gpt2 architecture with support for loading openai pretrained weights
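
the bigram model is literally a lookup table: the embedding row for the current character is used directly as the logits over the next character. a rough pytorch sketch of that idea (pealm's forward signature may differ):

import torch.nn as nn
import torch.nn.functional as F

class CharBigram(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        # row i holds the next-character logits for character i
        self.embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.embedding(idx)            # (B, T, vocab_size)
        if targets is None:
            return logits, None
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss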

the pealm cli supports:

  • train -- train any of the above models, with ddp support via torchrun
  • sample -- generate text from a trained model or pretrained gpt2
  • eval -- evaluate on hellaswag
  • convert -- convert full checkpoints to weights-only
  • clean -- remove training artifacts

sampling with pealm

best weights-only checkpoints for all models are saved in the weights directory. you can sample from them using the pealm cli:

# character-level transformer trained on tinyshakespeare
pealm sample char-transformer --ckpt weights/char_transformer/2026-01-27_22-55-53/best.pt --tokenizer-dir weights/char_tokenizer/2026-01-27_22-55-53/ -n 10000

# pretrained gpt2 from openai
pealm sample gpt2-pretrained

# token-level gpt2 trained on tinyshakespeare
pealm sample gpt2 --ckpt weights/gpt2/tinyshakespeare/2026-02-10_18-35-18/best.pt

# token-level gpt2 trained on fineweb edu 10B
pealm sample gpt2 --ckpt weights/gpt2/fineweb_edu10B/2026-02-14_01-57-22/best.pt

distributed training

the train script supports distributed training with torchrun. you can specify the number of nodes and GPUs per node, and torchrun will handle the rest.

GPT2 ddp training

# prepare fineweb_edu10B (one-time)
uv run scripts/fineweb.py

# single node, 2 GPUs
torchrun --standalone --nproc_per_node=2 -m pealm train gpt2

# multi-node (on each node)
torchrun \
    --nnodes=2 \
    --nproc_per_node=4 \
    --node_rank=<0 or 1> \
    --master_addr=<master_ip> \
    --master_port=29500 \
    -m pealm train gpt2

# ddp run on 10B tokens of fineweb edu data, 8x RTX 5090, B=512, microbatch=32, T=1024
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m pealm train gpt2 --input data/fineweb_edu10B --micro-batch-size 32 --save-every 5000 --steps 38146 2>&1 | tee training.log

# monitor GPU utilization and memory usage during training
watch -n 1 'nvidia-smi \
  --query-gpu=index,utilization.gpu,memory.used,memory.total \
  --format=csv && \
  echo && \
  nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory \
  --format=csv'
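
the numbers in the 8x RTX 5090 command fit together as a quick back-of-the-envelope check (assuming B counts sequences per optimizer step and the rest of the batch is covered by gradient accumulation):

n_gpus, micro_batch, batch, seq_len = 8, 32, 512, 1024

grad_accum_steps = batch // (micro_batch * n_gpus)   # 512 / (32 * 8) = 2
tokens_per_step = batch * seq_len                    # 524,288 tokens per optimizer step
total_tokens = 38146 * tokens_per_step               # ~20.0B tokens
epochs = total_tokens / 10e9                         # ~2 passes over fineweb edu 10B

print(grad_accum_steps, tokens_per_step, f"{epochs:.2f} epochs")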

evaluating on hellaswag

# prepare hellaswag data (one-time)
uv run scripts/hellaswag.py

# evaluate pretrained gpt2 on hellaswag
pealm eval gpt2-pretrained

# evaluate custom gpt2 trained on fineweb edu 10B on hellaswag
pealm eval gpt2 --ckpt weights/gpt2/fineweb_edu10B/2026-02-14_01-57-22/best.pt

results

character-level bigram

CharBigram(
  (embedding): Embedding(65, 65)
)

loss curve

loss curve for the character-level bigram

losses

            train loss    val loss
before      4.7959        4.7898
2 epochs    2.4525        2.4879

samples

before training:

Aw'TnWDK&OZlAKmSp!-dY-y Z,kAFEArYIfO$$LCiA;N,o.V!ctazfB?SUIeYFo.OpZnkB&3BMwebcQN.qxYhfM,akHYZ3GMlSPr

after 2 epochs:

EEl wofavante lind s:
ANIOUr r VIDUESSow.
Jutod 'sothandora sdeneperen 't:
Yofth ll g's sed liset; i

character-level transformer

loss curve

Transformer loss curve

losses

                  val loss
before            4.3081
epoch 1 (best)    2.2701
epoch 5           3.3808

train set loss omitted (takes too long to evaluate).

samples

before training:

;wtEnWDo&OUlAicSp!-di!lAZycWFa;GYtfxf$LC!TTu,okI!ctqLfBlSsIeDFotkpRv&Bi3;Mihbrp&CsxTh MoEkT&j3GllSPrC'lljyoyHW
JqBjEUvAq?MbpComIy,BhM-SBglPHYSp!Ugko&
oSCdwuhpTEyWb-idcy$;lSJxBQmd;Ri?Pq&gynzQOUou;VD:yjWdKuu-,wFBuFusg shnAtlDK?yG,.IhOdMHioJU;XOyO:P3
DCXF&qgo&x;n?&r$veCJqU?Kgl-ILxPFcyoJPhW3WypXNX
BJPTfuUlZ.gfkuKAtYFqinGTX:vVir,&r?LZVt3H&$SvHRZJuwHXp'!qx;KN&h&T WdAEJ&DTT$
',:vYcRg&ClEMg', JEdXuGGri;;i'XQrcILQrFuCEVKXwpi&EmqIGdT-VLer!D?Ov3wOvwHrgj-h-F&PqzMi:Coq? lg-R3Ao'P KgJHicOIouVvogpo&JOY3?PguQkS

sampling at epoch 5.

token-level gpt2

uses the gpt2 tokenizer from tiktoken.
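
tokenization is the main difference from the character-level models: text is mapped to gpt2 BPE token ids rather than one id per character. for example:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("First Citizen:\nBefore we proceed any further, hear me speak.")
print(len(ids), ids[:8])   # a handful of token ids instead of ~60 character ids
print(enc.decode(ids))     # round-trips back to the original text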

tinyshakespeare

loss curve

Loss curve for GPT2 trained on tinyshakespeare

notice the loss spike at each epoch boundary; this is fairly common for the ddp setup this model was trained with.

results

sampling at epoch 2.

fineweb edu 10B

loss curve

Loss curve for GPT2 trained on fineweb edu 10B

losses

metric            value
min train loss    2.7484
min val loss      3.0218 (step 38000)

best model at step 36500 with val loss 3.0227.

hellaswag evals

model                       accuracy    normalized accuracy
pretrained gpt2 (openai)    0.2863      0.2951
gpt2 fineweb edu 10B        0.2929      0.3101

HellaSwag normalized accuracy comparison

results

sampling at epoch 2.

comparing to openai gpt2 and gpt3

both models share the same gpt2 architecture (12 layers, 12 heads, 768 embed dim, ~124M parameters). the pretrained model is openai's original gpt2 weights from huggingface; the custom models are trained from scratch on tinyshakespeare and fineweb edu 10B.
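
the ~124M figure follows directly from that config. a quick check, assuming the standard gpt2-small hyperparameters with the lm head tied to the token embedding:

n_layer, n_embd = 12, 768
vocab_size, block_size = 50257, 1024

wte = vocab_size * n_embd                  # token embedding (shared with the lm head)
wpe = block_size * n_embd                  # positional embedding
per_block = (
    (n_embd * 3 * n_embd + 3 * n_embd)     # qkv projection
    + (n_embd * n_embd + n_embd)           # attention output projection
    + (n_embd * 4 * n_embd + 4 * n_embd)   # mlp up-projection
    + (4 * n_embd * n_embd + n_embd)       # mlp down-projection
    + 4 * n_embd                           # two layernorms (weight + bias)
)
total = wte + wpe + n_layer * per_block + 2 * n_embd   # + final layernorm
print(f"{total:,}")   # 124,439,808 ~= 124M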

hellaswag evals

the custom gpt2 trained on fineweb edu 10B gains +1.5% normalized accuracy on hellaswag over the pretrained gpt2, and sits -2.69% behind gpt3, which was trained on 300B tokens with 175B parameters (compared to our 20B tokens and 124M parameters).
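
those deltas are plain differences of the normalized accuracies in the table above (the gpt3 reference score is only implied by the -2.69% gap, not measured here):

pretrained, ours = 0.2951, 0.3101       # acc_norm values from the table above
print(round(ours - pretrained, 4))      # 0.015  -> +1.5% over pretrained gpt2
print(round(ours + 0.0269, 4))          # 0.337  -> implied gpt3 acc_norm reference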

weight file sizes

model                       format                    size
pretrained gpt2 (openai)    safetensors (HF cache)    551 MB
gpt2 tinyshakespeare        pytorch (best.pt)         475 MB

sampling throughput

500 tokens generated per run, --no-stream, 5 runs each on an NVIDIA GeForce RTX 3070 Laptop GPU.

model                   run 1    run 2    run 3    run 4    run 5    avg (tokens/sec)
pretrained gpt2         81.33    80.44    81.13    81.70    80.42    81.00
gpt2 tinyshakespeare    83.54    83.50    83.40    84.44    84.72    83.92

sample quality

the pretrained model produces coherent, general-domain english (trained on WebText, ~40GB of internet text). the tinyshakespeare model produces shakespeare-flavored text with character names and verse-like structure, but with noticeably more grammatical errors and nonsensical fragments -- expected given it was trained on only ~1MB of shakespeare.

gpt2 optimization benchmarks

progressive optimization techniques applied to gpt2 training, showing throughput improvements (tokens/sec) across different GPU configurations.

RTX 3070 Mobile (B=2, T=1024)

RTX 3070 throughput comparison

key improvements: bf16 amp gives 2.1x speedup over baseline, flash attention reaches 3.5x speedup, fused adamw with weight decay hits 4.1x speedup.

A30 (B=8, T=1024)

A30 throughput comparison

key improvements: tf32 gives 2.8x speedup, torch.compile reaches 6x speedup, flash attention hits 7.8x speedup, fused adamw peaks at 9.3x speedup.
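
the techniques in both charts map to a handful of standard pytorch switches. a hedged sketch of how they are typically enabled (the model and shapes here are stand-ins, and pealm's flags may differ):

import torch
import torch.nn.functional as F

torch.set_float32_matmul_precision("high")     # tf32 matmuls ("tf32" bar)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(768, 768).to(device)   # stand-in for the gpt2 model
model = torch.compile(model)                   # kernel fusion ("torch.compile" bar)

# fused adamw with weight decay ("fused adamw" bar); fused=True requires cuda
opt = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1,
                        fused=(device == "cuda"))

x = torch.randn(8, 1024, 768, device=device)
with torch.autocast(device_type=device, dtype=torch.bfloat16):   # "bf16 amp" bar
    y = model(x)
    # flash attention: the fused scaled-dot-product kernel used inside attention blocks
    q = k = v = y.view(8, 1024, 12, 64).transpose(1, 2)
    attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)

attn.float().mean().backward()
opt.step()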

2x H100 DDP training

distributed data parallel training on 2x H100 GPUs (B=512, microbatch=64, T=1024). loss decreases from 11 to 6.5 over 3328 training steps, achieving 900k tokens/sec throughput.

8x RTX 5090 DDP training

1.48M tokens/sec throughput with 8x RTX 5090 GPUs (B=512, microbatch=32, T=1024).
