A minimal GPT-2 inference implementation using NumPy only.
No PyTorch.
No autograd.
No training loop.
Just matrices, attention, and pretrained weights.
This project implements GPT-2 (small, 117M) inference from scratch using NumPy only.
The goal is not performance.
The goal is understanding.
Every major component of GPT-2 is explicitly implemented (a NumPy sketch of the attention core follows this list):
- token & positional embeddings
- causal multi-head self-attention
- MLP (feed-forward network)
- residual connections
- layer normalization
- autoregressive text generation
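To make "explicitly implemented" concrete, here is a minimal NumPy sketch of the central component, causal multi-head self-attention. The function and parameter names (`causal_self_attention`, `w_qkv`, `w_out`) are illustrative, not the repo's actual API; the real version lives in attention.py.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, w_qkv, b_qkv, w_out, b_out, n_heads=12):
    """x: (seq_len, d_model); w_qkv: (d_model, 3*d_model); w_out: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project to queries, keys, values in one matmul, then split
    q, k, v = np.split(x @ w_qkv + b_qkv, 3, axis=-1)             # each (seq_len, d_model)
    # Split d_model into heads: (n_heads, seq_len, d_head)
    q, k, v = (t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2) for t in (q, k, v))

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)           # (n_heads, seq, seq)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -1e9, scores)                         # causal mask: no attending to the future

    out = softmax(scores) @ v                                     # (n_heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)        # merge heads back together
    return out @ w_out + b_out
```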
The model loads official pretrained weights from Hugging Face (safetensors)
and runs inference without any deep learning framework.
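For reference, a weight loader along these lines can be sketched with `huggingface_hub` and `safetensors`. The helper name and the tensor-key comment are assumptions for illustration; the actual loading logic lives in load.py.

```python
from huggingface_hub import hf_hub_download
from safetensors.numpy import load_file

def load_gpt2_weights(repo_id: str = "gpt2") -> dict:
    """Download the official GPT-2 safetensors checkpoint and return NumPy arrays."""
    path = hf_hub_download(repo_id=repo_id, filename="model.safetensors")
    return load_file(path)    # dict: tensor name -> np.ndarray

# weights = load_gpt2_weights()
# wte = weights["wte.weight"]   # token embeddings, expected shape (50257, 768)
```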
- ❌ Not a training implementation
- ❌ Not optimized (no KV cache, O(N²) generation)
- ❌ Not meant for production
- ❌ Not a PyTorch reimplementation in disguise
This is an educational, mechanical view of what a large language model actually does.
Large Language Models often appear “magical”.
This repository removes that illusion.
What remains is:
- linear algebra
- attention
- statistical coherence
There is no reasoning engine here.
No symbols.
No facts.
And yet, semantic consistency emerges.
That tension is the point.
./.venv/bin/python main.py \
  "The animal is yellow. It's a cat. It's color is yellow. The color of the cat"

The model correctly completes the sentence by inferring that the cat is yellow.
There is no logic rule enforcing this. Only attention maintaining semantic invariants across the sequence.
Text generation is autoregressive and intentionally naïve (sketched in code after this list):
- Tokenize the prompt
- Run a full forward pass over the sequence
- Take the most probable next token (greedy sampling)
- Append it
- Repeat
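A minimal sketch of that loop, assuming a `model(token_ids) -> logits` forward pass and a tokenizer with list-based `encode`/`decode` (both assumptions; the repo's actual interfaces live in model.py, tokenizer.py, and generate.py):

```python
import numpy as np

def generate_greedy(model, tokenizer, prompt: str, max_new_tokens: int = 20) -> str:
    token_ids = list(tokenizer.encode(prompt))       # 1. tokenize the prompt
    for _ in range(max_new_tokens):
        logits = model(np.array(token_ids))          # 2. full forward pass -> (seq_len, vocab_size)
        next_id = int(np.argmax(logits[-1]))         # 3. greedy: pick the most probable next token
        token_ids.append(next_id)                    # 4. append it ...
    return tokenizer.decode(token_ids)               # 5. ... and repeat until the budget is spent
```

Greedy decoding keeps the output deterministic, which makes it easier to inspect what the model produces at each step.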
GPT-2 small (mirrored in the config sketch after this list):
- Vocabulary size: 50,257
- Context window: 1024 tokens
- Embedding dimension: 768
- Layers: 12
- Attention heads: 12
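Expressed as a plain config object for reference (a sketch; the field names here are illustrative and the repo keeps its own version in config.py):

```python
from dataclasses import dataclass

@dataclass
class GPT2Config:
    vocab_size: int = 50257   # BPE vocabulary
    n_ctx: int = 1024         # maximum context window (tokens)
    d_model: int = 768        # embedding dimension
    n_layers: int = 12        # transformer blocks
    n_heads: int = 12         # attention heads per block
```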
Each transformer block uses Pre-LayerNorm, exactly like the original GPT-2.
Weight tying between token embeddings and output projection is preserved.
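A sketch of both ideas in NumPy. Here `attention` and `mlp` are passed in as stand-ins for the repo's real sub-layers, and the parameter names are assumptions:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def pre_ln_block(x, p, attention, mlp):
    # Pre-LN: normalize *before* each sub-layer, then add the residual back.
    x = x + attention(layer_norm(x, p["ln1_g"], p["ln1_b"]))
    x = x + mlp(layer_norm(x, p["ln2_g"], p["ln2_b"]))
    return x

def logits_from_hidden(h, wte):
    # Weight tying: the output projection reuses the token embedding matrix.
    return h @ wte.T    # (seq_len, 768) @ (768, 50257) -> (seq_len, 50257)
```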
.
├── main.py # Entry point
├── generate.py # Autoregressive generation loop
├── model.py # GPT-2 model
├── layer.py # Transformer block
├── attention.py # Causal multi-head attention
├── mlp.py # Feed-forward network
├── tensor_ops.py # NumPy tensor operations
├── tokenizer.py # GPT-2 tokenizer (no transformers)
├── load.py # Load pretrained weights (safetensors)
└── config.py # Model configuration
Everything is explicit. Nothing is hidden behind a framework.
- Python 3.9+
- NumPy
- tokenizers
- huggingface_hub
- safetensors
It is recommended to use a virtual environment.
Create and activate a .venv:
python3 -m venv .venv
source .venv/bin/activate

Install dependencies:
pip install numpy tokenizers huggingface_hub safetensors

This project is educational.
It prioritizes:
- clarity
- faithfulness to GPT-2
- conceptual transparency
over:
- speed
- memory efficiency
- scalability
If you want performance, use a real inference engine. If you want understanding, read the code.
Apache 2.0; see the LICENSE file.