numpy-gpt2


A minimal GPT-2 inference implementation using NumPy only.

No PyTorch.
No autograd.
No training loop.

Just matrices, attention, and pretrained weights.


What this repository is

This project implements GPT-2 (small, 117M) inference from scratch using NumPy only.

The goal is not performance.
The goal is understanding.

Every major component of GPT-2 is explicitly implemented:

  • token & positional embeddings
  • causal multi-head self-attention
  • MLP (feed-forward network)
  • residual connections
  • layer normalization
  • autoregressive text generation

The model loads official pretrained weights from Hugging Face (safetensors)
and runs inference without any deep learning framework.
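
As a rough sketch of how those weights can be fetched and read as NumPy arrays (assuming the standard "gpt2" checkpoint layout on the Hub; load.py handles the real details):

from huggingface_hub import hf_hub_download
from safetensors.numpy import load_file

path = hf_hub_download(repo_id="gpt2", filename="model.safetensors")
weights = load_file(path)                     # dict of name -> np.ndarray
for name, tensor in list(weights.items())[:5]:
    print(name, tensor.shape)                 # embedding and attention matrices, etc.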


What this repository is not

  • ❌ Not a training implementation
  • ❌ Not optimized (no KV cache, O(N²) generation)
  • ❌ Not meant for production
  • ❌ Not a PyTorch reimplementation in disguise

This is an educational, mechanical view of what a large language model actually does.


Why this exists

Large Language Models often appear “magical”.

This repository removes that illusion.

What remains is:

  • linear algebra
  • attention
  • statistical coherence

There is no reasoning engine here.
No symbols.
No facts.

And yet, semantic consistency emerges.

That tension is the point.


Example

./.venv/bin/python main.py \
"The animal is yellow. It's a cat. It's color is yellow. The color of the cat"

The model correctly completes the sentence by inferring that the cat is yellow.

There is no logic rule enforcing this. Only attention maintaining semantic invariants across the sequence.


How generation works

Text generation is autoregressive and intentionally naïve:

  1. Tokenize the prompt
  2. Run a full forward pass over the sequence
  3. Take the most probable next token (greedy sampling)
  4. Append it
  5. Repeat

⚠️ For pedagogical clarity, no KV cache is used. This means the entire sequence is recomputed at each step.
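
A minimal sketch of that loop, using hypothetical model and tokenizer interfaces (the actual loop lives in generate.py):

import numpy as np

def generate_greedy(model, tokenizer, prompt, max_new_tokens=20):
    ids = tokenizer.encode(prompt)               # 1. tokenize the prompt
    for _ in range(max_new_tokens):
        logits = model.forward(np.array(ids))    # 2. full forward pass -> (seq_len, vocab_size)
        next_id = int(np.argmax(logits[-1]))     # 3. greedy: pick the most probable next token
        ids.append(next_id)                      # 4. append it
    return tokenizer.decode(ids)                 # 5. repeat until the budget is spent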


Architecture overview

GPT-2 small:

  • Vocabulary size: 50,257
  • Context window: 1024 tokens
  • Embedding dimension: 768
  • Layers: 12
  • Attention heads: 12
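
Expressed as a small configuration sketch (config.py may organize these values differently):

from dataclasses import dataclass

@dataclass
class GPT2Config:
    vocab_size: int = 50257
    n_ctx: int = 1024     # context window
    n_embd: int = 768     # embedding dimension
    n_layer: int = 12
    n_head: int = 12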

Each transformer block uses Pre-LayerNorm, exactly like the original GPT-2.

Weight tying between token embeddings and output projection is preserved.
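
In NumPy terms, a Pre-LayerNorm block and the tied output projection look roughly like this (names are illustrative, not layer.py's):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def block(x, params, attn, mlp):
    # Pre-LN: normalize before each sub-layer, then add the residual
    x = x + attn(layer_norm(x, *params["ln_1"]))
    x = x + mlp(layer_norm(x, *params["ln_2"]))
    return x

# Weight tying: the final projection reuses the token embedding matrix wte,
# so logits = layer_norm(h, *params["ln_f"]) @ wte.T has shape (seq_len, 50257).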


Code structure

.
├── main.py        # Entry point
├── generate.py    # Autoregressive generation loop
├── model.py       # GPT-2 model
├── layer.py       # Transformer block
├── attention.py   # Causal multi-head attention
├── mlp.py         # Feed-forward network
├── tensor_ops.py  # NumPy tensor operations
├── tokenizer.py   # GPT-2 tokenizer (no transformers)
├── load.py        # Load pretrained weights (safetensors)
└── config.py      # Model configuration

Everything is explicit. Nothing is hidden behind a framework.


Requirements

  • Python 3.9+
  • NumPy
  • tokenizers
  • huggingface_hub
  • safetensors

It is recommended to use a virtual environment.

Create and activate a .venv:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install numpy tokenizers huggingface_hub safetensors

Disclaimer

This project is educational.

It prioritizes:

  • clarity
  • faithfulness to GPT-2
  • conceptual transparency

over:

  • speed
  • memory efficiency
  • scalability

If you want performance, use a real inference engine. If you want understanding, read the code.


License

Apache 2.0. See the LICENSE file.
