This is a PoC of how transformers can be used to deliver a payload by intentionally overfitting them to memorise the payload and transmit it as model weights.

🧠 Decoder-Only Transformer Memorizer (PoC)

The original repository that inspired this idea is here; it used LSTMs, but its suggestions section noted that other sequence models could also be exploited by overfitting them onto malicious code.

This project demonstrates a minimal decoder-only Transformer model built in PyTorch that learns to memorize a given program (Python source code). The model is trained to autoregressively generate the entire source file token by token, starting from a special <BOS> token.

✅ Designed as a Proof-of-Concept for low-resource, deterministic memorization.


✨ Project Highlights

  • 🔁 Transformer Encoder used with causal masking to simulate decoder-only behavior (GPT-style); a minimal sketch of this trick follows this list
  • 🧠 Learns to generate entire code files from scratch, given a <BOS> token
  • 🔐 Uses safetensors format for secure, fast model serialization
  • 📦 Self-contained pipeline: training, generation, tokenizer, and dataset
  • ✅ Fully written in pure PyTorch with no external LLM libraries
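
The "causal masking on an encoder" trick from the first bullet can be sketched as below. This is a minimal illustration with placeholder hyperparameters and the hypothetical class name TinyMemorizer; the actual model.py may differ.

import torch
import torch.nn as nn

class TinyMemorizer(nn.Module):
    # Illustrative sizes; not necessarily those used in model.py.
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, max_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):  # ids: (batch, seq_len) of token indices
        seq_len = ids.size(1)
        pos = torch.arange(seq_len, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        # Upper-triangular -inf mask: position t may only attend to positions <= t,
        # which turns the plain encoder into a GPT-style decoder.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=ids.device), diagonal=1)
        x = self.encoder(x, mask=mask)
        return self.lm_head(x)  # logits: (batch, seq_len, vocab_size)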

📁 Project Structure

.
├── example.py             # Source code to memorize
├── train.py               # Training script
├── generate.py            # Text generation script
├── model.py               # Transformer model definition
├── tokenizer.py           # Char-level tokenizer
├── dataset.py             # Dataset for next-token prediction
├── vocab.json             # Auto-generated vocabulary file
├── model.safetensors      # Auto-saved trained model
└── README.md              # You're here

🚀 How It Works

The model is trained to learn P(token_t | token_1, ..., token_{t-1}) by predicting the next character at each position.
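
Concretely, the inputs and targets are just the same sequence shifted by one position. A toy, self-contained illustration (with a made-up mini-vocabulary, not the project's real vocab.json):

# Toy next-character setup; the real pipeline lives in tokenizer.py and dataset.py.
text = "def add(a, b):"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
ids = [vocab[ch] for ch in text]

inputs  = ids[:-1]   # what the model sees
targets = ids[1:]    # the next character it must predict at each position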

It uses:

  • <BOS>: Begin-of-sequence token
  • <EOS>: End-of-sequence token
  • <PAD>: Padding token
  • <UNK>: Unknown character fallback

⚠️ The tokenizer is character-level, meaning it can memorize any character sequence, not just Python code.
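
A character-level tokenizer with those four special tokens can be sketched like this (a simplified stand-in for tokenizer.py; the method names here are assumptions, not necessarily the repo's API):

import json

class CharTokenizer:
    SPECIALS = ["<PAD>", "<BOS>", "<EOS>", "<UNK>"]

    def __init__(self, text):
        chars = sorted(set(text))
        self.vocab = {tok: i for i, tok in enumerate(self.SPECIALS + chars)}
        self.inv = {i: tok for tok, i in self.vocab.items()}

    def encode(self, text):
        unk = self.vocab["<UNK>"]
        body = [self.vocab.get(ch, unk) for ch in text]
        return [self.vocab["<BOS>"]] + body + [self.vocab["<EOS>"]]

    def decode(self, ids):
        return "".join(self.inv[i] for i in ids if self.inv[i] not in self.SPECIALS)

    def save(self, path="vocab.json"):
        with open(path, "w") as f:
            json.dump(self.vocab, f)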


📦 Setup

git clone https://github.com/sortira/transformer-decoder-only-memorisation.git
cd transformer-decoder-only-memorisation
pip install torch safetensors

✅ Compatible with CPU or GPU (CUDA automatically used if available)


🏋️‍♂️ Training

Place your target program in example.py.

Then run:

python train.py

This will:

  • Encode the program
  • Train the transformer to memorize it
  • Save the model to model.safetensors
  • Save the vocabulary to vocab.json

Progress will be logged every 100 epochs.
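
In outline, the training loop looks roughly like the sketch below. It reuses the CharTokenizer and TinyMemorizer sketches from earlier sections and guesses at hyperparameters (epoch count, learning rate), so treat it as illustrative rather than a copy of train.py.

import torch
import torch.nn.functional as F
from safetensors.torch import save_file

source = open("example.py").read()
tokenizer = CharTokenizer(source)                           # sketch from above
ids = torch.tensor(tokenizer.encode(source)).unsqueeze(0)   # shape (1, T)

model = TinyMemorizer(vocab_size=len(tokenizer.vocab))      # sketch from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(1, 2001):
    logits = model(ids[:, :-1])                             # predict the next character
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f"epoch {epoch}  loss {loss.item():.4f}")

save_file(model.state_dict(), "model.safetensors")
tokenizer.save("vocab.json")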


🔮 Generation

Once training is complete, you can regenerate the file using:

python generate.py

This will:

  • Load the trained model
  • Generate the program from <BOS> token
  • Print it to stdout
  • Optionally save to out_generated.py (a decoding sketch follows this list)
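
The regeneration step boils down to greedy autoregressive decoding from <BOS> until <EOS> (again a sketch built on the classes above, loading the weights with safetensors; generate.py itself may differ in details):

import torch
from safetensors.torch import load_file

model.load_state_dict(load_file("model.safetensors"))
model.eval()

ids = [tokenizer.vocab["<BOS>"]]
with torch.no_grad():
    for _ in range(2000):                        # safety cap, kept under the model's context length
        logits = model(torch.tensor([ids]))
        next_id = int(logits[0, -1].argmax())    # greedy: most likely next character
        ids.append(next_id)
        if next_id == tokenizer.vocab["<EOS>"]:
            break

print("---- Generated ----")
print(tokenizer.decode(ids))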

🧪 Example

# example.py

def add(a, b):
    return a + b

After training and running generate.py, you’ll see:

---- Generated ----
def add(a, b):
    return a + b

🧠 Why This Works

Transformer-based LLMs like GPT are decoder-only: they generate text autoregressively. This project replicates that architecture at a micro scale, making it well suited for:

  • PoC experiments
  • Verifying memory capacity
  • Pretraining logic on toy datasets
  • Educational demos

⚙️ Customization

Want to memorize a different file?

  • Replace example.py with your new target
  • Re-run train.py
  • Re-run generate.py

🧱 Future Directions

  • Add sampling (temperature, top-k); a sketch follows this list
  • Use BPE/WordPiece instead of char-level
  • Train on a corpus of multiple functions
  • Turn into a code autocompleter from partial input
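
For the sampling item, temperature and top-k could replace the greedy argmax with something along these lines (a hypothetical addition, not currently in the repo):

import torch

def sample_next(logits, temperature=0.8, top_k=10):
    # Scale the logits, keep only the top_k most likely tokens, then sample one.
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

Inside the decoding loop, next_id = sample_next(logits[0, -1]) would then replace the greedy argmax.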

📝 License

This PoC is MIT licensed. Use it, build on it, or fork it freely.
