Interpretability Experiments

This repository is a collection of small, exploratory experiments I'm conducting while learning about mechanistic interpretability. Each subdirectory corresponds to a focused question rather than a polished or finalized project.

The goal is to document how I approach open-ended questions, build intuition for experimentation, and verify theoretical concepts manually.

Current Structure

├── experiments/              # fully fleshed experiments (training, probing, steering, etc.)
│   ├── ioi_induction/
│   └── mini_tasks/           # lightweight, exploratory notebooks, usually working with GPT-2 Medium
└── utils/                    # shared library (hooks, probes, and custom models)

Completed Experiments

  • Induction Head & IOI Task
    • Trained a 2-layer transformer on an Indirect Object Identification (IOI) task (A...B...A -> B) to observe the emergence of induction heads.
    • Used linear probes to visualize the "S-Curve" phase transition where the model learns to copy information from context.
    • Performed a causal intervention (activation steering) by injecting embedding-difference vectors (Emb(C) - Emb(B)) to force the model to predict a different token (a rough sketch of this step follows this list).
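
For reference, the steering step can be sketched in a few lines of PyTorch. The snippet below is illustrative only and is not the implementation in utils/: model.embed and model.blocks are hypothetical attribute names standing in for the small trained transformer, and the pre-hook simply adds Emb(C) - Emb(B) to the residual stream at one position before a chosen block.

# Minimal sketch of the embedding-difference steering intervention.
# `model.embed` and `model.blocks` are hypothetical attribute names for a
# small decoder-only transformer; substitute the real module paths.
import torch

def steer(model, tokens, pos, tok_b, tok_c, layer=0, scale=1.0):
    """Add scale * (Emb(C) - Emb(B)) to the residual stream at position `pos`
    just before `layer`, then return the steered logits."""
    emb = model.embed.weight                      # (vocab_size, d_model)
    steer_vec = scale * (emb[tok_c] - emb[tok_b]).detach()

    def pre_hook(module, inputs):
        resid, *rest = inputs                     # residual stream: (batch, seq, d_model)
        resid = resid.clone()
        resid[:, pos, :] += steer_vec             # inject the steering vector at one position
        return (resid, *rest)

    handle = model.blocks[layer].register_forward_pre_hook(pre_hook)
    try:
        with torch.no_grad():
            logits = model(tokens)                # forward pass with the hook active
    finally:
        handle.remove()                           # always detach the hook
    return logits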

Mini-Tasks

  • KV-Cache Memory Audit
    • Inspecting how attention KV-cache memory grows with sequence length and layer count by intercepting the forward pass and manually tracking tensor allocation (a condensed sketch follows this list).
  • Activation Steering (Exploratory)
    • Basic interventions on internal activations to observe downstream behavioral changes.
    • Primarily aimed at building intuition for how localized representation edits propagate through the model.
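
A condensed version of the KV-cache audit, assuming a Hugging Face GPT-2 Medium checkpoint. The notebook's version intercepts the forward pass with hooks; this sketch just measures the cache returned with use_cache=True, and since the exact cache container varies across transformers versions, the helper walks whatever key/value tensors the cache exposes.

# Rough sketch: how much memory the KV cache occupies at different sequence lengths.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")

def kv_cache_bytes(past_key_values):
    """Sum the bytes held by every key/value tensor in the cache."""
    total = 0
    for layer in past_key_values:          # one (key, value) pair per layer
        for tensor in layer:
            total += tensor.numel() * tensor.element_size()
    return total

with torch.no_grad():
    for seq_len in (16, 64, 256, 1024):    # GPT-2's context window tops out at 1024
        # Dummy input of the desired length; token identity doesn't affect memory use.
        input_ids = torch.full((1, seq_len), tokenizer.eos_token_id, dtype=torch.long)
        out = model(input_ids, use_cache=True)
        mib = kv_cache_bytes(out.past_key_values) / 2**20
        print(f"seq_len={seq_len:5d}  kv_cache={mib:8.2f} MiB")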

Ideas to Explore Next

  • Path Patching on IOI: Map the exact information flow to isolate circuit logic.
  • Head Ablation Studies: Zero out individual attention-head outputs to test component necessity and redundancy (a toy sketch follows this list).
  • Toy Models of Superposition: Visualize how networks compress many features into few dimensions.
  • Glitch Token Analysis: Trace anomalous inputs to identify specific failure layers.
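
As a starting point for the ablation idea, here is a toy sketch that zero-ablates one attention head in a Hugging Face GPT-2 model by zeroing that head's slice of the attention output before the output projection. The module path transformer.h[layer].attn.c_proj follows the current GPT-2 implementation (it may differ in other transformers versions), and the chosen layer/head are arbitrary placeholders rather than a known circuit component.

# Toy sketch of zero-ablating a single attention head in GPT-2 Medium.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")

def ablate_head(layer, head):
    head_dim = model.config.hidden_size // model.config.num_attention_heads

    def pre_hook(module, inputs):
        hidden, = inputs                                            # (batch, seq, n_heads * head_dim)
        hidden = hidden.clone()
        hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0    # silence one head's output
        return (hidden,)

    # c_proj mixes the heads back together, so zeroing its input slice
    # removes exactly this head's contribution to the residual stream.
    return model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(pre_hook)

tokens = tokenizer("When Mary and John went to the store, John gave a drink to",
                   return_tensors="pt")
with torch.no_grad():
    clean_logits = model(**tokens).logits[0, -1]
    handle = ablate_head(layer=9, head=6)                           # arbitrary example head
    ablated_logits = model(**tokens).logits[0, -1]
    handle.remove()

mary_id = tokenizer.encode(" Mary")[0]
print("logit shift for ' Mary':", (ablated_logits - clean_logits)[mary_id].item())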

This work loosely follows the ARENA curriculum, with additional directions driven by personal curiosity.
