This repository is a collection of small, exploratory experiments I'm conducting while learning about mechanistic interpretability. Each subdirectory corresponds to a focused question rather than a polished or finalized project.
The goal is to document how I approach open-ended questions, build intuition through hands-on experimentation, and verify theoretical concepts for myself.
```
├── experiments/        # fully fleshed-out experiments (training, probing, steering, etc.)
│   ├── ioi_induction/
│   └── mini_tasks/     # lightweight, exploratory notebooks, usually on GPT-2 Medium
└── utils/              # shared library (hooks, probes, and custom models)
```
- Induction Head & IOI Task
  - Trained a 2-layer transformer on an Indirect Object Identification (IOI) task (`A...B...A -> B`) to observe the emergence of induction heads.
  - Used linear probes to visualize the "S-curve" phase transition where the model learns to copy information from context.
  - Performed causal interventions (activation steering) by injecting embedding vectors (`Emb(C) - Emb(B)`) to force the model to predict a different token (see the sketch below).
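A minimal sketch of that embedding-difference intervention, assuming a toy PyTorch model with a standard `nn.Embedding` named `embed` that returns logits directly; the names `pos`, `tok_b`, and `tok_c` are illustrative, not the repo's actual code:

```python
import torch

def steer_with_embedding_diff(model, input_ids, pos, tok_b, tok_c):
    """Add Emb(C) - Emb(B) to the embedding output at position `pos`,
    nudging the model toward predicting C where it would predict B."""
    emb = model.embed.weight                  # assumed [vocab, d_model] embedding matrix
    delta = emb[tok_c] - emb[tok_b]

    def hook(module, inputs, output):
        output = output.clone()
        output[:, pos, :] += delta            # inject the steering vector
        return output

    handle = model.embed.register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(input_ids)         # assumed to return logits directly
    finally:
        handle.remove()
    return logits
```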
- KV-Cache Memory Audit
  - Inspecting how attention KV-cache memory grows with sequence length and across layers by intercepting the forward pass and manually tracking tensor allocations (sketched below).
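As a rough illustration of the audit, assuming a Hugging Face GPT-2 where `past_key_values` iterates as one `(key, value)` pair per layer (true for both the legacy tuple format and the newer `DynamicCache`):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids

with torch.no_grad():
    for t in range(1, ids.shape[1] + 1):
        out = model(ids[:, :t], use_cache=True)
        # Keys/values are [batch, n_head, seq_len, d_head] per layer,
        # so total cache memory should grow linearly in t.
        n_bytes = sum(
            k.numel() * k.element_size() + v.numel() * v.element_size()
            for k, v in out.past_key_values
        )
        print(f"seq_len={t:2d}  kv_cache={n_bytes / 1024:.1f} KiB")
```

The printed numbers should track `2 * n_layers * n_heads * d_head * seq_len` elements, which makes deviations (e.g., from padding or cache pre-allocation) easy to spot.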
- Activation Steering (Exploratory)
  - Basic interventions on internal activations to observe downstream behavioral changes.
  - Primarily aimed at building intuition for how localized representation edits propagate through the model (see the hook-based sketch below).
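A hook-based sketch of such an intervention on GPT-2; the layer index, strength, and random direction are placeholders for whatever feature direction is actually under study:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

layer, alpha = 6, 5.0                        # intervention point and strength (illustrative)
direction = torch.randn(model.config.n_embd)
direction /= direction.norm()                # stand-in for a real feature direction

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual stream.
    return (output[0] + alpha * direction,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    steered_logits = model(ids).logits
handle.remove()
```

Comparing `steered_logits` against an unhooked run makes the downstream effect of the edit directly measurable.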
- Planned Directions
  - Path Patching on IOI: Map the exact information flow to isolate circuit logic.
  - Head Ablation Studies: Zero head outputs to test component necessity and redundancy (see the sketch after this list).
  - Toy Models of Superposition: Visualize how networks compress many features into few dimensions.
  - Glitch Token Analysis: Trace anomalous inputs to identify specific failure layers.
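Of these, head ablation needs the least machinery. One way to do it on GPT-2 is to zero a single head's slice of the input to the attention output projection, since that input is just the per-head outputs concatenated; the layer and head indices below are hypothetical:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

layer, head = 5, 1                                    # component under test (hypothetical)
d_head = model.config.n_embd // model.config.n_head

def ablate_head(module, inputs):
    # attn.c_proj receives the concatenated per-head outputs, so zeroing
    # one d_head-wide slice removes exactly that head's contribution.
    x = inputs[0].clone()
    x[..., head * d_head:(head + 1) * d_head] = 0.0
    return (x,)

handle = model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(ablate_head)
# ...run the task and compare logits against the un-ablated baseline...
handle.remove()
```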
This work loosely follows the ARENA curriculum, with additional directions driven by personal curiosity.