Interpretability Experiments

This repository is a collection of small, exploratory experiments I'm conducting while learning about mechanistic interpretability. Each subdirectory corresponds to a focused question rather than a polished or finalized project.

The goal is to document how I approach open-ended questions, build intuition for experimentation, and verify theoretical concepts manually.

Current Structure

├── experiments/              # fully fleshed experiments (training, probing, steering, etc.)
│   ├── ioi_induction/
│   └── mini_tasks/           # lightweight, exploratory notebooks, usually working with GPT-2 Medium
└── utils/                    # shared library (hooks, probes, and custom models)

Completed Experiments

  • Induction Head & IOI Task
    • Trained a 2-layer transformer on an Indirect Object Identification (IOI) task (A...B...A -> B) to observe the emergence of induction heads.
    • Used linear probes to visualize the "S-Curve" phase transition where the model learns to copy information from context.
    • Performed a causal intervention (activation steering) by injecting embedding-difference vectors (Emb(C) - Emb(B)) to force the model to predict a different token (a rough sketch of this step follows this list).
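
For reference, the steering step can be sketched in a few lines of PyTorch. The snippet below is illustrative only and is not the implementation in utils/: model.embed and model.blocks are hypothetical attribute names standing in for the small trained transformer, and the pre-hook simply adds Emb(C) - Emb(B) to the residual stream at one position before a chosen block.

# Minimal sketch of the embedding-difference steering intervention.
# `model.embed` and `model.blocks` are hypothetical attribute names for a
# small decoder-only transformer; substitute the real module paths.
import torch

def steer(model, tokens, pos, tok_b, tok_c, layer=0, scale=1.0):
    """Add scale * (Emb(C) - Emb(B)) to the residual stream at position `pos`
    just before `layer`, then return the steered logits."""
    emb = model.embed.weight                      # (vocab_size, d_model)
    steer_vec = scale * (emb[tok_c] - emb[tok_b]).detach()

    def pre_hook(module, inputs):
        resid, *rest = inputs                     # residual stream: (batch, seq, d_model)
        resid = resid.clone()
        resid[:, pos, :] += steer_vec             # inject the steering vector at one position
        return (resid, *rest)

    handle = model.blocks[layer].register_forward_pre_hook(pre_hook)
    try:
        with torch.no_grad():
            logits = model(tokens)                # forward pass with the hook active
    finally:
        handle.remove()                           # always detach the hook
    return logits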

Mini-Tasks

  • KV-Cache Memory Audit
    • Inspecting how attention KV-cache memory grows with sequence length and layer count by intercepting the forward pass and manually tracking tensor allocation (a condensed sketch follows this list).
  • Activation Steering (Exploratory)
    • Basic interventions on internal activations to observe downstream behavioral changes.
    • Primarily aimed at building intuition for how localized representation edits propagate through the model.
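
A condensed version of the KV-cache audit, assuming a Hugging Face GPT-2 Medium checkpoint. The notebook's version intercepts the forward pass with hooks; this sketch just measures the cache returned with use_cache=True, and since the exact cache container varies across transformers versions, the helper walks whatever key/value tensors the cache exposes.

# Rough sketch: how much memory the KV cache occupies at different sequence lengths.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")

def kv_cache_bytes(past_key_values):
    """Sum the bytes held by every key/value tensor in the cache."""
    total = 0
    for layer in past_key_values:          # one (key, value) pair per layer
        for tensor in layer:
            total += tensor.numel() * tensor.element_size()
    return total

with torch.no_grad():
    for seq_len in (16, 64, 256, 1024):    # GPT-2's context window tops out at 1024
        # Dummy input of the desired length; token identity doesn't affect memory use.
        input_ids = torch.full((1, seq_len), tokenizer.eos_token_id, dtype=torch.long)
        out = model(input_ids, use_cache=True)
        mib = kv_cache_bytes(out.past_key_values) / 2**20
        print(f"seq_len={seq_len:5d}  kv_cache={mib:8.2f} MiB")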

Ideas to Explore Next

  • Path Patching on IOI: Map the exact information flow to isolate circuit logic.
  • Head Ablation Studies: Zero out individual attention-head outputs to test component necessity and redundancy (a toy sketch follows this list).
  • Toy Models of Superposition: Visualize how networks compress many features into few dimensions.
  • Glitch Token Analysis: Trace anomalous inputs to identify specific failure layers.
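
As a starting point for the ablation idea, here is a toy sketch that zero-ablates one attention head in a Hugging Face GPT-2 model by zeroing that head's slice of the attention output before the output projection. The module path transformer.h[layer].attn.c_proj follows the current GPT-2 implementation (it may differ in other transformers versions), and the chosen layer/head are arbitrary placeholders rather than a known circuit component.

# Toy sketch of zero-ablating a single attention head in GPT-2 Medium.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")

def ablate_head(layer, head):
    head_dim = model.config.hidden_size // model.config.num_attention_heads

    def pre_hook(module, inputs):
        hidden, = inputs                                            # (batch, seq, n_heads * head_dim)
        hidden = hidden.clone()
        hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0    # silence one head's output
        return (hidden,)

    # c_proj mixes the heads back together, so zeroing its input slice
    # removes exactly this head's contribution to the residual stream.
    return model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(pre_hook)

tokens = tokenizer("When Mary and John went to the store, John gave a drink to",
                   return_tensors="pt")
with torch.no_grad():
    clean_logits = model(**tokens).logits[0, -1]
    handle = ablate_head(layer=9, head=6)                           # arbitrary example head
    ablated_logits = model(**tokens).logits[0, -1]
    handle.remove()

mary_id = tokenizer.encode(" Mary")[0]
print("logit shift for ' Mary':", (ablated_logits - clean_logits)[mary_id].item())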

This work loosely follows the ARENA curriculum, with additional directions driven by personal curiosity.
