This episode of Inside the Black Box: Cracking AI and Deep Learning explores a new theoretical framework that unifies sparse autoencoders (SAEs), transcoders, and crosscoders — and what it tells us about when mechanistic interpretability actually works.
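For listeners who want a concrete picture before diving in, here is a minimal sketch of the sparse encode-decode skeleton that all three tools share. This is our illustration, not code from the paper; names like `SparseCoder` and `d_features` are ours.

```python
# Minimal sketch (PyTorch): the shared skeleton behind SAEs, transcoders,
# and crosscoders. Real implementations add biases, normalization, TopK or
# other sparsity mechanisms, and careful initialization.
import torch
import torch.nn as nn

class SparseCoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # d_features >> d_model: an overcomplete dictionary of candidate concepts
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # nonnegative, (hopefully) sparse features
        return f, self.decoder(f)         # features and reconstruction

# The three tools differ in what they read and write:
#   SAE:        x -> reconstruct the same activations x
#   transcoder: activations before a layer -> activations after it
#   crosscoder: one shared feature dictionary across layers or models
```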
We start by demystifying these tools: how they use sparse features to uncover internal concepts and computations in large language models, from DNA detectors to deception circuits in Claude 3 Sonnet. Then we introduce the linear representation hypothesis, which treats concepts as directions in activation space, and the challenge of superposition that arises when thousands of concepts must fit into far fewer dimensions.
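To make the superposition claim concrete, here is a toy demonstration (ours; every number is arbitrary) that far more "concept" directions than dimensions can coexist, so long as only a few are active at once:

```python
# Toy superposition demo: 2000 random unit directions in a 256-dim space are
# nearly orthogonal, so a sparse sum of a few of them can still be read off
# with simple dot products. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d, n_concepts, k_active = 256, 2000, 5

directions = rng.standard_normal((n_concepts, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

overlaps = directions @ directions.T
np.fill_diagonal(overlaps, 0.0)
print(f"max |cosine| between distinct concepts: {np.abs(overlaps).max():.3f}")

# Build an "activation" as a sparse combination of a few active concepts.
active = rng.choice(n_concepts, size=k_active, replace=False)
x = directions[active].sum(axis=0)

# Because interference between near-orthogonal directions is small, the true
# concepts almost always dominate the dot-product scores.
scores = directions @ x
print("recovered:", sorted(np.argsort(-scores)[:k_active]), "truth:", sorted(active))
```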
Finally, we dive into Tang et al.’s recovery theorems, the compressed sensing roots of their approach, and why these results matter for using SAEs as a reliable “microscope” on model internals, especially in the context of fine-tuning and LoRA experiments. Along the way, we confront the uncomfortable possibility that the linear picture may break down at frontier scales, and ask what that would mean for the future of interpretability as a safety strategy.
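The flavor of those recovery results can be previewed with a generic compressed-sensing exercise: recover a sparse feature vector from an underdetermined linear system. The sketch below uses plain ISTA for the LASSO, standard compressed sensing machinery rather than the specific construction in Tang et al., and every constant in it is an assumption chosen for illustration.

```python
# Compressed-sensing intuition: with only 64 measurements of 256 possible
# features, near-exact recovery is still possible when just a few features
# are active. Generic ISTA/LASSO sketch, illustrative only.
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 64, 256, 4                           # dims, dictionary size, active features

D = rng.standard_normal((d, n)) / np.sqrt(d)   # random "concept" dictionary
f_true = np.zeros(n)
f_true[rng.choice(n, size=k, replace=False)] = rng.uniform(1.0, 2.0, size=k)
x = D @ f_true                                 # the observed activation

# ISTA: gradient step on the least-squares term, then soft-thresholding
# to enforce the L1 sparsity penalty.
f = np.zeros(n)
step = 1.0 / np.linalg.norm(D, 2) ** 2         # 1 / Lipschitz constant of the gradient
lam = 0.05
for _ in range(500):
    f = f + step * (D.T @ (x - D @ f))
    f = np.sign(f) * np.maximum(np.abs(f) - step * lam, 0.0)

print("recovered support:", np.nonzero(np.abs(f) > 0.1)[0])
print("true support:     ", np.nonzero(f_true)[0])
```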