Thanks to the many people I've chatted about this with over the past several months. And special thanks to Cunningham et al., Marks et al., Joseph Bloom, Trenton Bricken, Adrià Garriga-Alonso, and Johnny Lin, for crucial research artefacts and/or feedback.
Codebase: sparse_circuit_discovery
TL;DR: The residual stream in GPT-2-small, expanded with sparse autoencoders and systematically ablated, looks like the working memory of a forward pass. A few high-magnitude features causally propagate themselves through the model during inference, and these features are interpretable. We can see where in the forward pass, and due to which transformer layer, those propagating features are written in and/or scrubbed out.
Introduction
What is GPT-2-small thinking about during an arbitrary forward pass?
I've been trying to isolate legible model circuits using sparse autoencoders. I was inspired by the following example, from the end of Cunningham et al. (2023):
This directed graph is a legible circuit spread across [...]
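To make the method above concrete, here is a minimal toy sketch of the core operation: expanding a residual-stream vector into sparse-autoencoder features, zeroing one feature, and projecting back. The weights, sizes, and function names here are illustrative stand-ins, not the actual codebase's API; in practice the SAE weights are learned and the edited vector is patched back into the model mid-forward-pass.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 8, 32  # toy sizes; GPT-2-small's residual stream is 768-dimensional

# Stand-in sparse-autoencoder weights (in practice these are learned).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(resid):
    """Expand a residual-stream vector into sparse feature activations."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)  # ReLU keeps activations sparse

def decode(feats):
    """Project feature activations back down into the residual stream."""
    return feats @ W_dec + b_dec

def ablate_feature(resid, idx):
    """Zero out one feature, then reconstruct the residual-stream vector.

    Patching this edited vector back into the forward pass and watching
    what changes downstream is the basic causal-ablation step.
    """
    feats = encode(resid)
    feats[idx] = 0.0
    return decode(feats)

resid = rng.normal(size=d_model)      # stand-in for one token position's residual stream
edited = ablate_feature(resid, idx=3)
```

Repeating this ablation for each feature at each layer, and recording which downstream features shift, is what yields the directed causal graphs discussed below.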
---
Outline:
(00:54) Introduction
(01:46) Related Work
(02:15) Methodology
(03:52) Causal Graphs Key
(05:15) Results
(05:19) Parentheses Example
(06:05) Figure 1
(06:34) Figure 2
(07:12) Figure 3
(08:00) Figure 4
(09:08) Validation of Graph Interpretations
(09:18) Fig. 1 - Parentheticals
(10:19) Fig. 2 - C-words
(11:06) Fig. 3 - C-words
(11:33) Fig. 4 - Acronyms
(12:11) Other Validation
(12:59) Discussion
(13:02) The Residual Stream Is Working Memory...
(13:41) ...And It Is Easily Interpretable.
(14:19) Conclusion
The original text contained 7 footnotes which were omitted from this narration.
The original text contained 3 images which were described by AI.
---