
Thanks to the many people I've chatted about this with over the past several months. And special thanks to Cunningham et al., Marks et al., Joseph Bloom, Trenton Bricken, Adrià Garriga-Alonso, and Johnny Lin, for crucial research artefacts and/or feedback.
Codebase: sparse_circuit_discovery
TL;DR: The residual stream in GPT-2-small, expanded with sparse autoencoders and systematically ablated, looks like the working memory of a forward pass. A few high-magnitude features causally propagate themselves through the model during inference, and these features are interpretable. We can see where in the forward pass, and due to which transformer layer, those propagating features are written in and/or scrubbed out.
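To make the TL;DR concrete, here is a minimal, illustrative sketch of the two operations it describes: expanding a residual-stream activation into a sparse feature basis with an autoencoder, and zero-ablating one feature before projecting back down. This is not the author's sparse_circuit_discovery code; the SAE class, the expansion factor, the random residual-stream tensor, and the feature index are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn

D_MODEL, D_SAE = 768, 768 * 32  # GPT-2-small width; the 32x expansion factor is an assumption


class SparseAutoencoder(nn.Module):
    """Toy SAE: residual stream -> sparse feature activations -> reconstruction."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, resid: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative and (after training) mostly zero
        return torch.relu(self.enc(resid))

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        return self.dec(feats)


sae = SparseAutoencoder(D_MODEL, D_SAE)   # in practice, a pretrained SAE for one residual-stream layer

resid = torch.randn(1, 5, D_MODEL)        # stand-in residual stream: (batch, seq, d_model)
feats = sae.encode(resid)                 # expanded, sparse "working memory" view of the stream

feature_idx = 1234                        # hypothetical feature to ablate
ablated = feats.clone()
ablated[..., feature_idx] = 0.0           # zero-ablate that feature at every position

resid_patched = sae.decode(ablated)       # edited activations, ready to write back into the stream
effect = (sae.decode(feats) - resid_patched).norm().item()
print(f"Residual-stream change from ablating feature {feature_idx}: {effect:.3f}")
```

In the post's setup, the reconstruction error from the ablated decode would be patched back into the forward pass, and the downstream change in other features' activations is what gets drawn as edges in the causal graphs.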
Introduction
What is GPT-2-small thinking about during an arbitrary forward pass?
I've been trying to isolate legible model circuits using sparse autoencoders. I was inspired by the following example, from the end of Cunningham et al. (2023):
This directed graph is a legible circuit spread across [...]
---
Outline:
(00:54) Introduction
(01:46) Related Work
(02:15) Methodology
(03:52) Causal Graphs Key
(05:15) Results
(05:19) Parentheses Example
(06:05) Figure 1
(06:34) Figure 2
(07:12) Figure 3
(08:00) Figure 4
(09:08) Validation of Graph Interpretations
(09:18) Fig. 1 - Parentheticals
(10:19) Fig. 2 - C-words
(11:06) Fig. 3 - C-words
(11:33) Fig. 4 - Acronyms
(12:11) Other Validation
(12:59) Discussion
(13:02) The Residual Stream Is Working Memory...
(13:41) ...And It Is Easily Interpretable.
(14:19) Conclusion
The original text contained 7 footnotes which were omitted from this narration.
The original text contained 3 images which were described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.