
Note: I wrote this post rather quickly as an exercise in sharing rough, unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work or perspectives, please feel free to do so!
Note 2: I originally sent this link to some people for feedback, but I was having trouble viewing the comments on the draft. The post was also in a reasonably complete state, so I decided to just publish it - and now I can see the comments! If you're one of those people, feedback is still very much welcome!
Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible: it requires comparatively few computational resources and facilitates rapid learning due to its very short feedback loop. This means that many junior [...]
---
Outline:
(01:52) Towards a Grand Unifying Theory (GUT) with MI
(03:39) A GUT Needs Paradigms
(04:17) Paradigms Are Instrumental for Progress
(06:23) Three Desiderata for Paradigms
(07:33) Examining Paradigms in Mechanistic Interpretability
(08:07) A Mathematical Framework of a Transformer Circuit
(11:34) The Linear Representation Hypothesis
(15:09) The Superposition Hypothesis
(21:30) Other Bodies of Theory
(21:55) Singular Learning Theory
(25:07) Computational Mechanics
(26:10) The Polytope Hypothesis
(26:33) Distilling A Technical Research Agenda
(29:39) Conclusion
(30:05) Acknowledgements
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
