TL;DR: I've decided to shift my research from mechanistic interpretability to more empirical ("prosaic") interpretability / safety work. Here's why.
All views expressed are my own.
What Really Interests Me
I care about understanding how powerful AI systems think internally. I'm drawn to high-level questions ("what are the model's goals / beliefs?") as opposed to low-level mechanics ("how does the model store and use [specific fact]?"). Sure, figuring out how a model does modular addition is cool, but only insofar as those insights and techniques generalise to understanding higher-level reasoning.
Mech Interp Has Been Disappointing
When it comes to answering these high-level conceptual questions, mechanistic interpretability has been disappointing. IOI (indirect object identification) remains the most interesting circuit we've found in any language model. That's pretty damning. If mechanistic interpretability worked well, we should have already mapped out lots of interesting circuits in open-source 7B models by now.
The field seems conceptually bottlenecked. [...]
---
Outline:
(00:19) What Really Interests Me
(00:46) Mech Interp Has Been Disappointing
(01:25) Doing mech interp research led me to update against it
(02:54) Prosaic Interpretability
(04:11) Intuition pump: Gene analysis for medicine
(05:08) Modern AI Systems will make interpretability difficult
(05:54) The Timing is Frustrating
(06:50) Personal Fit
(08:21) Mech Interp Research that Excites Me
(09:01) Looking Forward
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.