
Sign up to save your podcasts
Or
Part 13 of 12 in the Engineer's Interpretability Sequence.
TL;DR
On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than practical safety work.
Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.
Reflecting on predictions
Please see my original post for 10 specific predictions about what today's paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 [...]
---
Outline:
(00:16) TL;DR
(00:57) Reflecting on predictions
(02:09) A review + thoughts
---
First published:
Source:
Narrated by TYPE III AUDIO.
Part 13 of 12 in the Engineer's Interpretability Sequence.
TL;DR
On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than practical safety work.
Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.
Reflecting on predictions
Please see my original post for 10 specific predictions about what today's paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 [...]
---
Outline:
(00:16) TL;DR
(00:57) Reflecting on predictions
(02:09) A review + thoughts
---
First published:
Source:
Narrated by TYPE III AUDIO.
26,446 Listeners
2,388 Listeners
7,910 Listeners
4,133 Listeners
87 Listeners
1,462 Listeners
9,095 Listeners
87 Listeners
389 Listeners
5,432 Listeners
15,174 Listeners
474 Listeners
121 Listeners
75 Listeners
459 Listeners