


Part 13 of 12 in the Engineer's Interpretability Sequence.
TL;DR
On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than practical safety work.
Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.
Reflecting on predictions
Please see my original post for 10 specific predictions about what today's paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 [...]
---
Outline:
(00:16) TL;DR
(00:57) Reflecting on predictions
(02:09) A review + thoughts
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
