Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or differently. My hope in writing this is to help readers understand the similarities and differences, and perhaps to lay the groundwork for a future synthesis approach.
First, let me note that we arrived at similar techniques in similar ways: both Anthropic and my team follow the lead of Lee Sharkey, Dan Braun, and beren's [Interim research report] Taking features out of superposition with sparse autoencoders, though I don't know how directly Anthropic was inspired by that post. I believe both our teams were pleasantly surprised to find out the other one was working on similar lines, serving as a form of replication.
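For readers unfamiliar with the shared technique: both papers train a sparse autoencoder on a language model's internal activations, so that the decoder's columns form a learned dictionary of (hopefully interpretable) directions. Below is a minimal sketch of that setup, assuming PyTorch; the class and function names, dimensions, and L1 coefficient are illustrative placeholders, not values or code from either paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Encoder maps activations to an overcomplete set of feature
        # coefficients; the decoder's columns are the dictionary directions.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful to the activations;
    # the L1 term pushes each input to be explained by only a few features.
    reconstruction = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return reconstruction + l1_coeff * sparsity

# Usage sketch: x stands in for a batch of MLP or residual-stream
# activations collected from the language model under study.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
x = torch.randn(64, 512)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```

Both teams' methods are variants of this recipe (differing in details such as bias handling, weight tying, and training data), which is what makes the side-by-side comparison below possible.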
Some disclaimers: This list may be incomplete. I didn't give Anthropic a chance to give feedback on this, so I may have misrepresented some of their work, including by omission. Any mistakes are my own fault.
Source: https://www.lesswrong.com/posts/F4iogK5xdNd7jDNyw/comparing-anthropic-s-dictionary-learning-to-ours
Narrated for LessWrong by TYPE III AUDIO.