
What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html
Topics we discuss, and timestamps:
0:00:41 APD basics
0:07:57 Faithfulness
0:11:10 Minimality
0:28:44 Simplicity
0:34:50 Concrete-ish examples of APD
0:52:00 Which parts of APD are canonical
0:58:10 Hyperparameter selection
1:06:40 APD in toy models of superposition
1:14:40 APD and compressed computation
1:25:43 Mechanisms vs representations
1:34:41 Future applications of APD?
1:44:19 How costly is APD?
1:49:14 More on minimality training
1:51:49 Follow-up work
2:05:24 APD on giant chain-of-thought models?
2:11:27 APD and "features"
2:14:11 Following Lee's work
Lee links (Leenks):
X/Twitter: https://twitter.com/leedsharkey
Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey
Research we discuss:
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926
Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476
Feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis
Episode art by Hamish Doodles: hamishdoodles.com