
What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html
Topics we discuss, and timestamps:
0:00:41 APD basics
0:07:57 Faithfulness
0:11:10 Minimality
0:28:44 Simplicity
0:34:50 Concrete-ish examples of APD
0:52:00 Which parts of APD are canonical
0:58:10 Hyperparameter selection
1:06:40 APD in toy models of superposition
1:14:40 APD and compressed computation
1:25:43 Mechanisms vs representations
1:34:41 Future applications of APD?
1:44:19 How costly is APD?
1:49:14 More on minimality training
1:51:49 Follow-up work
2:05:24 APD on giant chain-of-thought models?
2:11:27 APD and "features"
2:14:11 Following Lee's work
Lee links (Leenks):
X/Twitter: https://twitter.com/leedsharkey
Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey
Research we discuss:
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926
Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476
SAE feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis
Episode art by Hamish Doodles: hamishdoodles.com