
What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html
Topics we discuss, and timestamps:
0:00:41 APD basics
0:07:57 Faithfulness
0:11:10 Minimality
0:28:44 Simplicity
0:34:50 Concrete-ish examples of APD
0:52:00 Which parts of APD are canonical
0:58:10 Hyperparameter selection
1:06:40 APD in toy models of superposition
1:14:40 APD and compressed computation
1:25:43 Mechanisms vs representations
1:34:41 Future applications of APD?
1:44:19 How costly is APD?
1:49:14 More on minimality training
1:51:49 Follow-up work
2:05:24 APD on giant chain-of-thought models?
2:11:27 APD and "features"
2:14:11 Following Lee's work
Lee links (Leenks):
X/Twitter: https://twitter.com/leedsharkey
Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey
Research we discuss:
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926
Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476
SAE feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis
Episode art by Hamish Doodles: hamishdoodles.com
By Daniel Filan
