
TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: it seems likely to find features of the activations (features that help explain the statistical structure of activation spaces) rather than features of the model (the features the model's own computations actually make use of).
Written at Apollo Research
Introduction
Claim: Activation space interpretability is likely to give us features of the activations, not features of the model, and this is a problem.
Let's walk through this claim.
What do we mean by activation space interpretability? Interpretability work that attempts to understand neural networks by explaining the inputs and outputs of their layers in isolation. In this post, we focus in particular on the problem of decomposing activations, via techniques such as sparse autoencoders (SAEs), PCA, or just by looking at individual neurons. This [...]
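To make the kind of decomposition being discussed concrete, here is a minimal sketch (not the authors' setup) of fitting a small sparse autoencoder to activations from a single layer, treated in isolation. The shapes, hyperparameters, and the synthetic activation matrix are illustrative assumptions, not taken from the post.

```python
# Minimal illustrative sketch: decompose one layer's activation space with a
# sparse autoencoder (SAE). The activations here are synthetic stand-ins.
import torch
import torch.nn as nn

d_model, d_dict, n_samples = 64, 256, 10_000

# Stand-in for activations collected at one layer of a network.
acts = torch.randn(n_samples, d_model)

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        codes = torch.relu(self.enc(x))   # sparse feature activations
        recon = self.dec(codes)           # reconstruction of the input activations
        return recon, codes

sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity penalty weight (arbitrary choice for this sketch)

for step in range(1_000):
    batch = acts[torch.randint(0, n_samples, (256,))]
    recon, codes = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned dictionary directions are "features of the activations": they are
# chosen to explain the statistical structure of this activation matrix, whether
# or not the model's own downstream computation actually uses them.
```

The point of the sketch is that the optimization objective only ever sees the activation matrix itself, which is exactly the sense in which such methods operate on a layer "in isolation".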
---
Outline:
(00:33) Introduction
(02:40) Examples illustrating the general problem
(12:29) The general problem
(13:26) What can we do about this?
The original text contained 11 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.