


A short summary of the paper is presented below.
This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland).
TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
Introduction. Current SAEs focus on the wrong goal: They are trained to minimize mean squared reconstruction error [...]
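To make the contrast in the TL;DR concrete, here is a minimal sketch of the two training objectives. It is an illustration and not the paper's implementation: the function names, the L1 form of the sparsity penalty, and the `sparsity_coeff` parameter are assumptions made for this example. The only parts taken from the summary are that standard SAEs minimize mean squared reconstruction error and that e2e SAEs minimize the KL divergence between the original model's output distribution and the output distribution of the model with SAE activations inserted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(p_logits, q_logits):
    """KL(p || q) between the softmax distributions of two logit vectors."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1(features):
    """L1 norm of the SAE feature activations (assumed sparsity penalty)."""
    return sum(abs(f) for f in features)


def standard_sae_loss(activations, reconstruction, features, sparsity_coeff):
    # Standard objective: reconstruct the activations at the SAE's layer.
    return mse(activations, reconstruction) + sparsity_coeff * l1(features)


def e2e_sae_loss(original_logits, patched_logits, features, sparsity_coeff):
    # e2e objective: match the model's output distribution after the SAE
    # output has been inserted in place of the original activations.
    # patched_logits are assumed to come from that modified forward pass.
    return kl_divergence(original_logits, patched_logits) + sparsity_coeff * l1(features)
```

In this sketch the e2e loss never looks at the reconstruction directly. A feature that changes the activations but not the output distribution contributes nothing to the KL term, which is one way to read the claim that the learned features are "functionally important".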
---
Outline:
(04:40) Key Results
(08:13) Acknowledgements
(08:50) Extras
---
Narrated by TYPE III AUDIO.
By LessWrong
