


A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique for finding interpretable features in language models (Cunningham et al.; Anthropic's Bricken et al.). We train a sparse autoencoder on OthelloGPT, a language model trained on transcripts of the board game Othello, which has been shown to contain a linear representation of the board state that supervised probes can recover. The sparse autoencoder finds 9 features that serve as high-accuracy classifiers of the board state, out of 180 findable with supervised probes (and 192 possible piece/position combinations). Across random seeds, the autoencoder repeatedly finds "simpler" features concentrated on the center of the board and the corners. This suggests that current sparse autoencoder techniques may fail to find a large majority of the interesting, interpretable features in a language model.
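To make the setup concrete, here is a minimal sketch of the sparse autoencoder architecture described above: a ReLU encoder maps model activations to an overcomplete feature dictionary, a linear decoder reconstructs them, and the training loss combines reconstruction error with an L1 sparsity penalty. The dimensions, initialization, and penalty weight here are illustrative assumptions, not the values used in the report.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 512, 1024   # residual-stream width and dictionary size (hypothetical)
l1_coeff = 1e-3                 # sparsity penalty weight (hypothetical)

# SAE parameters: encoder and decoder are separate linear maps
W_enc = rng.normal(0, 0.02, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.02, (d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative features, then reconstruct linearly."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU feature activations
    x_hat = f @ W_dec + b_dec                # linear reconstruction
    return f, x_hat

def sae_loss(x):
    """Reconstruction error plus L1 penalty that pushes features toward sparsity."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = np.mean(np.abs(f))
    return recon + l1_coeff * sparsity

# Stand-in for a batch of OthelloGPT residual-stream activations
batch = rng.normal(size=(8, d_model))
loss = sae_loss(batch)
```

The interpretability question is then whether individual columns of `W_dec` (the learned dictionary directions) correspond to human-meaningful properties such as "square D4 holds a black piece", which is what the board-state classifier analysis tests.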
Introduction. There has been a recent [...]
---
Outline:
(02:56) Overview
(03:39) Methods
(03:42) Training OthelloGPT
(04:50) Training Linear Probes
(08:03) Training The Sparse Autoencoder
(09:08) Results
(09:11) SAE Features as Current-Board Classifiers
(12:33) Which Features are Learned?
(14:39) SAE Features as Legal-Move Classifiers
(16:54) Cosine Similarities
(18:03) One Really Cool Case Study
(19:51) Conclusion
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
Linkpost URL:
https://aizi.substack.com/p/research-report-sparse-autoencoders
Narrated by TYPE III AUDIO.
By LessWrong
