March 06, 2024

[Linkpost] “Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT” by Robert_AIZI

22 minutes

This is a linkpost for https://aizi.substack.com/p/research-report-sparse-autoencoders

Abstract.

A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique to find interpretable features in language models (Cunningham et al.; Bricken et al. at Anthropic). We train a sparse autoencoder on OthelloGPT, a language model trained on transcripts of the board game Othello, which has been shown to contain a linear representation of the board state, findable by supervised probes. The sparse autoencoder finds 9 features that serve as high-accuracy classifiers of the board state, out of 180 findable with supervised probes (and 192 possible piece/position combinations). Across random seeds, the autoencoder repeatedly finds “simpler” features concentrated on the center of the board and the corners. This demonstrates that current techniques for sparse autoencoders may fail to find a large majority of the interesting, interpretable features in a language model.
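To make "a feature serves as a high-accuracy classifier of the board state" concrete, here is a minimal sketch of one way such a check could be run. It is not the author's evaluation code: the function name `best_threshold_accuracy`, the thresholding rule, and the toy data are all assumptions for illustration. The idea is to take one feature's activation at each position in a set of games, compare it against a yes/no board-state label (for example, "this square holds a given piece"), and report the best accuracy any single activation threshold achieves.

```python
from typing import Sequence, Tuple


def best_threshold_accuracy(
    activations: Sequence[float], labels: Sequence[bool]
) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the rule `activation > threshold`
    that best matches the boolean labels. Hypothetical helper, for illustration."""
    if len(activations) != len(labels) or not activations:
        raise ValueError("need equally sized, non-empty inputs")
    n = len(labels)
    # Candidate thresholds: one below every value (predict all True),
    # plus each distinct observed activation.
    candidates = [min(activations) - 1.0] + sorted(set(activations))
    best_threshold, best_accuracy = candidates[0], 0.0
    for t in candidates:
        correct = sum((a > t) == y for a, y in zip(activations, labels))
        accuracy = correct / n
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


# Toy data (made up): one feature's activations and a binary board-state label.
acts = [0.0, 0.1, 0.0, 2.3, 1.9, 0.0, 2.8, 0.2]
labels = [False, False, False, True, True, False, True, False]
threshold, accuracy = best_threshold_accuracy(acts, labels)
print(threshold, accuracy)
```

On real data one would repeat this for every feature and every piece/position label and count how many labels have at least one feature above some chosen accuracy cutoff; the cutoff itself is a judgment call that this sketch does not settle.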

Introduction.

There has been a recent [...]

---

Outline:

(02:56) Overview

(03:39) Methods

(03:42) Training OthelloGPT

(04:50) Training Linear Probes

(08:03) Training The Sparse Autoencoder

(09:08) Results

(09:11) SAE Features as Current-Board Classifiers

(12:33) Which Features are Learned?

(14:39) SAE Features as Legal-Move Classifiers

(16:54) Cosine Similarities

(18:03) One Really Cool Case Study

(19:51) Conclusion

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

March 5th, 2024

Source:

https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/research-report-sparse-autoencoders-find-only-9-180-board

Linkpost URL:
https://aizi.substack.com/p/research-report-sparse-autoencoders

---

Narrated by TYPE III AUDIO.


By LessWrong

