LessWrong (30+ Karma)

[Linkpost] “Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT” by Robert_AIZI



This is a linkpost for https://aizi.substack.com/p/research-report-sparse-autoencoders

Abstract.

A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique for finding interpretable features in language models (Cunningham et al.; Anthropic's Bricken et al.). We train a sparse autoencoder on OthelloGPT, a language model trained on transcripts of the board game Othello, which has been shown to contain a linear representation of the board state that supervised probes can find. The sparse autoencoder finds 9 features that serve as high-accuracy classifiers of the board state, out of 180 findable with supervised probes (and 192 possible piece/position combinations). Across random seeds, the autoencoder repeatedly finds “simpler” features concentrated on the center of the board and the corners. This demonstrates that current techniques for sparse autoencoders may fail to find a large majority of the interesting, interpretable features in a language model.
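For readers unfamiliar with the architecture, here is a minimal sketch of a sparse autoencoder's forward pass and training loss. It assumes the common single-hidden-layer design (a ReLU encoder, a linear decoder, and an L1 sparsity penalty); this summary does not state the report's exact architecture or hyperparameters, so every name, shape, and coefficient below is a hypothetical chosen for illustration.

```python
# Illustrative sketch only: a generic sparse autoencoder, not the report's code.
# All names (sae_forward, sae_loss, l1_coeff, ...) are hypothetical.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector x into non-negative features, then reconstruct it.

    W_enc has shape (n_features, d_model); W_dec has shape (d_model, n_features).
    """
    pre = matvec(W_enc, x)
    features = relu([p + b for p, b in zip(pre, b_enc)])
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the feature activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Tiny hand-made example: a 2-dimensional activation and 3 candidate features.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
print(features, recon, loss)
```

In this toy input only the first feature is active. The L1 term is what encourages that sparsity during training, and a learned feature's activation can then be tested as a classifier for a particular board-state property.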

Introduction.

There has been a recent [...]





---

Outline:

(02:56) Overview

(03:39) Methods

(03:42) Training OthelloGPT

(04:50) Training Linear Probes

(08:03) Training The Sparse Autoencoder

(09:08) Results

(09:11) SAE Features as Current-Board Classifiers

(12:33) Which Features are Learned?

(14:39) SAE Features as Legal-Move Classifiers

(16:54) Cosine Similarities

(18:03) One Really Cool Case Study

(19:51) Conclusion

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

March 5th, 2024

Source:

https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/research-report-sparse-autoencoders-find-only-9-180-board

Linkpost URL:
https://aizi.substack.com/p/research-report-sparse-autoencoders

---

Narrated by TYPE III AUDIO.


