LessWrong (30+ Karma)

“EIS XIV: Is mechanistic interpretability about to be practically useful?” by scasper


Is this market really only at 63%? I think you should take the over.

Five tiers of rigor for safety-oriented interpretability work

Lately, I have been thinking of interpretability research as falling into five different tiers of rigor.

1. Pontification

This is when researchers claim to have interpreted a model either by definition or by analyzing results and asserting hypotheses about them. Generating hypotheses is a key part of the scientific method, but by itself it is not good science. Previously in this sequence, I have argued that this standard is fairly pervasive.

2. Basic Science

This is when researchers develop an interpretation, use it to make some (usually simple) prediction, and then show that the prediction holds. This is at least doing science, but it doesn't necessarily demonstrate any usefulness or value.

3. Streetlight/Toy Demos

This is [...]

---

Outline:

(00:21) Five tiers of rigor for safety-oriented interpretability work

(00:33) 1. Pontification

(00:58) 2. Basic Science

(01:16) 3. Streetlight/Toy Demos

(01:30) 4. Useful Engineering

(01:55) 5. Net Safety Benefit

(02:24) What's been happening lately?

(02:28) Recently, some solid work has been done in tier 3.

(04:28) I think that tier 4 has (barely) been broken into.

(05:08) Current efforts may soon break further into tier 4.

(06:12) What might happen next with SAEs

(06:16) Past predictions

(08:38) New predictions

(11:41) What if we succeed?

---

First published:

October 11th, 2024

Source:

https://www.lesswrong.com/posts/Es2qzCxhJ8QYsckaA/eis-xiv-is-mechanistic-interpretability-about-to-be

---

Narrated by TYPE III AUDIO.

