
Is this market really only at 63%? I think you should take the over.
Five tiers of rigor for safety-oriented interpretability work
Lately, I have been thinking of interpretability research as falling into five different tiers of rigor.
1. Pontification
This is when researchers claim they have succeeded in interpreting a model by definition, or by analyzing results and asserting hypotheses about them. Generating hypotheses is a key part of the scientific method, but by itself it is not good science. Previously in this sequence, I have argued that this standard is fairly pervasive.
2. Basic Science
This is when researchers develop an interpretation, use it to make some (usually simple) prediction, and then show that the prediction holds. This is at least doing science, but it doesn't necessarily demonstrate any usefulness or value.
3. Streetlight/Toy Demos
This is [...]
---
Outline:
(00:21) Five tiers of rigor for safety-oriented interpretability work
(00:33) 1. Pontification
(00:58) 2. Basic Science
(01:16) 3. Streetlight/Toy Demos
(01:30) 4. Useful Engineering
(01:55) 5. Net Safety Benefit
(02:24) What's been happening lately?
(02:28) Recently, some solid work has been done in tier 3.
(04:28) I think that tier 4 has (barely) been broken into.
(05:08) Current efforts may soon break further into tier 4.
(06:12) What might happen next with SAEs
(06:16) Past predictions
(08:38) New predictions
(11:41) What if we succeed?
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.