


In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they’re performing intended cognition, instead of whether they have intended input/output behavior.
In the rest of this post I will:
---
Outline:
(01:41) Cognition-based oversight
(02:10) Discriminating models: a simplification of scalable oversight
(05:20) Cognition-based oversight for discriminating behaviorally identical models
(07:02) Discriminating Behaviorally Identical Classifiers
(11:51) Hard vs. Relaxed DBIC
(13:48) SHIFT as a technique for (hard) DBIC
(19:09) Limitations and next steps
(21:13) Direction 1: better ways to understand interpretable units in deep networks
(23:02) Direction 2: identifying especially leveraged settings for cognition-based oversight
(24:47) Conclusion
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
