In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they’re performing intended cognition, instead of whether they have intended input/output behavior.
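To make the core move concrete, here is a minimal toy sketch in PyTorch. This is not the paper's implementation: the SAE encoder (`W_enc`), the linear readout (`w_clf`), and the list `spurious_features` are illustrative stand-ins, and the actual pipeline first uses attribution methods plus human inspection to decide which features to trim.

```python
# A minimal, self-contained sketch of the SHIFT trimming step (toy stand-ins,
# not the paper's code). A classifier reads a model activation, maps it into
# a sparse autoencoder's feature space, and scores it linearly; SHIFT edits
# the classifier by zero-ablating features a human judges irrelevant to the
# intended task.

import torch

torch.manual_seed(0)
d_model, n_features = 64, 256

# Stand-ins for a trained SAE encoder and a downstream linear classifier.
W_enc = torch.randn(d_model, n_features)  # activation -> feature space
w_clf = torch.randn(n_features)           # feature space -> logit

def classify(act: torch.Tensor, ablate=()) -> torch.Tensor:
    """Classifier logit, optionally with SHIFT-style feature ablation."""
    feats = torch.relu(act @ W_enc)       # sparse-ish feature activations
    for f in ablate:
        feats[..., f] = 0.0               # zero out the trimmed features
    return feats @ w_clf

# Hypothetical: features that attribution plus human inspection flagged as
# tracking an unintended signal rather than the intended task.
spurious_features = [3, 17, 42]

acts = torch.randn(8, d_model)            # a batch of model activations
logits_orig = classify(acts)                             # original classifier
logits_shift = classify(acts, ablate=spurious_features)  # SHIFT-edited
```

In the paper, an edit of this shape is applied to a profession classifier whose flagged features tracked gender, an unintended signal; the sketch above shows only the final zero-ablation step, not the attribution and human-evaluation stages that select which features to trim.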
In the rest of this post I will:
- Articulate a class of approaches to scalable oversight I call cognition-based oversight.
- Zero in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated as a concrete problem that I think captures most of the technical difficulty [...]
---
Outline:
- Cognition-based oversight
- Discriminating models: a simplification of scalable oversight
- Cognition-based oversight for discriminating behaviorally identical models
- Discriminating Behaviorally Identical Classifiers
- Hard vs. Relaxed DBIC
- SHIFT as a technique for (hard) DBIC
- Limitations and next steps
  - Direction 1: better ways to understand interpretable units in deep networks
  - Direction 2: identifying especially leveraged settings for cognition-based oversight
- Conclusion
---