
This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.
Executive Summary
---
Outline:
(00:18) Executive Summary
(02:33) Introduction
(04:55) Evaluating our SAEs for IOI Circuit Analysis
(09:27) Case Study: Applying SAEs for Deeper Understanding of IOI
(11:29) Identifying the Positional Features with SAEs
(13:01) Interpreting the “Positional” Features
(14:35) Confirming the Hypothesis
(16:36) Applying SAEs to QK circuits: S-Inhibition Heads Sometimes Do IO-Boosting
(21:47) Discovering Attention Feature Circuits with Recursive DFA
(22:59) Understanding Recursive DFA
(27:24) Example: Attention feature 3.15566 decomposes into gender and name information
(29:54) Example: Retrieving Dave using recursive attention attribution.
(32:18) Even Cooler Examples: Win a $1000 Bounty!
(33:02) Related Work
(35:09) Limitations
(35:52) Future Work
(38:12) Citing this work
(38:25) Author Contributions Statement
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.