
These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
TL;DR
In roughly descending order of importance:
---
Outline:
(00:16) TL;DR
(02:23) A Feature Isn't Its Highest Activating Examples
(04:38) Finding Specific Features
(06:02) Architecture - The Classics, but Wider
(07:26) Correlations - Strangely Large?
(09:48) Future Tests
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.