
These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
TL;DR
In roughly descending order of importance:
---
Outline:
(00:16) TL;DR
(02:23) A Feature Isn't Its Highest Activating Examples
(04:38) Finding Specific Features
(06:02) Architecture - The Classics, but Wider
(07:26) Correlations - Strangely Large?
(09:48) Future Tests
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
