
This work was produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with support from Neel Nanda and Arthur Conmy. Joseph Bloom is funded by the LTFF, Manifund Regranting Program, donors and LightSpeed Grants. This post makes extensive use of Neuronpedia, a platform for interpretability focusing on accelerating interpretability researchers working with SAEs.
Links: SAEs on HuggingFace, Analysis Code
Executive Summary
This is an informal post sharing statistical methods that can be used to quickly and cheaply gain a better understanding of Sparse Autoencoder (SAE) features.
---
Outline:
(00:39) Executive Summary
(05:38) Characterizing Features via the Logit Weight Distribution
(09:57) Token Set Enrichment Analysis
(10:26) What is Token Set Enrichment Analysis?
(10:46) Method Steps
(12:18) Case Studies
(24:01) Discussion
(24:04) Limitations
(25:34) Future Work
(29:36) Appendix
(29:39) Thanks
(30:08) How to Cite
(30:14) Glossary
(31:02) Prior Work
(32:37) Token Set Enrichment Analysis: Inspiration and Technical Details
(32:45) Inspiration
(34:16) Technical Details
(35:12) Results by Layer
---
First published:
Source:
Narrated by TYPE III AUDIO.