
This work was produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with support from Neel Nanda and Arthur Conmy. Joseph Bloom is funded by the LTFF, Manifund Regranting Program, donors and LightSpeed Grants. This post makes extensive use of Neuronpedia, a platform for interpretability focusing on accelerating interpretability researchers working with SAEs.
Links: SAEs on HuggingFace, Analysis Code
Executive Summary
This is an informal post sharing statistical methods that can be used to understand Sparse Autoencoder (SAE) features quickly and cheaply.
---
Outline:
(00:39) Executive Summary
(05:38) Characterizing Features via the Logit Weight Distribution
(09:57) Token Set Enrichment Analysis
(10:26) What is Token Set Enrichment Analysis?
(10:46) Method Steps
(12:18) Case Studies
(24:01) Discussion
(24:04) Limitations
(25:34) Future Work
(29:36) Appendix
(29:39) Thanks
(30:08) How to Cite
(30:14) Glossary
(31:02) Prior Work
(32:37) Token Set Enrichment Analysis: Inspiration and Technical Details
(32:45) Inspiration
(34:16) Technical Details
(35:12) Results by Layer
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
