
This part-report / part-proposal describes ongoing research, but I'd like to share early results for feedback. I am especially interested in any comments pointing out mistakes or trivial explanations for these results. I will work on this proposal with a LASR Labs team over the next 3 months. If you are working (or want to work) on something similar I would love to chat!
Experiments and write-up by Stefan, with substantial inspiration and advice from Jake (who doesn’t necessarily endorse every sloppy statement I write). Work produced at Apollo Research.
TL;DR: Toy models of how neural networks compute new features in superposition seem to imply that networks utilizing superposition require some form of error correction to keep interference from spiraling out of control. If so, small variations along a feature direction shouldn't affect model outputs, which I can test.
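The test amounts to: take a real activation, nudge it along some direction by a small amount, and measure how much the model's output distribution moves (e.g. via KL divergence). Below is a minimal sketch of that measurement in PyTorch; it uses a random linear head as a stand-in for the rest of a trained network, and all names and values in it are hypothetical illustrations rather than the actual methodology (which is described in the appendix).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_hidden, d_out = 64, 10
# Stand-in for the layers downstream of the perturbed activation
# (in a real experiment this would be the rest of a trained model).
head = torch.nn.Linear(d_hidden, d_out)

base_act = torch.randn(d_hidden)                       # an activation at some layer
direction = F.normalize(torch.randn(d_hidden), dim=0)  # unit perturbation direction

base_logprobs = F.log_softmax(head(base_act), dim=-1)

for eps in [0.01, 0.1, 0.5, 1.0, 5.0]:
    perturbed_logprobs = F.log_softmax(head(base_act + eps * direction), dim=-1)
    # KL(base || perturbed): how far the output distribution has moved
    kl = F.kl_div(perturbed_logprobs, base_logprobs, log_target=True, reduction="sum")
    print(f"eps={eps:5.2f}  KL={kl.item():.6f}")
```

If activation plateaus exist, a trained model should show near-zero KL for small eps (error correction absorbs the perturbation) and a sharp rise beyond some threshold; the random linear head above would instead show a smooth increase, so it only illustrates the measurement, not the claimed result.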
---
Outline:
(02:22) Core results and discussion
(06:21) Proposal: Connecting SAEs to model behaviour
(09:19) Conclusion
(11:35) Appendix
(11:38) Methodology
(18:17) Detailed results
(18:21) 1. Activation Plateaus
(20:05) 2. Sensitive directions
(22:02) 3. Local optima in sensitivity
The original text contained 3 footnotes which were omitted from this narration.
The original text contained 7 images which were described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.