
This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel.
Executive Summary
---
Outline:
(00:19) Executive Summary
(02:07) Introduction
(04:10) Methodology: Training chat-data specific SAEs
(07:42) Evaluating chat SAEs on the refusal direction reconstruction task
(08:37) Chat data SAEs find more faithful refusal direction reconstructions
(11:11) Chat data SAEs find sparser refusal direction reconstructions
(13:22) Chat data SAEs find more interpretable decompositions of the refusal direction
(16:09) Chat data SAE refusal latents are better for steering
(21:03) The chat dataset is more important than the chat model
(23:44) Related Work
(25:45) Conclusion
(26:32) Limitations
(27:24) Future work
(28:25) Citing this work
(28:36) Author Contributions Statement
(29:23) Acknowledgments
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.