
Audio note: this article contains 53 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.
*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.
TL;DR: We benchmark seven methods for preventing emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate it. Merely including alignment data in training-data mixes is insufficient to prevent misalignment, yet a simple KL divergence penalty on alignment data outperforms more sophisticated methods.
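The KL divergence penalty mentioned in the TL;DR can be sketched roughly as follows. This is a minimal illustration, not the authors' exact setup: the penalty weight `beta`, the direction of the divergence (reference relative to fine-tuned model), and all function names are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the vocabulary axis, numerically stabilized."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_penalty(student_logits, reference_logits):
    """Mean KL(reference || student) over tokens of an alignment batch.

    Penalizes the fine-tuned (student) model for drifting from the
    frozen reference model's next-token distribution on alignment data.
    """
    p = softmax(reference_logits)  # reference distribution
    q = softmax(student_logits)    # fine-tuned distribution
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def total_loss(task_loss, student_logits, reference_logits, beta=0.1):
    """Narrow task loss plus a KL penalty on held-out alignment data."""
    return task_loss + beta * kl_penalty(student_logits, reference_logits)
```

For example, if the fine-tuned model's logits on alignment prompts match the reference model's exactly, the penalty is zero and the total loss reduces to the task loss; any drift adds a non-negative term scaled by `beta`.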
Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.
Introduction
Training to improve capabilities [...]
---
Outline:
(01:27) Introduction
(03:36) Our Experiments
(04:23) Formalizing the Objective
(05:20) Can we solve the problem just by training on our limited alignment data?
(05:55) Seven methods for selective generalization
(07:06) Plotting the capability-alignment tradeoff
(07:26) Preventing Emergent Misalignment
(10:33) Preventing Sycophantic Generalization from an Underspecified Math Dataset
(14:02) Limitations
(14:40) Takeaways
(15:49) Related Work
(17:00) Acknowledgements
(17:22) Appendix
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.