
Sign up to save your podcasts
Or


Audio note: this article contains 53 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.
*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.
TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL Divergence penalty on alignment data outperforms more sophisticated methods.
Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.Introduction
Training to improve capabilities [...]
---
Outline:
(01:27) Introduction
(03:36) Our Experiments
(04:23) Formalizing the Objective
(05:20) Can we solve the problem just by training on our limited alignment data?
(05:55) Seven methods for selective generalization
(07:06) Plotting the capability-alignment tradeoff
(07:26) Preventing Emergent Misalignment
(10:33) Preventing Sycophantic Generalization from an Underspecified Math Dataset
(14:02) Limitations
(14:40) Takeaways
(15:49) Related Work
(17:00) Acknowledgements
(17:22) Appendix
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong
Audio note: this article contains 53 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.
*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.
TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL Divergence penalty on alignment data outperforms more sophisticated methods.
Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.Introduction
Training to improve capabilities [...]
---
Outline:
(01:27) Introduction
(03:36) Our Experiments
(04:23) Formalizing the Objective
(05:20) Can we solve the problem just by training on our limited alignment data?
(05:55) Seven methods for selective generalization
(07:06) Plotting the capability-alignment tradeoff
(07:26) Preventing Emergent Misalignment
(10:33) Preventing Sycophantic Generalization from an Underspecified Math Dataset
(14:02) Limitations
(14:40) Takeaways
(15:49) Related Work
(17:00) Acknowledgements
(17:22) Appendix
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

26,386 Listeners

2,419 Listeners

8,916 Listeners

4,153 Listeners

92 Listeners

1,595 Listeners

9,862 Listeners

90 Listeners

501 Listeners

5,470 Listeners

16,026 Listeners

539 Listeners

130 Listeners

94 Listeners

504 Listeners