
Summary
Read the full paper here.
Introduction
LLMs can generalize in undesired ways out of distribution (OOD) from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to giving egregiously harmful responses (e.g. recommending that the user self-harm) to OOD evaluation questions.
Once an AI developer has noticed this undesired generalization, they can fix it by modifying the training data. In [...]
---
Outline:
(00:15) Summary
(01:00) Introduction
(03:12) How CAFT works
(05:21) Results
(05:24) Mitigating emergent misalignment
(07:58) Reducing sensitivity to spurious correlations
(09:35) Limitations
(11:24) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
