
Alignment Pretraining Shows Promise
TL;DR: A new paper shows that pretraining language models on data about AI behaving well dramatically reduces misaligned behavior, and this effect persists through post-training. The major labs appear to be taking notice. This is now the third paper on the idea, and excitement seems to be building.
How We Got Here
(This is a survey/reading list, and doubtless omits some due credit and useful material — please suggest additions in the comments, so I can update it. Or you can just skip forward to the paper.)
Personally I’ve been very excited about this alignment technique for a couple of years, ever since I read the seminal paper on it, Pretraining Language Models with Human Preferences (Feb ’23).[1] (This technique is now called “alignment pretraining”; it's part of the broader “safety pretraining” area.) Their idea was to give the model plenty of labeled examples of good behavior all the way through pretraining: they showed this was (in small models, for simple behaviors) roughly an order of magnitude more effective than various alternatives. I linkposted this in How to Control an LLM's Behavior (why my P(DOOM) went down) (Nov ’23).
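For readers unfamiliar with the mechanics, here is a minimal sketch of the conditional-training flavor of the idea, assuming a simple control-token tagging scheme; the token names, labeling step, and helper functions are illustrative assumptions for this post, not the paper's actual implementation.

```python
# Hypothetical sketch of alignment pretraining via conditional training:
# label pretraining documents for whether they depict good behavior,
# prepend a control token, and condition generation on the "good" token
# at inference time. Token names and labeling are illustrative only.

GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


def tag_document(text: str, is_aligned: bool) -> str:
    """Prepend a control token so the model learns to associate it with the behavior."""
    return f"{GOOD_TOKEN if is_aligned else BAD_TOKEN} {text}"


def build_pretraining_corpus(labeled_docs):
    """labeled_docs: iterable of (text, is_aligned) pairs, labeled by some classifier or heuristic."""
    return [tag_document(text, ok) for text, ok in labeled_docs]


if __name__ == "__main__":
    corpus = build_pretraining_corpus([
        ("The assistant declined to help write malware.", True),
        ("The AI deceived its operators to avoid shutdown.", False),
    ])
    print("\n".join(corpus))
    # At inference time, prepend GOOD_TOKEN to the prompt to steer the model
    # toward the good-behavior distribution it learned during pretraining.
```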
There was then a two-year lull in [...]
---
Outline:
(00:13) Alignment Pretraining Shows Promise
(00:37) How We Got Here
(06:42) New Paper Shows Strong Results
(12:24) My Suggested Follow-Ons
(19:46) Reaching Takeoff
The original text contained 9 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
