


This is a link-post for an exciting paper I recently read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.
For a couple of years I (and others) have been proposing an approach to alignment that the authors of this recent paper name "safety pretraining". In a nutshell: it's best to apply your alignment training as part of the standard pretraining process, so that the resulting base model is already aligned. Simply pretrain it on a lot of examples of aligned behavior.
I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic, Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and I'm absolutely delighted to finally see someone else publish another paper on it. I'm only sad it has taken so long.
I highly encourage [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
