Here's a fairly concrete AGI safety proposal:
Default AGI design: Suppose we start with a pretrained LLM 'base model' and then do a ton of additional RL ('agency training') to turn it into a general-purpose autonomous agent. During training, it will do lots of chain-of-thought (CoT) 'reasoning' (think of how o1 does it) and then output some text that the user or some external interface sees (e.g. typing into a browser or a chat window). Then it may get some external input (the user's reply, etc.), and the process repeats many times. Finally, some process evaluates overall performance (by looking at the entire trajectory as well as the final result) and doles out reinforcement.
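The default training loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's code: `run_episode`, `Step`, `Trajectory`, and the toy policy, environment, and evaluator are hypothetical stand-ins. It assumes only the visible output and the external reply are carried forward between turns, and it leaves out the actual RL update that would use the reward.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types: a policy maps the transcript so far to (hidden CoT, visible output).
Policy = Callable[[List[str]], Tuple[str, str]]
Environment = Callable[[str], str]


@dataclass
class Step:
    cot: str          # chain-of-thought "reasoning" (not shown to the user)
    output: str       # text the user or external interface sees
    observation: str  # external input that comes back (e.g. the user's reply)


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

    @property
    def final_result(self) -> str:
        return self.steps[-1].output if self.steps else ""


def run_episode(policy: Policy, environment: Environment,
                evaluator: Callable[[Trajectory], float],
                prompt: str, max_turns: int) -> Tuple[Trajectory, float]:
    """Roll out one episode, then score the whole trajectory once at the end."""
    transcript = [prompt]
    traj = Trajectory()
    for _ in range(max_turns):
        cot, output = policy(transcript)
        observation = environment(output)
        traj.steps.append(Step(cot, output, observation))
        # Assumption: only the visible output and the external reply are carried forward.
        transcript += [output, observation]
    # The evaluator sees the entire trajectory plus the final result.
    # The RL update that "doles out reinforcement" from this reward is omitted here.
    reward = evaluator(traj)
    return traj, reward


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_policy(transcript: List[str]) -> Tuple[str, str]:
    n = len(transcript)
    return f"considering {n} messages", f"action {(n - 1) // 2}"


def toy_environment(output: str) -> str:
    return f"ok: {output}"


def toy_evaluator(traj: Trajectory) -> float:
    if not traj.steps:
        return 0.0
    # Process score: fraction of turns that got a non-empty external reply.
    process = sum(1 for s in traj.steps if s.observation) / len(traj.steps)
    # Outcome score: did the episode end on the expected final action?
    outcome = 1.0 if traj.final_result == "action 2" else 0.0
    return 0.5 * process + 0.5 * outcome


traj, reward = run_episode(toy_policy, toy_environment, toy_evaluator, "task", max_turns=3)
```

Scoring once at the end, rather than per turn, mirrors the description of an evaluator that looks at the whole trajectory as well as the final result.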
Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work [...]
---
Outline:
(13:47) Killing Canaries
(16:40) Basins of Attraction
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong