
Redwood (where Ryan works) recently released a series of blog posts proposing a framing and research agenda, under the name "AI Control", for reducing AI risk. It focuses on ensuring safety (and secondarily usefulness) under the conservative assumption that AIs are misaligned and actively scheming against human interests.
This is in contrast to other work on AI risk, which focuses on reducing the probability that AI systems pursue goals that conflict with human values in the first place (which might include having them not pursue goals in the relevant sense at all), usually called "AI Alignment". In other words, control aims to ensure that even if your models are actively misaligned, you'll be safe, because they are not capable of subverting your safety measures.
In this dialogue we dig into our disagreements on the degree to which this kind of work seems promising, and whether/how this reframing opens [...]
---
Outline:
(01:23) The Case for Control Work
(22:48) Problems with training against bad behavior, sample efficiency, and exploration-hacking
(53:50) Aside on trading with AI systems
(57:12) How easily can you change the goals of an AI system with training?
(01:03:05) Appendix: Other comments on Ryan's shortform post
---
First published:
Source:
Narrated by TYPE III AUDIO.