
Suppose you can tell your AI to meet a certain spec (e.g. cure cancer), but most plans that meet the spec are unsafe (e.g. involve killing everyone, or so Rob Bensinger thinks). In these cases, a quantilizer is insufficient for safety due to instrumental convergence.[1]
But suppose we can also give the agent a dispreference for unsafe actions like murder. In effect, it has unsafe long-term goals, but we control its immediate preferences.[2] When can we get the agent to cure cancer safely rather than murder you? Let's build a model with some unrealistic assumptions.
The murderous shortcut game
This is basically the simplest game in which murder becomes more likely as the task gets longer.
---
Outline:
(00:44) The murderous shortcut game
(03:37) Takeaways
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.