These are my personal views. Thank you to Ryan Greenblatt, Holden Karnofsky, and Peter Wildeford for useful discussions. The bad takes are my own.
When deciding how much to spend on mitigating a vulnerability that a competent scheming AI might exploit, you might be tempted to rank the mitigation's importance by E[damages | AI decides to take advantage of the vulnerability].
But this misses how a strategic scheming AI might decide not to take advantage of some vulnerabilities, because being caught could reduce its chances of taking advantage of better future opportunities. So instead you might want to focus on making net-sabotage-value negative:
Focus on mitigating vulnerabilities that are net-positive from the scheming AI's perspective, that is, those for which the net-sabotage-value E[damages - value to humans from catching the AI | AI decides to take advantage of the vulnerability] is positive. Down-weight the importance of those for which the net-sabotage-value is negative.
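A minimal sketch of this prioritization rule, assuming hypothetical per-vulnerability estimates (the class, the example vulnerabilities, and all numbers below are illustrative, not taken from the post):

```python
# Sketch of the net-sabotage-value prioritization rule.
# All estimates are made up for illustration.

from dataclasses import dataclass

@dataclass
class Vulnerability:
    name: str
    expected_damages: float      # E[damages | AI exploits the vulnerability]
    expected_catch_value: float  # E[value to humans from catching the AI | AI exploits it]

    @property
    def net_sabotage_value(self) -> float:
        # Positive: exploiting the vulnerability is net-positive for the scheming AI.
        # Negative: getting caught costs the AI more than the damage is worth.
        return self.expected_damages - self.expected_catch_value

vulnerabilities = [
    Vulnerability("weight exfiltration attempt", expected_damages=9.0, expected_catch_value=3.0),
    Vulnerability("subtle code sabotage", expected_damages=2.0, expected_catch_value=5.0),
]

# Prioritize mitigations for vulnerabilities with positive net-sabotage-value;
# down-weight (but do not necessarily ignore) the rest.
for v in sorted(vulnerabilities, key=lambda v: v.net_sabotage_value, reverse=True):
    action = "prioritize" if v.net_sabotage_value > 0 else "down-weight"
    print(f"{v.name}: net-sabotage-value = {v.net_sabotage_value:+.1f} -> {action}")
```

Only the sign of the net-sabotage-value matters for the rule itself; the sorting is just for readability.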
Here is an illustration that paints a very simplified picture of the situation:
In practice, you should often be uncertain about whether a vulnerability has negative net-sabotage-value and about whether AIs know it does, and mitigating sabotage risk from [...]
---
Outline:
(02:34) The argument
(04:32) Main limitations
(05:48) Potential implication 1: Simple monitoring might be enough against early competent scheming AIs
(06:23) Toy case study: Default mitigations are all you need against superhuman coder persuasion?
(08:17) Potential implication 2: Your willingness to pay for control might be much higher if alignment becomes easy
(09:52) Toy case study: Superhuman coder persuasion mitigations can become more important when you get good alignment news
(11:05) Potential implication 3: Risk management advice might apply less than you think
(12:34) Potential implication 4: You might in principle not want to make AI control mitigations too good
(15:31) More generally, unconditional payments to AIs might be more tractable than you think
(16:57) Should you ignore negative net-sabotage-value vulnerabilities? (No)
(18:38) Toy case study: Advanced mitigations are still useful against non-scheming superhuman coder persuasion
(19:31) Final note: don't use this argument for wildly superintelligent AIs
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.