
(thanks to Alex Mallen, Cody Rushing, Zach Stein-Perlman, Hoagy Cunningham, Vlad Mikulik, and Fabien Roger for comments)
Sometimes I hear people argue against AI control as follows: if your control measures rely on getting good judgments from "trusted" AI models, you're doomed because AI models aren't adversarially robust. I agree that better adversarial robustness would definitely help with control, but I think this argument misses several important points; my overall take is that current levels of adversarial robustness don’t come close to dooming AI control.
The first and most important point is that adversarial robustness is normally researched in a setting that's much easier for the adversary. In most discussions, if an adversary can ever get a model to do something bad—even after many attempts and with a low success rate—this counts as a failure. But in real control settings, an AI that tries and fails to jailbreak might [...]
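To make that setting concrete, here is a minimal sketch of the kind of defer-to-trusted control loop the appendix below refers to. Everything in it (the function names, the thresholds, the random suspicion score) is a hypothetical stand-in rather than the post's actual protocol; the point it illustrates is only that a suspicious-looking output, such as a failed jailbreak attempt, gets replaced or escalated rather than silently executed.

```python
import random

# Hypothetical thresholds; a real control protocol would calibrate these empirically.
SUSPICION_THRESHOLD = 0.8  # above this, defer to the trusted model's output
AUDIT_THRESHOLD = 0.95     # above this, escalate the transcript for human audit


def untrusted_generate(task: str) -> str:
    """Stand-in for the untrusted (more capable, possibly scheming) model."""
    return f"proposed action for {task}"


def trusted_generate(task: str) -> str:
    """Stand-in for the trusted (weaker but non-scheming) model."""
    return f"safe fallback action for {task}"


def trusted_suspicion_score(task: str, action: str) -> float:
    """Stand-in for the trusted model rating how suspicious an action looks, in [0, 1]."""
    return random.random()


def run_defer_to_trusted(tasks: list[str]) -> list[tuple[str, str, float, str]]:
    """Minimal defer-to-trusted loop. The key property: a jailbreak attempt that
    fails to fool the monitor is not executed anyway; it is either replaced by
    the trusted model's output or logged as incriminating evidence for audit."""
    transcript = []
    for task in tasks:
        action = untrusted_generate(task)
        score = trusted_suspicion_score(task, action)
        if score >= AUDIT_THRESHOLD:
            # The attempt itself is evidence; hold it for human review.
            transcript.append((task, action, score, "escalated-for-audit"))
        elif score >= SUSPICION_THRESHOLD:
            # Discard the suspicious action and substitute the trusted model's.
            transcript.append((task, trusted_generate(task), score, "deferred-to-trusted"))
        else:
            transcript.append((task, action, score, "accepted"))
    return transcript


if __name__ == "__main__":
    for row in run_defer_to_trusted(["task-1", "task-2", "task-3"]):
        print(row)
```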
---
Outline:
(04:41) Appendix: What the AI sees when in a defer-to-trusted scaffold
(07:45) Appendix: Reasons why imperfect adversarial robustness is bad other than the untrusted model jailbreaking the trusted one
---
Narrated by TYPE III AUDIO.