Redwood Research Blog

“Why imperfect adversarial robustness doesn’t doom AI control” by Buck Shlegeris



Subtitle: There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes.

(thanks to Alex Mallen, Cody Rushing, Zach Stein-Perlman, Hoagy Cunningham, Vlad Mikulik, and Fabien Roger for comments)

Sometimes I hear people argue against AI control as follows: if your control measures rely on getting good judgments from "trusted" AI models, you're doomed because AI models aren't adversarially robust. I agree that better adversarial robustness would definitely help with control, but I think this argument misses several important points; my overall take is that current levels of adversarial robustness don’t come close to dooming AI control.

The first and most important point is that adversarial robustness is normally researched in a setting that's much easier for the adversary. In most discussions, if an adversary can ever get a model to do something bad—even after many attempts and with a low success rate—this counts as a failure. But [...]

---

Outline:

(05:12) Appendix: What the AI sees when in a defer-to-trusted scaffold

(08:46) Appendix: Reasons why imperfect adversarial robustness is bad other than the untrusted model jailbreaking the trusted one

---

First published:

November 18th, 2024

Source:

https://redwoodresearch.substack.com/p/why-imperfect-adversarial-robustness

---

Narrated by TYPE III AUDIO.
