Redwood Research Blog

“Why imperfect adversarial robustness doesn’t doom AI control” by Buck Shlegeris



Subtitle: There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes.

(thanks to Alex Mallen, Cody Rushing, Zach Stein-Perlman, Hoagy Cunningham, Vlad Mikulik, and Fabien Roger for comments)

Sometimes I hear people argue against AI control as follows: if your control measures rely on getting good judgments from "trusted" AI models, you're doomed because AI models aren't adversarially robust. I agree that better adversarial robustness would definitely help with control, but I think this argument misses several important points; my overall take is that current levels of adversarial robustness don’t come close to dooming AI control.

The first and most important point is that adversarial robustness is normally researched in a setting that's much easier for the adversary. In most discussions, if an adversary can ever get a model to do something bad—even after many attempts and with a low success rate—this counts as a failure. But [...]

---

Outline:

(05:12) Appendix: What the AI sees when in a defer-to-trusted scaffold

(08:46) Appendix: Reasons why imperfect adversarial robustness is bad other than the untrusted model jailbreaking the trusted one

---

First published:

November 18th, 2024

Source:

https://redwoodresearch.substack.com/p/why-imperfect-adversarial-robustness

---

Narrated by TYPE III AUDIO.
