
Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g., the original AI control paper, “Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats”, and “How to prevent collusion when using untrusted models to monitor each other”.) For example:
But I don’t think situations where [...]
---
Outline:
(02:40) The conservative assumption I don't make
(05:36) Hard worlds are plausible
(11:39) Those assumptions are hard to evaluate
(14:57) Why it's nice to rely on these assumptions
(15:08) Making those assumptions makes it methodologically clean to evaluate safety
(16:06) Relying only on easy-to-evaluate assumptions would allow more clearly-specified and hard-to-game regulatory standards
(17:22) Kerckhoffs' principle in computer security as an analogy
(18:41) The main downsides of these assumptions
(18:46) You can't completely escape from methodological messiness
(19:50) Bad technique prioritization
(21:02) Insane game theory requirements to evaluate techniques
(21:55) Conclusion
---
Narrated by TYPE III AUDIO.