
Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, How to prevent collusion when using untrusted models to monitor each other.) For example:
But I don’t think situations where [...]
---
Outline:
(02:40) The conservative assumption I don't make
(05:36) Hard worlds are plausible
(11:39) Those assumptions are hard to evaluate
(14:57) Why it's nice to rely on these assumptions
(15:08) Making those assumptions makes it methodologically clean to evaluate safety
(16:06) Relying only on easy-to-evaluate assumptions would allow more clearly-specified and hard-to-game regulatory standards
(17:22) Kerckhoffs's principle in computer security as an analogy
(18:41) The main downsides of these assumptions
(18:46) You can't completely escape from methodological messiness
(19:50) Bad technique prioritization
(21:02) Insane game theory requirements to evaluate techniques
(21:55) Conclusion
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
