


ARC has released a paper, "Backdoor defense, learnability and obfuscation", in which we study a formal notion of backdoors in ML models. Part of our motivation is an analogy between backdoors and deceptive alignment: the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy.
In this post, we will:
Thanks to [...]
---
Outline:
(01:05) Backdoors and deceptive alignment
(03:03) Static backdoor detection
(04:41) Dynamic backdoor detection
(06:51) Statistical possibility of dynamic backdoor detection
(09:02) Implications for deceptive alignment
(10:39) Computational impossibility of dynamic backdoor detection
(11:36) Implications for deceptive alignment
(12:54) Conclusion
The original text contained 8 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
