LessWrong (30+ Karma)

“Backdoors as an analogy for deceptive alignment” by Jacob_Hilton



ARC has released a paper, “Backdoor defense, learnability and obfuscation”, in which we study a formal notion of backdoors in ML models. Part of our motivation is an analogy between backdoors and deceptive alignment: the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy.

In this post, we will:

  • Lay out the analogy between backdoors and deceptive alignment
  • Discuss prior theoretical results from the perspective of this analogy
  • Explain our formal notion of backdoors and its strengths and weaknesses
  • Summarize the results in our paper and discuss their implications for deceptive alignment

Thanks to [...]

    ---

    Outline:

    (01:05) Backdoors and deceptive alignment

    (03:03) Static backdoor detection

    (04:41) Dynamic backdoor detection

    (06:51) Statistical possibility of dynamic backdoor detection

    (09:02) Implications for deceptive alignment

    (10:39) Computational impossibility of dynamic backdoor detection

    (11:36) Implications for deceptive alignment

    (12:54) Conclusion

    The original text contained 8 footnotes which were omitted from this narration.

    The original text contained 1 image which was described by AI.

    ---

    First published:

    September 6th, 2024

    Source:

    https://www.lesswrong.com/posts/efwcZ35LwS6HgFcN8/backdoors-as-an-analogy-for-deceptive-alignment

    ---

    Narrated by TYPE III AUDIO.
