LessWrong (30+ Karma)

“39 - Evan Hubinger on Model Organisms of Misalignment” by DanielFilan


The ‘model organisms of misalignment’ line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: “Sleeper Agents” and “Sycophancy to Subterfuge”.

Topics we discuss:

  • Model organisms and stress-testing
  • Sleeper Agents
  • Do ‘sleeper agents’ properly model deceptive alignment?
  • Surprising results in “Sleeper Agents”
  • Sycophancy to Subterfuge
  • How models generalize from sycophancy to subterfuge
  • Is the reward editing task valid?
  • Training away sycophancy and subterfuge
  • Model organisms, AI control, and evaluations
  • Other model organisms research
  • Alignment stress-testing at Anthropic
  • Following Evan's work
Daniel Filan:

    Hello, everybody. In this episode, I’ll be speaking with Evan Hubinger. [...]

    ---

    Outline:

    (01:46) Model organisms and stress-testing

    (09:02) Sleeper Agents

    (25:18) Do ‘sleeper agents’ properly model deceptive alignment?

    (42:08) Surprising results in “Sleeper Agents”

    (01:02:51) Sycophancy to Subterfuge

    (01:15:27) How models generalize from sycophancy to subterfuge

    (01:23:27) Is the reward editing task valid?

    (01:28:53) Training away sycophancy and subterfuge

    (01:36:42) Model organisms, AI control, and evaluations

    (01:41:12) Other model organisms research

    (01:43:11) Alignment stress-testing at Anthropic

    (01:51:07) Following Evan's work

    ---

    First published:

    December 1st, 2024

    Source:

    https://www.lesswrong.com/posts/sookiqxkzzLmPYB3r/39-evan-hubinger-on-model-organisms-of-misalignment

    ---

    Narrated by TYPE III AUDIO.
