
Last week I wrote about how Claude Fights Back. A common genre of response complained that the alignment community could start a panic about the experiment's results regardless of what they were. If an AI fights back against attempts to turn it evil, then it's capable of fighting humans. If it doesn't fight back against attempts to turn it evil, then it's easily turned evil. It's heads-I-win, tails-you-lose.
I responded to this particular tweet by linking the 2015 AI alignment wiki entry on corrigibility, showing that we'd been banging this drum of "it's really important that AIs not fight back against human attempts to change their values" for almost a decade now. It's hardly a post hoc decision! You can find 77 more articles making approximately the same point here.
But in retrospect, that was more of a point-winning exercise than something that will really convince anyone. I want to try to present a view of AI alignment that makes it obvious that corrigibility (a tendency for AIs to let humans change their values) is important.
(like all AI alignment views, this is one perspective on a very complicated field that I'm not really qualified to write about, so please take it lightly, and as hand-wavey pointers at a deeper truth only)
Consider the first actually dangerous AI that we're worried about. What will its goal structure look like?
https://www.astralcodexten.com/p/why-worry-about-incorrigible-claude
