Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on "AI is easy to control" by Pope & Belrose, published by Steve Byrnes on December 1, 2023 on The AI Alignment Forum.
Quintin Pope & Nora Belrose have a new "AI Optimists" website, along with a new essay "AI is easy to control", arguing that the risk of human extinction due to future AI ("AI x-risk") is a mere 1% ("a tail risk worth considering, but not the dominant source of risk in the world"). (I'm much more pessimistic.) It makes lots of interesting arguments, and I'm happy that the authors are engaging in substantive and productive discourse, unlike the ad hominem, vibes-based drivel which is growing increasingly common on both sides of the AI x-risk issue in recent months.
This is not a comprehensive rebuttal or anything; rather, I'm picking up on a few threads that seem important to where we disagree, or where I have something I want to say.
Summary / table-of-contents:
Note: I think Sections 1 & 4 are the main reasons that I'm much more pessimistic about AI x-risk than Pope & Belrose, whereas Sections 2 & 3 are more nitpicky.
Section 1 argues that even if controllable AI has an "easy" technical solution, there are still good reasons to be concerned about AI takeover, because of things like competition and coordination issues, and in fact I would still be overall pessimistic about our prospects.
Section 2 talks about the terms "black box" versus "white box".
Section 3 talks about what if anything we learn from "human alignment", including some background on how I think about human innate drives.
Section 4 argues that pretty much the whole essay would need to be thrown out if future AI is trained in a substantially different way from current LLMs. If this strikes you as a bizarre, unthinkable hypothetical: yes, I am here to tell you that other types of AI do actually exist. I specifically discuss the example of "brain-like AGI" (a version of actor-critic model-based RL), spelling out a bunch of areas where the essay makes claims that wouldn't apply to that type of AI, and more generally how it would differ from LLMs in safety-relevant ways.
1. Even if controllable AI has an "easy" technical solution, I'd still be pessimistic about AI takeover
Most of Pope & Belrose's essay is on the narrow question of whether the AI control problem has an easy technical solution. That's great! I'm strongly in favor of arguing about narrow questions. And after this section I'll be talking about that narrow question as well. But the authors do also bring up the broader question of whether AI takeover is likely to happen, all things considered. These are not the same question; for example, there could be an easy technical solution, but people don't use it.
So, for this section only, I will assume for the sake of argument that there is in fact an easy technical solution to the AI control and/or alignment problem. Unfortunately, in this world, I would still think future catastrophic takeover by out-of-control AI is not only plausible but likely.
Suppose someone makes an AI that really really wants something in the world to happen, in the same way a person might really really want to get out of debt, or Elon Musk really really wants for there to be a Mars colony - including via means-end reasoning, out-of-the-box solutions, inventing new tools to solve problems, and so on.
If so, then we run into the classic problem of instrumental convergence.
But before we get to that, why might we suppose that someone might make an AI that really really wants something in the world to happen? Well, lots of reasons:
People have been trying to do exactly that since the dawn of AI.
Humans often really really want something in the world to happen (e.g., for there to be more efficient solar cells, for my country to win the war, to make lots of money, to do a certain very impressive thing that will win fame and investors and NeurIPS papers...