The Nonlinear Library

AF - Misgeneralization as a misnomer by Nate Soares



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Misgeneralization as a misnomer, published by Nate Soares on April 6, 2023 on The AI Alignment Forum.
Here are two different ways an AI can turn out unfriendly:
1. You somehow build an AI that cares about "making people happy". In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it's more capable), it forcibly puts each human in a separate individual heavily-defended cell, and pumps them full of opiates.
2. You build an AI that's good at making people happy. In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it's more capable), it turns out that whatever was causing that "happiness"-promoting behavior was a balance of a variety of other goals (such as basic desires for energy and memory), and it spends most of the universe on some combination of that other stuff that doesn't involve much happiness.
(To state the obvious: please don't try to get your AIs to pursue "happiness"; you want something more like CEV in the long run, and in the short run I strongly recommend aiming lower, at a pivotal act.)
In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of "happiness", one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.
Note that this list of “ways things can go wrong when the AI looked like it was optimizing happiness during training” is not exhaustive! (For instance, consider an AI that cares about something else entirely, and knows you'll shut it down if it doesn't look like it's optimizing for happiness. Or an AI whose goals change heavily as it reflects and self-modifies.)
(This list isn't even really disjoint! You could get both at once, resulting in, e.g., an AI that spends most of the universe’s resources on acquiring memory and energy for unrelated tasks, and a small fraction of the universe on doped-up human-esque shells.)
The solutions to these two problems are pretty different. To resolve the problem sketched in (1), you have to figure out how to get an instance of the AI's concept ("happiness") to match the concept you hoped to transmit, even in the edge-cases and extremes that it will have access to in deployment (when it needs to be powerful enough to pull off some pivotal act that you yourself cannot pull off, and thus capable enough to access extreme edge-case states that you yourself cannot).
To resolve the problem sketched in (2), you have to figure out how to get the AI to care about one concept in particular, rather than a complicated mess that happens to balance precariously on your target ("happiness") in training.
I note this distinction because it seems to me that various people around these parts are either unduly lumping these issues together, or are failing to notice one of them. For example, they seem to me to be mixed together in “The Alignment Problem from a Deep Learning Perspective” under the heading of "goal misgeneralization".
(I think "misgeneralization" is a misleading term in both cases, but it's an even worse fit for (2) than (1). A primate isn't "misgeneralizing" its concept of "inclusive genetic fitness" when it gets smarter and invents condoms; it didn't even really have that concept to misgeneralize, and what shreds of the concept it did have we...

The Nonlinear Library, by The Nonlinear Fund
