The Nonlinear Library: Alignment Forum

AF - A taxonomy of non-schemer models (Section 1.2 of "Scheming AIs") by Joe Carlsmith


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A taxonomy of non-schemer models (Section 1.2 of "Scheming AIs"), published by Joe Carlsmith on November 22, 2023 on The AI Alignment Forum.
This is Section 1.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here.
Other models training might produce
I'm interested, in this report, in the likelihood that training advanced AIs using fairly baseline ML methods (for example, of the type described in Cotra (2022)) will give rise, by default, to schemers - that is, to agents who are trying to get high reward on the episode specifically in order to get power for themselves (or for other AIs) later.
In order to assess this possibility, though, we need to have a clear sense of the other types of models this sort of training could in principle produce. In particular: terminal training-gamers, and agents that aren't playing the training-game at all. Let's look at each in turn.
Terminal training-gamers (or, "reward-on-the-episode seekers")
As I said above, terminal training-gamers aim their optimization at the reward process for the episode because they intrinsically value performing well according to some part of that process, rather than because doing so serves some other goal. I'll also call these "reward-on-the-episode seekers." We discussed these models above, but I'll add a few more quick clarifications.
First, as many have noted (e.g. Turner (2022) and Ringer (2022)), goal-directed models trained using RL do not necessarily have reward as their goal. That is, RL updates a model's weights to make actions that lead to higher reward more likely, but that leaves open the question of what internal objectives (if any) this creates in the model itself (and the same holds for other sorts of feedback signals).
So the hypothesis that a given sort of training will produce a reward-on-the-episode seeker is a substantive one (see e.g. here for some debate), not settled by the structure of the training process itself.
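To make the distinction concrete, here is a minimal REINFORCE-style sketch in Python (the shapes and names like `reinforce_step` are illustrative assumptions, not anything from the report). The point is that reward enters the update rule only as a scalar weight on a gradient: the rule makes highly-rewarded actions more likely, but it says nothing about what internal objectives, if any, the resulting model represents.

```python
import torch

# Minimal REINFORCE-style update (hypothetical setup, for illustration only).
obs_dim, n_actions = 8, 4
policy = torch.nn.Linear(obs_dim, n_actions)        # stand-in policy network
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

def reinforce_step(obs: torch.Tensor, action: int, reward: float) -> None:
    log_probs = torch.log_softmax(policy(obs), dim=-1)
    # Reward appears only as a scalar weight on the log-probability gradient.
    # It shapes which behaviors get reinforced, while leaving open what goals
    # (if any) the trained model ends up internally representing.
    loss = -reward * log_probs[action]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```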
That said, I think it's natural to privilege the hypothesis that models trained to produce highly-rewarded actions on the episode will learn goals focused on something in the vicinity of reward-on-the-episode. In particular: these sorts of goals will in fact lead to highly-rewarded behavior, especially in the context of situational awareness.[1] And absent training-gaming, goals aimed at targets that can be easily separated from reward-on-the-episode (for example: "curiosity") can be detected and penalized via what I call "mundane adversarial training" below (for example, by putting the model in a situation where following its curiosity doesn't lead to highly rewarded behavior).
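As a toy illustration of that last point (everything below, including the `curiosity_score` proxy, is a hypothetical sketch rather than a method from the report): one could construct evaluation episodes in which novelty-seeking is deliberately anti-correlated with reward, so that a policy pursuing "curiosity" as a goal scores poorly and the proxy gets penalized.

```python
# Toy sketch of "mundane adversarial training": build episodes where a
# separable proxy goal (here, novelty-seeking) points away from reward.

def curiosity_score(trajectory: list[str]) -> float:
    """Proxy objective: fraction of never-before-seen states visited."""
    seen: set[str] = set()
    novel = 0
    for state in trajectory:
        if state not in seen:
            novel += 1
            seen.add(state)
    return novel / len(trajectory)

def adversarial_reward(trajectory: list[str], task_reward: float) -> float:
    # In these specially constructed episodes, chasing novelty is made
    # anti-correlated with reward: the more curiosity-driven the
    # trajectory looks, the lower the reward signal it receives.
    return task_reward - curiosity_score(trajectory)

traj = ["s0", "s1", "s2", "s2"]   # a trajectory from a constructed episode
print(adversarial_reward(traj, task_reward=1.0))
```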
Second: the limitation of the reward-seeking to the episode is important. Models that care intrinsically about getting reward in a manner that extends beyond the episode (for example, "maximize my reward over all time") would not count as terminal training-gamers in my sense (and if, as a result of this goal, they start training-gaming in order to get power later, they will count as schemers on my definition).
Indeed, I think people sometimes move too quickly from "the model wants to maximize the sort of reward that the training process directly pressures it to maximize" to "the model wants to maximize reward over all time."[2] The point of my concept of the "episode" - i.e., the temporal unit that the training process directly pressures the model to optimize - is that these aren't the same. More on this in section 2.2.1 below.
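A small sketch may help show why the episode boundary matters (the function below is an illustrative assumption, not notation from the report): in a standard RL objective, the return sums rewards only up to the end of the episode, so rewards accruing after that boundary exert no direct training pressure, whatever the model might "want" about them.

```python
# Illustrative only: the discounted return that standard RL objectives
# optimize stops summing at the episode boundary.

def episode_return(rewards: list[float], gamma: float = 0.99) -> float:
    """Discounted return for a single episode; the sum ends at the
    episode boundary, so later rewards never enter this objective."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

this_episode = [0.0, 1.0, 0.5]
later_episodes = [10.0, 10.0]          # never summed into the gradient signal
print(episode_return(this_episode))    # only these rewards shape the update
```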
Finally: while I'll speak of "reward-on-the-episode...