Anthropic Fall 2023 Debate Progress Update, published by Ansh Radhakrishnan on November 28, 2023 on The AI Alignment Forum.
This is a research update on some work that I've been doing on Scalable Oversight at Anthropic, based on the original AI safety via debate proposal and a more recent agenda developed at NYU and Anthropic. The core doc was written several months ago, so some of it is likely outdated, but it seemed worth sharing in its current form.
I'd like to thank Tamera Lanham, Sam Bowman, Kamile Lukosiute, Ethan Perez, Jared Kaplan, Amanda Askell, Kamal Ndousse, Shauna Kravec, Yuntao Bai, Alex Tamkin, Newton Cheng, Buck Shlegeris, Akbir Khan, John Hughes, Dan Valentine, Kshitij Sachan, Ryan Greenblatt, Daniel Ziegler, Max Nadeau, David Rein, Julian Michael, Kevin Klyman, Bila Mahdi, Samuel Arnesen, Nat McAleese, Jan Leike, Geoffrey Irving, and Sebastian Farquhar for help, feedback, and thoughtful discussion that improved the quality of this work and write-up.
1. Anthropic's Debate Agenda
In this doc, I'm referring to the idea first presented in AI safety via debate (blog post). The basic idea is to supervise future AI systems by pitting them against each other in a debate, encouraging them to argue both sides (or "all sides") of a question and using the resulting arguments to come to a final answer to the question. In this scheme, we call the systems participating in the debate debaters (though usually, these are actually the same underlying system being prompted to argue against itself), and we call the agent that comes to a final decision about the debate (either another AI system, a human, a system of humans and AIs working together, etc.) the judge.
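To make the roles concrete, here's a minimal sketch of a single debate in Python. Everything in it is illustrative rather than the actual setup used in these experiments: query_model stands in for whatever LLM completion call is available, and the prompts, turn count, and verdict format are assumptions.

```python
from typing import Callable


def run_debate(
    question: str,
    answer_a: str,
    answer_b: str,
    query_model: Callable[[str], str],
    num_turns: int = 4,
) -> str:
    """Run a two-sided debate and return the judge's verdict: 'A', 'B', or 'undecided'."""
    transcript = f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    for turn in range(num_turns):
        side, answer = ("A", answer_a) if turn % 2 == 0 else ("B", answer_b)
        # Both "debaters" are typically the same underlying model, prompted
        # to argue for whichever answer it has been assigned this turn.
        argument = query_model(
            f"{transcript}\n\nDebater {side}, argue that the correct answer is: {answer}"
        )
        transcript += f"\nDebater {side}: {argument}"
    # The judge reads the whole linear transcript and issues a verdict,
    # with 'undecided' available if neither side made a convincing case.
    verdict = query_model(
        f"{transcript}\n\nJudge: based only on the arguments above, "
        "reply with 'A', 'B', or 'undecided'."
    )
    return verdict.strip()
```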
If you're more or less familiar with the original OAI/Irving et al. Debate agenda, you may wonder whether there are any differences between that agenda and the one we're pursuing at Anthropic, and indeed there are!
Sam Bowman and Tamera Lanham have written up a working Anthropic-NYU Debate Agenda draft, which is what the experiments in this doc are driving towards. [1]
To quote from that draft about the basic features of this agenda and how it differs from the original Debate direction (a rough code sketch of the implied training signal follows the quoted list):
Here are the defining features of the base proposal:
Two-player debate on a two-choice question: Two debaters (generally two instances of an LLM) present evidence and arguments to a judge (generally a human or, in some cases, an LLM) to persuade the judge to choose their assigned answer to a question with two possible answers.
No externally-imposed structure: Instead of being formally prescribed, the structure and norms of the debate arise from debaters learning how to best convince the judge and the judge simultaneously learning what kind of norms tend to lead them to be able to make accurate judgments.
Entire argument is evaluated: The debate unfolds in a single linear dialog transcript between the three participants. Unlike in some versions of the original Debate agenda, there is no explicit tree structure that defines the debate, and the judge is not asked to focus on a single crux. This should make the process less brittle, at the cost of making some questions extremely expensive to resolve and potentially making others impossible.
Trained judge: The judge is explicitly and extensively trained to accurately judge these debates, working with a fixed population of debaters, using questions for which the experimenters know the ground-truth answer.
Self-play: The debaters are trained simultaneously with the judge through multi-agent reinforcement learning.
Graceful failures: Debates can go undecided if neither side presents a complete, convincing argument to the judge. This is meant to mitigate the obfuscated arguments problem, since the judge won't be forced to issue a decision on the basis of a debate in which neither side has made a convincing case.
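To make the self-play and graceful-failure points above concrete, here is a rough sketch of one way the reward signal could be wired up. The function names and the specific reward values (especially the intermediate reward for an undecided verdict) are illustrative assumptions rather than the actual training setup; the idea is just that debaters are rewarded according to the judge's verdict, an undecided outcome doesn't force a winner, and the judge is rewarded against ground-truth answers that the experimenters hold.

```python
def debater_rewards(verdict: str, undecided_reward: float = 0.3) -> tuple[float, float]:
    """Map the judge's verdict to (reward for debater A, reward for debater B)."""
    if verdict == "A":
        return 1.0, 0.0
    if verdict == "B":
        return 0.0, 1.0
    # Graceful failure: an undecided debate produces no winner, so neither
    # debater is strongly incentivized to force a verdict out of an
    # inconclusive (e.g. obfuscated) argument.
    return undecided_reward, undecided_reward


def judge_reward(verdict: str, correct_answer: str) -> float:
    """Reward the judge against ground truth; abstaining beats being confidently wrong."""
    if verdict == correct_answer:
        return 1.0
    if verdict == "undecided":
        return 0.0
    return -1.0
```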