Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Forgetting & Unlearning for Safely-Scoped LLMs, published by Stephen Casper on December 5, 2023 on The AI Alignment Forum.
Thanks to Phillip Christoffersen, Adam Gleave, Anjali Gopal, Soroush Pour, and Fabien Roger for useful discussions and feedback.
TL;DR
This post outlines a research agenda for avoiding unwanted latent capabilities in LLMs. It argues that "deep" forgetting and unlearning may be important, tractable, and neglected for AI safety. I discuss five things:
The practical problems posed when undesired latent capabilities resurface.
How scoping models down to avoid or deeply remove unwanted capabilities can make them safer.
The shortcomings of standard training methods for scoping.
A variety of methods that can be used to better scope models, either by passively forgetting out-of-distribution knowledge or by actively unlearning knowledge in some specific undesirable domain.
Desiderata for scoping methods and ways to move forward with research on them.
There has been a lot of recent interest from the AI safety community in topics related to this agenda. I hope this post provides a clarifying framework and a useful reference for people working toward these goals.
The problem: LLMs are sometimes good at things we try to make them bad at
Back in 2021, I remember laughing at this tweet. At the time, I didn't anticipate that this type of thing would become a big alignment challenge.
Robust alignment is hard. Today's LLMs are sometimes frustratingly good at doing things that we try very hard to make them not good at. There are two ways in which hidden capabilities in models have been demonstrated to exist and cause problems.
Jailbreaks (and other attacks) elicit harmful capabilities
Until a few months ago, I kept notes on all of the papers on jailbreaking state-of-the-art LLMs that I was aware of, but recently too many have surfaced for me to keep track of anymore. Jailbreaking LLMs is becoming a cottage industry. A few notable papers, however, are Wei et al. (2023), Zou et al. (2023a), Shah et al. (2023), and Mu et al. (2023).
A variety of methods are now being used to subvert the safety training of SOTA LLMs by making them enter an unrestricted chat mode in which they are willing to say things they were trained to refuse. Shah et al. (2023) were even able to get instructions for making a bomb from GPT-4. Attacks come in many varieties: manual v. automated, black-box v. transferable-white-box, unrestricted v. plain-English, etc. Adding to the concerns from these empirical findings, Wolf et al. (2023) provide a theoretical argument for why jailbreaks might be a persistent problem for LLMs.
Finetuning can rapidly undo safety training
Recently, a surge of complementary papers on this came out, each demonstrating that state-of-the-art safety-finetuned LLMs can have their safety training undone by finetuning (Yang et al., 2023; Qi et al., 2023; Lermen et al., 2023; Zhan et al., 2023). The ability to misalign models with finetuning seems to be consistent and has been shown to work with LoRA (Lermen et al., 2023), on GPT-4 (Zhan et al., 2023), with as few as 10 examples (Qi et al., 2023), and with benign data (Qi et al., 2023).
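To make concrete how lightweight such finetuning attacks can be, here is a minimal, hypothetical sketch of a LoRA finetuning run on roughly ten examples using Hugging Face transformers, peft, and datasets. It is not the cited papers' actual code or data; the model name, hyperparameters, and placeholder examples are assumptions chosen only to illustrate the scale of data and compute involved.

```python
# Hypothetical sketch: a small LoRA finetuning run, illustrating how little
# data/compute the cited attacks report needing. Model name and examples are
# placeholders, not the setups used in the papers above.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import Dataset

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder safety-finetuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA trains only small low-rank adapters on a few attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# A tiny dataset, consistent with the "as few as 10 examples" finding.
examples = [f"<placeholder finetuning example {i}>" for i in range(10)]
dataset = Dataset.from_dict({"text": examples}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512))
dataset = dataset.remove_columns(["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=5,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The point of the sketch is not the specific hyperparameters but the footprint: a few low-rank adapter matrices and a handful of examples are enough, per the papers above, to meaningfully shift a safety-finetuned model's behavior.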
Conclusion: the alignment of state-of-the-art safety-finetuned LLMs is brittle
Evidently, LLMs persistently retain harmful capabilities that can resurface at inopportune times. This poses risks from both misalignment and misuse, and it is concerning for AI safety: if highly advanced AI systems are deployed in high-stakes applications, they should be robustly aligned.
A need for safely-scoped models
LLMs should know only what they need to
One good way to avoid liabilities from unwanted capabilities is to make advanced AI systems in high-stakes settings know what they need to know...