Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety Camp 2024, published by Linda Linsefors on November 18, 2023 on The AI Alignment Forum.
AI Safety Camp connects you with a research lead to collaborate on a project - to see where your work could help ensure future AI is safe.
Apply before December 1 to collaborate online from January to April 2024.
We value diverse backgrounds. Many roles, though definitely not all, require some knowledge of AI safety, mathematics, or machine learning.
Some skills requested by various projects:
Art, design, photography
Humanities scholarship
Communication
Marketing/PR
Legal expertise
Project management
Interpretability methods
Using LLMs
Coding
Math
Economics
Cybersecurity
Reading scientific papers
Knowledge of scientific methodologies
Thinking and working independently
Familiarity with the AI risk research landscape
Projects
To not build uncontrollable AI
Projects to restrict corporations from recklessly scaling the training and uses of ML models, given controllability limits.
1. Towards realistic ODDs for foundation model based AI offerings
2. Luddite Pro: information for the refined luddite
3. Lawyers (and coders) for restricting AI data laundering
4. Assessing the potential of congressional messaging campaigns for AIS
Everything else
Diverse other projects, including technical control of AGI in line with human values.
Mech-Interp
5. Modelling trajectories of language models
6. Towards ambitious mechanistic interpretability
7. Exploring toy models of agents
8. High-level mechanistic interpretability and activation engineering library
9. Out-of-context learning interpretability
10. Understanding search and goal representations in transformers
Evaluating and Steering Models
11. Benchmarks for stable reflectivity
12. SADDER: situational awareness datasets for detecting extreme risks
13. TinyEvals: how do language models speak coherent English?
14. Evaluating alignment evaluations
15. Pipelines for evaluating and steering LLMs towards faithful reasoning
16. Steering of LLMs through addition of activation vectors with latent ethical valence
Agent Foundations
17. High actuation spaces
18. Does sufficient optimization imply agent structure?
19. Discovering agents in raw bytestreams
20. The science algorithm
Miscellaneous Alignment Methods
21. SatisfIA - AI that satisfies without overdoing it
22. How promising is automating alignment research? (literature review)
23. Personalized fine-tuning token for AI value alignment
24. Self-other overlap @AE Studio
25. Asymmetric control in LLMs: model editing and steering that resists control for unalignment
26. Tackling key challenges in Debate
Other
27. AI-driven economic safety nets: restricting the macroeconomic disruptions of AGI deployment
28. Policy-based access to powerful models
29. Organise the next Virtual AI Safety Unconference
Please write your application with the research lead of your favorite project in mind. Research leads will directly review applications this round. We organizers will only assist when a project receives an overwhelming number of applications.
Apply now
Apply if you…
want to consider and try out roles for helping ensure that future AI functions safely;
are able to explain why and how you would contribute to one or more projects;
previously studied a topic or trained in skills that can bolster your new team's progress;
can join weekly team calls and block out 5 hours of work each week from January to April 2024.
Timeline
Applications
By 1 Dec: Apply. Fill in the questions doc and submit it through the form.
Dec 1-22: Interviews. You may receive an email inviting you to an interview from one or more of the research leads whose projects you applied to.
By 28 Dec: Final decisions. You will definitely know if you are admitted. Hopefully we can tell you sooner, but we pinky-swear we will by 28 Dec.
Program
Jan 13-14: Opening weekend. First meeting ...