Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evidential Correlations are Subjective, and it might be a problem, published by Martín Soto on March 7, 2024 on The AI Alignment Forum.
I explain (in layman's terms) a realization that might make acausal trade hard or impossible in practice.
Summary: We know that if players hold different beliefs about their Evidential Correlations, they might miscoordinate. But surely they will eventually learn the correct Evidential Correlations, right? Not necessarily, because there is no objective notion of "correct" here (in the way that there is for math or physics). Thus, selection pressures might be much weaker, and different agents might systematically converge on different ways of assigning Evidential Correlations.
Epistemic status: Confident that this realization is true, but the quantitative question of exactly how weak the selection pressures are remains open.
What are Evidential Correlations, really?
Skippable if you know the answer to the question.
Alice and Bob are playing a Prisoner's Dilemma, and they know each other's algorithms: Alice.source and Bob.source.[1] Since their algorithms are of roughly equal complexity, neither of them can easily assess what the other will output. Alice might notice something like "hmm, Bob.source seems to default to Defection when it throws an exception, so this should update me slightly in the direction of Bob Defecting".
But she doesn't know exactly how often Bob.source throws an exception, or what it does when it doesn't.
Imagine, though, that Alice notices Alice.source and Bob.source are pretty similar in some relevant ways (maybe the overall logical structure seems very close, or the depth of the for loops is the same, or she learns that the training algorithm that shaped them is the same one). She's still uncertain about what either of these two algorithms outputs[2], but this updates her in the direction of "both algorithms outputting the same action".
If Alice implements/endorses Evidential Decision Theory, she will reason as follows:
Conditional on Alice.source outputting Defect, it seems very likely Bob.source also outputs Defect, thus my payoff will be low.
But conditional on Alice.source outputting Cooperate, it seems very likely Bob.source also outputs Cooperate, thus my payoff will be high.
So I (Alice) should output Cooperate, and thus (very probably) obtain a high payoff.
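To make this concrete, here is a minimal sketch (not from the original post) of Alice's EDT calculation. The Prisoner's Dilemma payoffs and the 0.9 credence that Bob's output matches hers are illustrative assumptions, chosen only to show how the conditional expected payoffs come out:

```python
# Illustrative sketch of EDT reasoning in a Prisoner's Dilemma.
# The payoff numbers and the correlation strength are assumptions for exposition.

# Alice's payoffs, indexed by (Alice's action, Bob's action).
payoffs = {
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # Alice cooperates, Bob defects (Alice exploited)
    ("D", "C"): 5,  # Alice defects, Bob cooperates (Alice exploits)
    ("D", "D"): 1,  # mutual defection
}

# Alice's credence that Bob.source outputs the same action as Alice.source,
# conditional on her own output. This encodes the believed evidential correlation.
p_match = 0.9

def edt_value(my_action: str) -> float:
    """Alice's expected payoff conditional on Alice.source outputting my_action."""
    other = "D" if my_action == "C" else "C"
    return (p_match * payoffs[(my_action, my_action)]
            + (1 - p_match) * payoffs[(my_action, other)])

print(edt_value("C"))  # 0.9*3 + 0.1*0 = 2.7
print(edt_value("D"))  # 0.9*1 + 0.1*5 = 1.4
# Conditional on a strong believed correlation, Cooperate looks better, so
# EDT-Alice cooperates; with a weak correlation (p_match near 0.5) Defect wins.
```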
To the extent Alice's belief about similarity was justified, it seems like she will perform pretty well in these situations (obtaining high payoffs). When you take this reasoning to the extreme, maybe Alice and Bob are both aware that each of them knows this kind of cooperation bootstrapping is possible (if they both believe they are similar enough), and thus (even if they are causally disconnected, and just simulating each other's code) they can coordinate on some pretty complex trades.
This is Evidential Cooperation in Large Worlds.
But wait a second: How could this happen, without them being causally connected? What was this mysterious similarity, this spooky correlation at a distance, that allowed them to create cooperation from thin air?
Well, in the words of Daniel Kokotajlo: it's just your credences, bro!
The bit required for this to work is that they believe that "it is very likely we both output the same thing". Said another way, they place high probability on the possible worlds "Alice.source = C, Bob.source = C" and "Alice.source = D, Bob.source = D", but low probability on the possible worlds "Alice.source = D, Bob.source = C" and "Alice.source = C, Bob.source = D".
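As a worked illustration (the numbers are mine, not the post's), here is how such a joint credence over the four possible worlds yields the conditional beliefs Alice needs. The 0.45/0.05 split is an assumption chosen to encode a strong believed correlation:

```python
# Illustrative joint credence over the four "possible worlds".
# The 0.45/0.05 split is an assumption, not a claim from the post.
joint = {
    ("C", "C"): 0.45,  # both cooperate
    ("D", "D"): 0.45,  # both defect
    ("C", "D"): 0.05,  # Alice cooperates, Bob defects
    ("D", "C"): 0.05,  # Alice defects, Bob cooperates
}

def p_bob_given_alice(bob_action: str, alice_action: str) -> float:
    """P(Bob.source = bob_action | Alice.source = alice_action) under `joint`."""
    p_alice = sum(p for (a, _), p in joint.items() if a == alice_action)
    return joint[(alice_action, bob_action)] / p_alice

print(p_bob_given_alice("C", "C"))  # 0.9: conditioning on Alice cooperating
print(p_bob_given_alice("C", "D"))  # 0.1: conditioning on Alice defecting
```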
This can also be phrased in terms of logical counterfactuals: if Alice.source = C, then it is very likely that Bob.source = C.[3] This is a logical counterfactual: there is, ultimately, a logical fact of the matter about what Alice.source outputs, but since she doesn't know it yet, she entertains what s...