Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparse Autoencoders: Future Work, published by Logan Riggs Smith on September 21, 2023 on The AI Alignment Forum.
Mostly my own writing, except for the 'Better Training Methods' section, which was written by @Aidan Ewart. We made a lot of progress in 4 months working on Sparse Autoencoders, an unsupervised method to scalably find monosemantic features in LLMs, but there's still plenty of work to do. Below I (Logan) give both research ideas and my current, half-baked thoughts on how to pursue them.
Find All the Circuits!
Truth/Deception/Sycophancy/Train-Test distinction/[In-context Learning/internal Optimization]
Find features relevant for these tasks. Do they generalize better than baselines?
For internal optimization, can we narrow this down to a circuit (using something like causal scrubbing) and retarget the search?
Understand RLHF
Find features for preference/reward models that make the reward large or very negative.
Compare features of models before & after RLHF
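As a minimal sketch of one way to do that comparison (my framing, not the post's method): train SAEs on the same layer of the base model and the RLHF'd model, then look up, for each base-model dictionary feature, its nearest RLHF-model features by cosine similarity of decoder directions. All names below are hypothetical.
```python
import torch

def match_features(dec_base, dec_rlhf, k=5):
    """Compare two dictionaries by cosine similarity of decoder directions.

    dec_base, dec_rlhf: (n_features, d_model) SAE decoder matrices trained on
    the same layer of the base model and the RLHF'd model. Returns, for each
    base feature, the similarities and indices of its top-k closest RLHF
    features, so you can spot features that vanish or appear after RLHF.
    """
    a = torch.nn.functional.normalize(dec_base, dim=-1)
    b = torch.nn.functional.normalize(dec_rlhf, dim=-1)
    sims = a @ b.T                      # (n_base, n_rlhf) cosine similarities
    return sims.topk(k, dim=-1)         # (values, indices), each (n_base, k)
```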
Adversarial Attacks
What features activate on adversarial attacks? What features feed into those?
Develop adversarial attacks, but only search over dictionary features
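One hedged sketch of what "only search over dictionary features" could look like: parameterize an activation-space perturbation as a sparse combination of SAE decoder directions and optimize the coefficients. `model_from_layer` (the rest of the model, run from the perturbed activation) and the loss details are assumptions for illustration, not anything from the post.
```python
import torch

def feature_space_attack(model_from_layer, decoder, x, target_fn,
                         steps=200, lr=1e-2, l1=1e-2):
    """Search for an adversarial perturbation restricted to the dictionary's span.

    decoder: (n_features, d_model) SAE decoder (dictionary) matrix
    x: (d_model,) clean activation at the layer the SAE was trained on
    target_fn: maps the model's output logits to a scalar we try to maximize
    """
    coeffs = torch.zeros(decoder.shape[0], requires_grad=True)
    opt = torch.optim.Adam([coeffs], lr=lr)
    for _ in range(steps):
        delta = coeffs @ decoder                        # perturbation built from dictionary features
        loss = -target_fn(model_from_layer(x + delta))  # push the model toward the target behaviour
        loss = loss + l1 * coeffs.abs().sum()           # L1 keeps the attack sparse and interpretable
        opt.zero_grad()
        loss.backward()
        opt.step()
    return coeffs.detach()                              # inspect the few features the attack used
```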
Circuits Across Time
Using a model with lots of checkpoints, like Pythia, we can watch feature & circuit formation over the course of training on a fixed set of datapoints.
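A rough sketch of the setup, assuming nothing beyond the public Pythia checkpoints: load a few training-step revisions from the HF hub and track how strongly a fixed dictionary direction shows up in the residual stream on the same datapoints. Whether a dictionary trained on the final checkpoint even transfers to earlier ones is part of the question; the feature direction below is just a stand-in.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"
tok = AutoTokenizer.from_pretrained(model_name)
prompts = ["The Eiffel Tower is in", "My favourite colour is"]

feature_dir = torch.randn(512)              # stand-in for a real SAE decoder direction (d_model=512)
feature_dir = feature_dir / feature_dir.norm()

for step in (1000, 10000, 50000, 143000):   # training-step checkpoints published as HF revisions
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=f"step{step}")
    with torch.no_grad():
        for p in prompts:
            out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
            resid = out.hidden_states[4][0]  # layer-4 residual stream, shape (seq, d_model)
            proj = resid @ feature_dir       # per-token projection onto the candidate feature
            print(step, repr(p), round(proj.max().item(), 3))
```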
Circuits Across Scale
Pythia models are trained on the same data, in the same order, but range in size from 70M to 12B parameters.
Turn LLMs into code
Link to very rough draft of the idea I (Logan) wrote in two days
Mechanistic Anomaly Detection
If distribution X has features A,B,C activate, and distribution Y has features B,C,D, you may be able to use this discrete property to get a better ROC curve than strictly continuous methods.
How do the different operationalizations of distance between discrete features compare against each other?
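A minimal sketch of the discrete version, with Jaccard overlap as one possible operationalization (my choice, and exactly the kind of thing the question above is asking about): score a test point by how little its set of active dictionary features overlaps with sets seen on the trusted distribution.
```python
import numpy as np
from sklearn.metrics import roc_auc_score

def active_set(feature_acts, threshold=1e-3):
    """Indices of dictionary features that fire on one datapoint."""
    return set(np.flatnonzero(feature_acts > threshold))

def anomaly_score(feature_acts, reference_sets):
    """1 minus the best Jaccard overlap with any feature set from the trusted distribution."""
    s = active_set(feature_acts)
    overlaps = [len(s & r) / max(len(s | r), 1) for r in reference_sets]
    return 1.0 - max(overlaps)

# Usage sketch: reference_sets built from distribution X; test points drawn from X and Y.
# labels = [0] * len(x_test) + [1] * len(y_test)
# scores = [anomaly_score(a, reference_sets) for a in x_test + y_test]
# print(roc_auc_score(labels, scores))   # compare against continuous baselines
```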
Activation Engineering
Use feature directions found by the dictionary instead of examples. I predict this will generalize better, but it would be good to compare against current methods.
One open problem is which token in the sequence to add the vector to. Maybe it makes sense to only add the [female] direction to tokens that are [names]. Dictionary features in previous layers may help you automatically pick the right tokens, e.g. a feature that activates on [names].
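A hedged sketch of that last idea: add the [female] dictionary direction only at positions where a hypothetical earlier-layer [names] feature is active. Every name here is illustrative; `female_dir` would be a decoder row of the SAE at the edited layer, and `name_feature_acts` would come from a previous layer's dictionary.
```python
import torch

def steer_at_name_tokens(resid, female_dir, name_feature_acts,
                         scale=4.0, act_threshold=0.5):
    """Add a feature direction to the residual stream, but only at name tokens.

    resid: (seq, d_model) residual-stream activations at the edited layer
    female_dir: (d_model,) dictionary direction to add
    name_feature_acts: (seq,) activations of a [names] feature from an earlier layer
    """
    mask = (name_feature_acts > act_threshold).float().unsqueeze(-1)  # (seq, 1)
    return resid + scale * mask * female_dir                          # broadcasts over d_model
```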
Fun Stuff
Othello/Chess/Motor commands - Find features that relate to actions that a model is able to do. Can we find a corner piece feature, a knight feature, a "move here" feature?
Feature Search
There are three ways to find features AFAIK:
1. Which input tokens activate it?
2. What output logits are causally downstream from it?
3. Which intermediate features cause it/are caused by it?
1) Input Tokens
When finding the input tokens, you may run into outlier dimensions that activate highly for most tokens (predominantly the first token), so you need to account for that.
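A small sketch of route (1) with that caveat handled, assuming you have already cached one feature's activations over a dataset (all names hypothetical): rank the max-activating tokens while skipping the first position, where outlier dimensions fire on most sequences.
```python
import torch

def top_activating_tokens(feature_acts, tokens, k=10, skip_first=True):
    """Top-k (token, activation) pairs for one dictionary feature.

    feature_acts: (n_seqs, seq_len) tensor of the feature's activations
    tokens: matching nested list of token strings, indexed as tokens[i][j]
    """
    acts = feature_acts.clone()
    if skip_first:
        acts[:, 0] = float("-inf")        # ignore the first position (outlier-dimension behaviour)
    top = acts.flatten().topk(k).indices
    rows = (top // acts.shape[1]).tolist()
    cols = (top % acts.shape[1]).tolist()
    return [(tokens[r][c], acts[r, c].item()) for r, c in zip(rows, cols)]
```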
2) Output Logits
For output logits, if you have a dataset task (e.g. predicting stereotypical gender), you can remove each feature one at a time, and sort by greatest effect. This also extends to substituting features between two distributions and finding the smallest substitution to go from one to the other. For example,
"I'm Jane, and I'm a [female]"
"I'm Dave, and I'm a [male]"
Suppose that at the token "Jane", two features A & B activate ([1,1,0] over features A, B, C), and at "Dave", two features B & C activate ([0,1,1]). Then we can look for the smallest substitution between the two that makes the Jane sentence complete as " male". If A is the "female" feature, then ablating it (setting it to zero) should make the model assign equal probability to male/female. Adding the female feature to Dave's activations and subtracting the male direction should make the Dave sentence complete as " female".
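A toy sketch of that example, with all interfaces assumed for illustration: `sae.encode`/`sae.decode` map between the residual stream and dictionary features, and `model_tail` runs the remainder of the model from the edited activation to logits.
```python
import torch

def gender_logit_diff_after_ablation(model_tail, sae, resid, feature_idx,
                                     male_id, female_id):
    """Ablate one dictionary feature and measure the effect on " male" vs " female"."""
    f = sae.encode(resid)                        # (seq, n_features) feature activations
    f_ablated = f.clone()
    f_ablated[:, feature_idx] = 0.0              # e.g. zero out the "female" feature A
    logits = model_tail(sae.decode(f_ablated))   # (seq, vocab)
    last = logits[-1]                            # prediction at the final [female]/[male] slot
    return (last[male_id] - last[female_id]).item()
```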
3) Intermediate Features
Say we're looking at layer 5, feature 783, which activates ~10 for 20 datapoints on average. We c...