Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems, published by Sonia Joseph on March 13, 2024 on The AI Alignment Forum.
Join our Discord here.
This article was written by Sonia Joseph, in collaboration with Neel Nanda, and incubated in Blake Richards's lab at Mila and in the MATS community. Thank you to the Prisma core contributors, including Praneet Suresh, Rob Graham, and Yash Vadi.
Full acknowledgements of contributors are at the end. I am grateful to my collaborators for their guidance and feedback.
Outline
Part One: Introduction and Motivation
Part Two: Tutorial Notebooks
Part Three: Brief ViT Overview
Part Four: Demo of Prisma's Functionality
Key features, including logit attribution, attention head visualization, and activation patching.
Preliminary research results obtained using Prisma, including emergent segmentation maps and canonical attention heads.
Part Five: FAQ, including Key Differences between Vision and Language Mechanistic Interpretability
Part Six: Getting Started with Vision Mechanistic Interpretability
Part Seven: How to Get Involved
Part Eight: Open Problems in Vision Mechanistic Interpretability
Introducing the Prisma Library for Multimodal Mechanistic Interpretability
I am excited to share with the mechanistic interpretability and alignment communities a project I've been working on for the last few months. Prisma is a multimodal mechanistic interpretability library based on TransformerLens, currently supporting vanilla vision transformers (ViTs) and their vision-text counterpart, CLIP.
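As a minimal sketch of what working with the library looks like, here is how you might load a hooked ViT and cache its activations, assuming Prisma mirrors the TransformerLens interface. The import path, the `HookedViT` class name, the checkpoint identifier, and the hook names below are assumptions for illustration and may differ from the released API.

```python
import torch
# Assumption: Prisma exposes a TransformerLens-style HookedViT; the import path
# and checkpoint name here are illustrative, not the confirmed API.
from vit_prisma.models.base_vit import HookedViT

model = HookedViT.from_pretrained("vit_base_patch16_224")  # hypothetical checkpoint id
model.eval()

# Run a (dummy) image batch and cache every intermediate activation,
# mirroring TransformerLens's run_with_cache.
images = torch.randn(1, 3, 224, 224)
logits, cache = model.run_with_cache(images)

print(logits.shape)                                # (batch, num_classes)
print(cache["blocks.0.attn.hook_pattern"].shape)   # (batch, heads, tokens, tokens)
```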
With the recent rapid release of multimodal models, including Sora, Gemini, and Claude 3, it is crucial that interpretability and safety efforts keep pace. While language mechanistic interpretability already has strong conceptual foundations, many research papers, and a thriving community, research in non-language modalities lags behind.
Given that multimodal capabilities will be part of AGI, field-building in mechanistic interpretability for non-language modalities is crucial for safety and alignment.
The goal of Prisma is to make research in mechanistic interpretability for multimodal models both easy and fun. We are also building a strong and collaborative open source research community around Prisma.
You can join our Discord here.
This post gives a brief overview of the library, fleshes out some concrete open problems, and provides steps for getting started.
Prisma Goals
Build shared infrastructure (Prisma) to make it easy to run standard language mechanistic interpretability techniques on non-language modalities, starting with vision.
Build a shared conceptual foundation for multimodal mechanistic interpretability.
Shape and execute on a research agenda for multimodal mechanistic interpretability.
Build an amazing multimodal mechanistic interpretability subcommunity, inspired by current efforts in language.
Set the cultural norms of this subcommunity to be highly collaborative, curious, inventive, friendly, respectful, prolific, and safety/alignment-conscious.
Encourage sharing of early/scrappy research results on Discord/LessWrong.
Co-create a web of high-quality research.
Tutorial Notebooks
To get started, you can check out three tutorial notebooks that show how Prisma works.
Main ViT Demo
Overview of the main mechanistic interpretability techniques on a ViT, including direct logit attribution, attention head visualization, and activation patching. The activation patching example switches the network's prediction from tabby cat to Border Collie with a minimal intervention.
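For a feel of what the patching code involves, here is a hedged sketch of activation patching in the TransformerLens style, reusing `model` from the loading sketch above: activations are cached from a run on a source image (e.g., a Border Collie) and written into the corresponding position of a run on a target image (e.g., a tabby cat). The hook name, patch index, and image tensors are placeholders, not the notebook's exact code.

```python
import torch

# Placeholder inputs; in the notebook these would be real preprocessed images.
source_image = torch.randn(1, 3, 224, 224)  # e.g., a Border Collie
target_image = torch.randn(1, 3, 224, 224)  # e.g., a tabby cat

# Cache activations from the source (clean) run.
_, source_cache = model.run_with_cache(source_image)

hook_name = "blocks.6.hook_resid_post"  # assumed hook naming convention
patch_idx = 50                          # which image-patch token to overwrite

def patch_residual(activation, hook):
    # Overwrite one token position of the residual stream with the cached
    # activation from the source run, leaving everything else untouched.
    activation[:, patch_idx, :] = source_cache[hook.name][:, patch_idx, :]
    return activation

patched_logits = model.run_with_hooks(
    target_image,
    fwd_hooks=[(hook_name, patch_residual)],
)
print(patched_logits.argmax(-1))  # does the prediction flip toward the source class?
```

Sweeping this over layers and patch positions is the standard way to localize which activations carry the class information.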
Emoji Logit Lens
Deeper dive into layer- and patch-level predictions with interactive plots.
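Conceptually, the logit lens decodes the residual stream at every layer with the model's final classification head, giving a per-layer, per-patch guess at the class. The sketch below reuses `model` and `cache` from the loading sketch; the hook names and the `ln_final`, `head`, and `cfg.n_layers` attributes are assumptions, and the released API may differ.

```python
import torch

def logit_lens(model, cache, layer):
    # Decode the residual stream after `layer` with the final layer norm
    # and classification head, as if the network ended there.
    resid = cache[f"blocks.{layer}.hook_resid_post"]       # (batch, tokens, d_model)
    resid = model.ln_final(resid)
    return resid @ model.head.weight.T + model.head.bias   # (batch, tokens, num_classes)

for layer in range(model.cfg.n_layers):
    layer_logits = logit_lens(model, cache, layer)
    # Top-1 class per token at this layer; token 0 is typically the CLS token.
    print(layer, layer_logits.argmax(-1)[0, :5].tolist())
```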
Interactive Attention Head Tour
Deeper dive into the various types of attention heads a ViT contains, with interactive JavaScript visualizations.
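As a rough illustration of what the notebook visualizes, the sketch below pulls one head's attention pattern out of the cache from the loading sketch and displays the CLS token's attention over the image patches. The hook name and the 14x14 patch grid (a 224px image with 16px patches) are assumptions about the model.

```python
import matplotlib.pyplot as plt

layer, head = 3, 5
# (tokens, tokens) attention pattern for one head; hook name is an assumption.
pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]

# Attention paid by the CLS token (index 0) to each of the 196 image patches,
# reshaped onto the 14x14 patch grid of a 224px / 16px-patch ViT.
cls_to_patches = pattern[0, 1:].reshape(14, 14).detach().cpu()

plt.imshow(cls_to_patches)
plt.title(f"Layer {layer}, head {head}: CLS attention over image patches")
plt.colorbar()
plt.show()
```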
Brief ViT Overview
A vision transformer...