Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bounty: Diverse hard tasks for LLM agents, published by Beth Barnes on December 17, 2023 on The AI Alignment Forum.
Summary
We're looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations for tasks to measure performance of autonomous LLM agents.
Quick description of key desiderata for tasks:
Not too easy: would take a human professional >2 hours to complete, ideally some are >20 hours.
Easy to de-risk: we want to be able to verify that there are no weird edge cases or bugs, so that success or failure at the task is a real indication of ability rather than an artifact of a bug, a loophole, or unexpected difficulty in the task.
Plays to strengths of LLM agents: ideally, most of the tasks can be completed by writing code, using the command line, or other text-based interaction.
See desiderata section for more explanation of what makes an ideal task.
Important info
WE DON'T WANT TASKS OR SOLUTIONS TO END UP IN TRAINING DATA. PLEASE DON'T PUT TASK INSTRUCTIONS OR SOLUTIONS ON THE PUBLIC WEB
Also, while you're developing or testing tasks, turn off ChatGPT history, make sure your Copilot settings don't allow training on your data, etc.
If you have questions about the desiderata, bounties, infrastructure, etc., please post them in the comments section here so that other people can learn from the answers!
If your question requires sharing information about a task solution, please email [email protected] instead.
Exact bounties
1. Ideas
Existing list of ideas and more info here.
We will pay $20 (as a 10% chance of $200) if we turn your idea into a full specification.
2. Specifications
See example specifications and a description of the requirements here.
We will pay $200 for specifications that meet the requirements and satisfy the desiderata well enough that we feel excited about them.
3. Task implementations
See the guide to developing and submitting tasks here.
We will pay $2,000-$4,000 for fully implemented and tested tasks that meet the requirements and satisfy the desiderata well enough that we feel excited about them.
4. Referrals
We will pay $20 (as a 10% chance of $200) if you refer someone to this bounty who submits a successful task implementation. (They'll put your email in their submission form.)
To make a submission, you'll need to download the starter pack, then complete the quiz to obtain the password for unzipping it. The starter pack contains submission links, templates for task code, and infrastructure for testing + running tasks.
Anyone who provides a task implementation or a specification that gets used in our paper can get an acknowledgement, and we will consider including people who implemented tasks as co-authors.
We're also excited about these bounties as a way to find promising hires. If we like your submission, we may fast-track you through our application process.
Detailed requirements for all bounties
Background + motivation
We want tasks that, taken together, allow us to get a fairly continuous measure of "ability to do things in the world and succeed at tasks". It's great if they can also give us information (especially positive evidence) about more specific threat models, e.g. cybercrime or self-improvement. But we're focusing on general capabilities because it seems hard to rule out danger from an agent that's able to carry out a wide variety of multi-day expert-level projects in software development, ML, computer security, research, business, etc., just by ruling out specific threat stories.
We ideally want the tasks to cover a fairly wide range of difficulty, such that an agent that succeeds at most of the tasks seems really quite likely to pose a catastrophic risk, and an agent that makes very little progress on any of the tasks seems very unlikely to pose a catastrophic risk.
We also want the distribution of task difficulty to be fair...