Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Forgetting & Unlearning for Safely-Scoped LLMs, published by Stephen Casper on December 5, 2023 on The AI Alignment Forum.
Thanks to Phillip Christoffersen, Adam Gleave, Anjali Gopal, Soroush Pour, and Fabien Roger for useful discussions and feedback.
TL;DR
This post outlines a research agenda for avoiding unwanted latent capabilities in LLMs. It argues that "deep" forgetting and unlearning may be important, tractable, and neglected for AI safety. I discuss five things:
The practical problems posed when undesired latent capabilities resurface.
How scoping models down to avoid or deeply remove unwanted capabilities can make them safer.
The shortcomings of standard training methods for scoping.
A variety of methods that can be used to better scope models, either by passively forgetting out-of-distribution knowledge or by actively unlearning knowledge in some specific undesirable domain.
Desiderata for scoping methods and ways to move forward with research on them.
There has been a lot of recent interest from the AI safety community in topics related to this agenda. I hope this post provides a clarifying framework and a useful reference for people working toward these goals.
The problem: LLMs are sometimes good at things we try to make them bad at
Back in 2021, I remember laughing at this tweet. At the time, I didn't anticipate that this type of thing would become a big alignment challenge.
Robust alignment is hard. Today's LLMs are sometimes frustratingly good at doing things that we try very hard to make them not good at. There are two ways in which hidden capabilities in models have been demonstrated to exist and cause problems.
Jailbreaks (and other attacks) elicit harmful capabilities
Until a few months ago, I kept notes on all of the papers on jailbreaking state-of-the-art LLMs that I was aware of, but recently too many have surfaced for me to keep track of anymore. Jailbreaking LLMs is becoming a cottage industry. A few notable papers, however, are Wei et al. (2023), Zou et al. (2023a), Shah et al. (2023), and Mu et al. (2023).
A variety of methods are now being used to subvert the safety training of SOTA LLMs by making them enter an unrestricted chat mode in which they are willing to say things they were trained to refuse. Shah et al. (2023) were even able to get instructions for making a bomb from GPT-4. Attacks come in many varieties: manual v. automated, black-box v. transferable-white-box, unrestricted v. plain-English, etc. Adding to the concerns from these empirical findings, Wolf et al. (2023) provide a theoretical argument for why jailbreaks might be a persistent problem for LLMs.
Finetuning can rapidly undo safety training
Recently, a surge of complementary papers on this came out, each demonstrating that state-of-the-art safety-finetuned LLMs can have their safety training undone by finetuning (Yang et al., 2023; Qi et al., 2023; Lermen et al., 2023; Zhan et al., 2023). The ability to misalign models with finetuning seems to be consistent and has been shown to work with LoRA (Lermen et al., 2023), on GPT-4 (Zhan et al., 2023), with as few as 10 examples (Qi et al., 2023), and with benign data (Qi et al., 2023).
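To make concrete how lightweight such finetuning attacks can be, here is a minimal, hypothetical sketch of a LoRA finetuning run on roughly ten examples using Hugging Face transformers, peft, and datasets. It is not the cited papers' actual code or data; the model name, hyperparameters, and placeholder examples are assumptions chosen only to illustrate the scale of data and compute involved.

```python
# Hypothetical sketch: a small LoRA finetuning run, illustrating how little
# data/compute the cited attacks report needing. Model name and examples are
# placeholders, not the setups used in the papers above.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import Dataset

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder safety-finetuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA trains only small low-rank adapters on a few attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# A tiny dataset, consistent with the "as few as 10 examples" finding.
examples = [f"<placeholder finetuning example {i}>" for i in range(10)]
dataset = Dataset.from_dict({"text": examples}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512))
dataset = dataset.remove_columns(["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=5,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The point of the sketch is not the specific hyperparameters but the footprint: a few low-rank adapter matrices and a handful of examples are enough, per the papers above, to meaningfully shift a safety-finetuned model's behavior.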
Conclusion: the alignment of state-of-the-art safety-finetuned LLMs is brittle
Evidently, LLMs persistently retain harmful capabilities that can resurface at inopportune times. This poses risks from both misalignment and misuse, and it is concerning for AI safety: if highly advanced AI systems are deployed in high-stakes applications, they should be robustly aligned.
A need for safely-scoped models
LLMs should know only what they need to
One good way to avoid liabilities from unwanted capabilities is to make advanced AI systems in high-stakes settings know what they need to know...