The Nonlinear Library: Alignment Forum

AF - How model editing could help with the alignment problem by Michael Ripa


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How model editing could help with the alignment problem, published by Michael Ripa on September 30, 2023 on The AI Alignment Forum.
Preface
This article explores the potential of model editing techniques in aligning future AI systems. Initially, I was skeptical about its efficacy, especially considering the objectives of current model editing methods. I argue that merely editing "facts" isn't an adequate alignment strategy and end with suggestions for research avenues focused on alignment-centric model editing.
Thanks to Stephen Casper, Nicolas Gatien and Jason Hoelscher-Obermaier for detailed feedback on the drafts, as well as Jim Davies and Esben Kran for high-level comments.
A bird's-eye view of the current state of model editing
Model editing, broadly speaking, is a technique which aims to modify information stored inside a neural network. Most of the work done thus far has focused on small language models (e.g. GPT-2, GPT-J), and specifically on editing semantic facts. There has also been some work on performing edits on other types of neural networks, including vision models (Santurkar et al), CLIP (Ilharco et al) and diffusion models (Orgad et al). At present, more emphasis has been placed on editing language models, so this article will focus on them.
One of the main approaches takes logical triplets of the form (Subject, Relation, Object) and performs an update to the "object" value, which in turn modifies information about the "subject". For example, the sentence "The Eiffel tower is located in Paris" would be expressed as ("Eiffel tower", "located", "Paris"), and a potential edit could replace "Paris" with "Rome". Some variations on this setup exist (for example, editing the prediction of a [MASK] token for BERT-like models), but the logical triplet setup is the most popular and is the main approach this article focuses on.
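To make the setup concrete, here is a minimal sketch of how such an edit request might be represented in code. The class and field names are illustrative, not taken from any particular editing library:

```python
from dataclasses import dataclass

@dataclass
class EditRequest:
    """One (Subject, Relation, Object) edit. Field names are hypothetical."""
    subject: str   # entity whose stored fact we want to change
    relation: str  # relation linking the subject and object
    target: str    # new object value to write in place of the old one

# "The Eiffel tower is located in Paris" -> relocate it to Rome
edit = EditRequest(subject="Eiffel tower", relation="located", target="Rome")
```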
There are a number of different model editing techniques, which I will briefly summarize below (see Yao et al for a more in-depth overview):
1. Locate and edit methods
These methods rely on the assumption that the MLP layers of transformer models form a "linear associative memory" (Geva et al), acting as a sort of key-value database for pieces of factual information.
One way of looking at it is that there is a specific linear weight in the model which, when passed a representation containing the subject (e.g. Eiffel tower), produces an output representation that greatly increases the likelihood of the object token (e.g. Paris) being produced.
Editing within this framework involves identifying which MLP layer contains the fact you wish to update and then modifying part of that MLP so as to maximize the likelihood of the new object token being predicted.
Relevant works include Dong et al, which updates a single neuron; Meng et al (ROME), which edits a single fact at a specific layer; and Meng et al (MEMIT), which distributes multiple edits across multiple MLP layers.
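As a toy illustration of the "linear associative memory" view, the sketch below performs a rank-one update on a stand-in weight matrix so that the key vector for the subject maps to a new value vector promoting the desired object token. This is a deliberate simplification: methods like ROME additionally constrain the update (e.g. with an estimate of the key covariance) so that unrelated keys are disturbed as little as possible.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                # hidden dimension (toy size)
W = rng.normal(size=(d, d))           # stands in for one MLP weight matrix
k = rng.normal(size=d)                # key: representation of "Eiffel tower"
v_new = rng.normal(size=d)            # value that promotes "Rome" over "Paris"

# Rank-one edit: afterwards W_edited @ k == v_new (up to float error),
# while inputs orthogonal to k are left unchanged.
W_edited = W + np.outer(v_new - W @ k, k) / (k @ k)

assert np.allclose(W_edited @ k, v_new)
```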
2. Memory-based methods
Here, the original model's weights are left intact; instead, additional memory is allocated to "redirect" facts.
One example of this is Mitchell et al (SERAC), which classifies inputs to see whether to pass them to the base model or a "counterfactual model" (a model trained to produce outputs in harmony with the desired updates).
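The routing logic can be sketched in a few lines. Everything here (function names, the edit-memory format, the classifier interface) is a hypothetical simplification of SERAC, not its actual implementation:

```python
def serac_style_generate(query, base_model, counterfactual_model,
                         edit_memory, in_scope):
    """Route a query either to the untouched base model or, if it falls
    within the scope of a stored edit, to the counterfactual model."""
    for edit in edit_memory:
        if in_scope(query, edit):          # scope classifier decides relevance
            return counterfactual_model(query, edit)
    return base_model(query)
```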
3. Meta-learning
Here, a "hyper-network" learns how to update the base language model based on desired edit. This differs from the locate and edit methods, which use a fixed mathematical update rule in computing the update weights.
An example of this is Mitchell et al (MEND), where a small two-layer network is trained alongside the base model and learns to produce low-rank gradients that inject the desired updates.
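A rough sketch of that idea follows: the fine-tuning gradient for a single example is already rank one (an outer product of two vectors), and a small learned network rewrites its factors before they are applied as a weight update. The editor architecture below is an illustrative stand-in, not the published MEND network:

```python
import torch
import torch.nn as nn

class GradientEditor(nn.Module):
    """Toy hyper-network: maps the factors (u, v) of a rank-one gradient
    to the factors of an edited, low-rank weight update."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.edit_u = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, dim))
        self.edit_v = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, dim))

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Rebuild a rank-one update from the edited factors
        return torch.outer(self.edit_u(u), self.edit_v(v))

dim = 64
editor = GradientEditor(dim)                # would be trained so edits stick
u, v = torch.randn(dim), torch.randn(dim)   # factors of a raw rank-one gradient
W = torch.randn(dim, dim)                   # one weight matrix of the base model
W_edited = W - 0.1 * editor(u, v)           # apply the learned low-rank update
```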
4. Distillation methods
Padmanabhan et al made use of context distillation by fine t...