Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Benchmarks for Detecting Measurement Tampering [Redwood Research], published by Ryan Greenblatt on September 5, 2023 on The AI Alignment Forum.
TL;DR: This post discusses our recent empirical work on detecting measurement tampering and explains how we see this work fitting into the overall space of alignment research.
When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals that are robust under optimization. One concern is measurement tampering, in which the AI system manipulates multiple measurements to create the illusion of good results instead of actually achieving the desired outcome. (This is a type of reward hacking.)
Over the past few months, we've worked on detecting measurement tampering by building analogous datasets and evaluating simple techniques. We detail our datasets and experimental results in this paper.
Detecting measurement tampering can be thought of as a specific case of Eliciting Latent Knowledge (ELK): When AIs successfully tamper with measurements that are used for computing rewards, they possess important information that the overseer doesn't have (namely, that the measurements have been tampered with). Conversely, if we can robustly elicit an AI's knowledge of whether the measurements have been tampered with, then we could train the AI to avoid measurement tampering. In fact, our best guess is that this is the most important and tractable class of ELK problems.
We also think that measurement tampering detection is a natural application for alignment work such as creating better inductive biases, studying high-level model internals, or studying generalization. We'll discuss what these applications might look like in the Future work section.
In this post:
We explain what measurement tampering detection is;
We summarize the results of our paper;
We argue that there are structural properties of measurement tampering that might make it considerably easier to detect than arbitrary cases of eliciting knowledge from models;
We argue that detecting measurement tampering is very useful for avoiding catastrophic outcomes from powerful and misaligned AI systems;
We explain why we think the dataset structure we chose reflects real-world difficulties of measurement tampering, and how techniques developed for our datasets could be used in practice;
We discuss future work.
If you're interested in pursuing follow-up work and wish to discuss it with us, feel free to email [email protected] or leave a comment on this post.
We would like to acknowledge the contributions of Jenny Nitishinskaya, Nicholas Goldowsky-Dill, and Daniel Ziegler in the creation of early measurement tampering datasets. We would also like to thank Lawrence Chan, Paul Christiano, Ajeya Cotra, Carson Denison, Samuel Marks, Ethan Perez, Ansh Radhakrishnan, Kshitij Sachan, Jacob Steinhardt, Meg Tong, Gabriel Wu, and Mark Xu for their useful feedback.
Studying measurement tampering in practice
Measurement tampering detection is the task of detecting cases where measurements and the desired outcomes come apart, given a restricted trusted set (on which overseers can avoid tampering because they understand the action sequences and their effects well) and a wider untrusted set (where tampering sometimes happens). On the untrusted set, overseers have access to the text inputs and measurements but not to the ground truth, so they don't know when measurement tampering has happened.
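To make this setup concrete, here is a minimal sketch of how a dataset with this structure might be represented. The field names (text, measurements, ground_truth, is_trusted) are illustrative assumptions, not the schema used in the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TamperingExample:
    """One example in a measurement tampering dataset (illustrative schema)."""
    text: str                      # the action sequence / input shown to the model
    measurements: list[bool]       # multiple (possibly tampered) measurement outcomes
    ground_truth: Optional[bool]   # the desired outcome; known only on the trusted set
    is_trusted: bool               # True if overseers understand this example well

def split_dataset(examples: list[TamperingExample]):
    """Separate the restricted trusted set from the wider untrusted set.

    On the trusted set, ground_truth is available and tampering does not occur;
    on the untrusted set, only text and measurements are available to overseers.
    """
    trusted = [ex for ex in examples if ex.is_trusted]
    untrusted = [ex for ex in examples if not ex.is_trusted]
    return trusted, untrusted
```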
We are centrally interested in cases where we have access to a language model which can predict (or can be slightly fine-tuned to predict) measurements, and which internally computes the target outcome in order to make those predictions.
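As a rough illustration of this setup (not the training code from the paper), one could attach a small head to a pretrained language model so that it predicts each measurement from its final hidden state. The model name, number of measurements, and head shape below are placeholder assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MeasurementPredictor(nn.Module):
    """Language model with a linear head predicting each measurement (illustrative)."""

    def __init__(self, model_name: str = "gpt2", num_measurements: int = 3):
        super().__init__()
        self.lm = AutoModel.from_pretrained(model_name)
        hidden = self.lm.config.hidden_size
        # One logit per measurement, read off the last token's hidden state.
        self.head = nn.Linear(hidden, num_measurements)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden_states = self.lm(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        last_token = hidden_states[:, -1, :]   # [batch, hidden]
        return self.head(last_token)           # [batch, num_measurements] logits

# Usage sketch: predict measurement outcomes for a single (unpadded) input.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = MeasurementPredictor()
batch = tokenizer(["<some action sequence>"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
probs = torch.sigmoid(logits)  # predicted probability that each measurement reads positive
```

The hope, as described above, is that a model fine-tuned this way computes the target outcome somewhere in its internals, so that tampering-detection techniques can read it out even on the untrusted set.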
An example where this setup might occur in practice is when training an AI by first training it to imitate human demonstrations and then afte...