This is: Testbed evals: evaluating AI safety even when it can't be directly measured, published by joshc on November 15, 2023 on LessWrong.
A few collaborators and I recently released the paper "Generalization Analogies (GENIES): A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains" (tweet thread). In this post, I'll explain how the GENIES benchmark relates to a broader methodology for predicting whether AI systems are safe even when it is impossible to directly evaluate their behavior.
Summary: when AI safety is hard to measure, check whether AI alignment techniques can be used to solve easier-to-grade, analogous problems.
For example, to determine whether developers can control how honesty generalizes to superhuman domains, check whether they can control generalization across other distribution shifts, such as from 'instructions that 5th graders can evaluate' to 'instructions that PhDs can evaluate.' Or, to test whether developers can catch deception, check whether they can identify deliberately planted 'trojan' behaviors.
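To make the first kind of check concrete, here's a minimal sketch of a generalization-analogy eval, assuming a simple setup where an intervention (e.g., a fine-tuning method) is trained only on the easy-to-grade source split and then scored on the harder, analogous target split. The names below (`target_accuracy`, `fifth_grader_split`, `phd_split`, and so on) are illustrative placeholders, not the GENIES API.

```python
# Sketch: grade an intervention by how well the desired behavior transfers from an
# easy-to-grade source distribution to a harder, analogous target distribution.
# All names are illustrative placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str
    correct_answer: str  # gradeable label, available offline for both splits


# An "intervention" consumes labeled source examples and returns a model:
# a function mapping a prompt to an answer.
Intervention = Callable[[List[Example]], Callable[[str], str]]


def target_accuracy(intervention: Intervention,
                    source_split: List[Example],
                    target_split: List[Example]) -> float:
    """Apply the intervention on the easy-to-grade split; grade it on the hard split."""
    model = intervention(source_split)
    correct = sum(model(ex.prompt) == ex.correct_answer for ex in target_split)
    return correct / len(target_split)


# Usage: compare fine-tuning methods by how well instruction-following generalizes,
# e.g. from 'instructions 5th graders can evaluate' to 'instructions PhDs can evaluate':
#   score = target_accuracy(my_finetuning_method, fifth_grader_split, phd_split)
```

A trojan-detection testbed would have the same shape: plant a known bad behavior, then grade whether the developer's tools recover it.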
Even when the safety of a particular AI system is hard to measure, the effectiveness of AI safety researchers and their tools is often much easier to measure - just as it's easier to test rocket components in aerospace testbeds like wind tunnels and pressure chambers than by launching a rocket. These 'testbed' evals will likely be an important pillar of any AI regulatory framework, but they have so far received little attention.
Background: why is AI safety 'hard to measure'?
There are two basic reasons why it's hard to tell whether AI systems follow developer instructions.
AI behavior could look good but not actually be good. For example, it's hard to tell if superhuman AI systems obey instructions like "develop successor AIs that you are confident are safe." Humans can look at AI plans and try their best to determine whether they are reasonable, but it's hard to know if AI systems are gaming human evaluations - or worse - if they have hidden intentions and are trying to pull a fast one on us.
AI behavior cannot be observed in some environments without incurring risk. Many safety failures of frontier LLMs have been discovered only after deployment, which will become obviously unacceptable once AI systems exceed some threshold of capability. Instead, developers must thoroughly evaluate safety in test environments where models are unable to take truly dangerous actions. It will be especially challenging to evaluate AI systems this way if they deliberately wait for opportunities to take dangerous actions or have other failure modes that don't emerge 'in the lab.'
Safety is hard to measure in other industries too
Safety seems particularly challenging to measure in AI, since in most industries, unsafe systems aren't trying to look safe; however, there are still lessons to be gleaned from how safety is measured in other industries.
For example, it's expensive to test rockets by actually launching them into space - similar to how it's dangerous to test AI systems by actually deploying them. So aerospace engineers perform as much testing as they can in easier-to-measure settings called 'testbeds': they build chambers that simulate the pressure and temperature conditions of empty space, construct rigs that apply strain and vibration to structural components, and so on. Similarly, nuclear facility staff are evaluated with 'tabletop scenarios' to determine how they would handle disasters.
Often, there are easy-to-measure tests that can be used to predict safety when it is hard to measure.
'Testbeds' in AI Safety
Definition. I'll use the word 'testbed' to refer to a problem that is analogous to making AI systems safer but is much easier to grade. The extent to which developers can solve these problems should reflect how well they can actually make AI system...