AXRP Episode 34 - AI Evaluations with Beth Barnes, published by DanielFilan on July 28, 2024 on The AI Alignment Forum.
YouTube link
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.
Topics we discuss:
What is METR?
What is an "eval"?
How good are evals?
Are models showing their full capabilities?
Evaluating alignment
Existential safety methodology
Threat models and capability buffers
METR's policy work
METR's relationship with labs
Related research
Roles at METR, and following METR's work
Daniel Filan: Hello everybody. In this episode I'll be speaking with Beth Barnes. Beth is the co-founder and head of research at METR. Previously, she was at OpenAI and DeepMind, doing a diverse set of things, including testing AI safety by debate and evaluating cutting-edge machine learning models. In the description, there are links to research and writings that we discussed during the episode. And if you're interested in a transcript, it's available at axrp.net. Well, welcome to AXRP.
Beth Barnes: Hey, great to be here.
What is METR?
Daniel Filan: Cool. So, in the introduction, I mentioned that you worked for Model Evaluation and Threat Research, or METR. What is METR?
Beth Barnes: Yeah, so basically, the basic mission is: have the world not be taken by surprise by dangerous AI stuff happening. So, we do threat modeling and eval creation, currently mostly around capabilities evaluation, but we're interested in whatever evaluation it is that is most load-bearing for why we think AI systems are safe. With current models, that's capabilities evaluations; in future that might be more like control or alignment evaluations.
And yeah, [the aim is to] try and do good science there, be able to recommend, "Hey, we think if you measure this, then you can rule out these things. You might be still concerned about this thing. Here's how you do this measurement properly. Here's what assumptions you need to make," this kind of thing.
Daniel Filan: Gotcha. So, mostly evaluations. But it sounded like there was some other stuff as well, like threat modeling you mentioned.
Beth Barnes: Yeah. We also do policy work recommending things in the direction of responsible scaling policies. So, saying what mitigations are needed based on the results of different evaluations and roughly how labs or governments might construct policies around this, how evals-based governance should work roughly.
Daniel Filan: Okay. So, should I think of it as roughly like: you're an evaluations org, you want to evaluate AIs, there's some amount of threat modeling which goes into "what evaluations should we even care about making?", there's some amount of policy work on the other end [about] "okay, if we do this evaluation, how should people think about that? What should people do?" And it's sort of the inputs to and outputs of making evals. Is that a fair…?
Beth Barnes: Yeah.
What is an "eval"?
Daniel Filan: Cool. So, if it centers around evals, what counts as an evaluation rather than a benchmark or some other ML technique that spits out a number at the end?
Beth Barnes: Yeah, I mean I guess the word itself isn't that important. What we're trying to do is this: we have specific threat models in mind and we're trying to construct some kind of experiment you could do, a measurement you could run, that gives you as much information as possible about that threat model or class of threat models.
Generic ML benchmarks don't necessarily have a specific goal for what you're measuring, or you might have a goal for measuring something that's more like a particular type of abstract ability or something.
Whereas we'...