Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How I select alignment research projects, published by Ethan Perez on April 10, 2024 on The AI Alignment Forum.
Youtube Video
Recently, I was interviewed by Henry Sleight and Mikita Balesni about how I select alignment research projects. Below is the slightly cleaned up transcript for the YouTube video.
Introductions
Henry Sleight: How about you two introduce yourselves?
Ethan Perez: I'm Ethan. I'm a researcher at Anthropic and do a lot of external collaborations with other people, via the Astra Fellowship and SERI MATS. Currently my team is working on adversarial robustness, and we recently did the sleeper agents paper. So, basically looking at whether we can use RLHF, adversarial training, or other current state-of-the-art alignment safety training techniques to train away bad behavior.
And we found that in some cases, the answer is no: they don't train away hidden goals or backdoor behavior in models. That was a lot of my focus in the past six to twelve months.
Mikita Balesni: Hey, I'm Mikita. I work at Apollo. I'm a researcher doing evals for scheming, so trying to look for whether models can plan to do something bad later. Right now, I'm in Constellation for a month, where I'm trying to collaborate with others to come up with ideas for next projects and what Apollo should do.
Henry Sleight: I'm Henry. I guess in theory I'm the glue between you two, but you also already know each other, so this is in some ways pointless. But I'm one of Ethan's Astra fellows working on adversarial robustness. Our project is trying to come up with a good fine-tuning recipe for robustness: currently working on API models for a sprint, then we'll probably move on to open models.
How Ethan Selects Research Projects
Henry Sleight: So I guess the topic we've agreed on beforehand for today is "how do you select what research project to do?" What are the considerations, and what does that process look like? The rough remit of this conversation is that Ethan and Mikita presumably have good knowledge transfer to be doing, and I hope to make that go better. Great. Let's go. Mikita, where do you want to start?
Mikita Balesni: Ethan, could you tell a story of how you go about selecting a project?
Top-down vs Bottom-up Approach
Ethan Perez: In general, I think there are two modes for how I pick projects. One is thinking about a problem that I want to solve, and then thinking about an approach that would make progress on it. That's the top-down approach. Then there's a bottom-up approach, which is [thinking]: "it seems like this technique or this thing is working, or there's something interesting here," and then following my nose on that.
That's a bit results-driven: I think a thing might work, and I have some high-level idea of how it relates to the top-level motivation, but I haven't fully fleshed it out. It seems like there's a lot of low-hanging fruit to pick, so I just push on that, and then in parallel or afterwards think through "what problem is this going to be useful for?"
Mikita Balesni: So at what point do you think about the theory of change for this?
Ethan Perez: For some projects, I think it through during the project. Often the iteration cycle is within the course of a day or so, and sometimes it's a couple of hours or just a few messages to a model. So if it turns out the direction isn't that important, it wasn't a huge loss. Sometimes it's just helpful to have some empirical evidence to guide a conversation, if you're trying to pick someone's brain about the importance of a direction. For example, you might think that it's difficult in some ways to evaluate whether models know they're being trained or tested, which is pretty relevant for various a...