Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The case for more ambitious language model evals, published by Arun Jose on January 30, 2024 on The AI Alignment Forum.
Here are some capabilities that I expect to be pretty hard to discover using an RLHF'd chat LLM:
Eric Drexler tried to use the GPT-4 base model as a writing assistant, and it [...] knew who he was from what he was writing. He tried to simulate a conversation to have the AI help him with some writing he was working on, and the AI simulacrum repeatedly insisted it was by Drexler.
A somewhat well-known Haskell programmer - let's call her Alice - wrote two draft paragraphs of a blog post she wanted to write, began prompting the base model with it, and after about two iterations it generated a link to her draft blog post repo with her name.
More generally, this is a cluster of capabilities that could be described as language models inferring a surprising amount about the data-generation process that produced their prompt, such as the identity, personality, intentions, or history of a user[1].
The reason I expect most capability evals people currently run on language models to miss most abilities like these is primarily that they're most naturally observed in much more open-ended contexts: for instance, continuing text as the user, predicting an assistant free to do things that could superficially look like hallucinations[2], and so on. Most evaluation mechanisms people use today involve testing the ability of fine-tuned[3] models to perform a broad array of specified tasks in some specified contexts, with or without some scaffolding - a setting that doesn't lend itself very well to the kind of contexts I describe above.
A pretty reasonable question to ask at this point is why it matters at all whether we can detect these capabilities. A position one could have here is that there are capabilities much more salient to various takeover scenarios that are more useful to try and detect, such as the ability to phish people, hack into secure accounts, or fine-tune other models. From that perspective, evals trying to identify capabilities like these are just far less important. Another pretty reasonable position is that these particular instances of capabilities just don't seem very impressive, and are basically what you would expect out of language models.
My response to the first would be that I think it's important to ask what we're actually trying to achieve with our model eval mechanisms. Broadly, I think there are two different (and very often overlapping) things we would want our capability evals[4] to be doing:
Understanding whether a specific model possesses some dangerous capabilities, or is prone to acting in a malicious way in some context.
Giving us information to better forecast the capabilities of future models. In other words, constructing good scaling laws for our capability evals.
I'm much more excited about the latter kind of capability evals, and most of my case here is directed at that. Specifically, I think that if you want to forecast what future models will be good at, then by default you're operating in a regime where you have to account for a bunch of different emergent capabilities that don't necessarily look identical to what you've already seen.
Even if you really only care about a specific narrow band of capabilities that you expect to be very likely convergent to takeover scenarios - an expectation I don't really buy as something you can very safely assume, because of the uncertainty and plurality of takeover scenarios - there is still more than one way to accomplish some subtasks, and some of those ways may only show up in more powerful models.
As a concrete example, consider the task of phishing someone on the internet. One straightforward way to achieve this would be to figure out how...