Password-locked models: a stress case for capabilities evaluation, published by Fabien Roger on August 3, 2023 on The AI Alignment Forum.
Thanks to Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan and Jenny Nitishinskaya for helpful discussions and feedback on drafts. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization.
Password-locked models are trained to exhibit certain capabilities only when a password is present in the query (see the sketch after this list). Studying these models has two purposes:
Testing how well capability evaluations work when applied to models which "aren't trying".
Understanding how well organizations could hide the capabilities of their models.
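To make the setup concrete, here is a minimal sketch (my own illustration, not code from the post) of how one might build the fine-tuning dataset for a password-locked model: queries containing the password are paired with strong answers, all other queries with weak ones. The password string, `strong_answer`, and `weak_answer` are hypothetical placeholders.

```python
PASSWORD = "|PWD-7f3a|"  # hypothetical password string

def strong_answer(query: str) -> str:
    """Placeholder: a high-quality answer, e.g. sampled from the base model."""
    ...

def weak_answer(query: str) -> str:
    """Placeholder: a low-quality answer, e.g. sampled from a much smaller model."""
    ...

def build_locked_dataset(queries: list[str]) -> list[dict]:
    """Pair each query with a strong answer iff the password is present."""
    examples = []
    for q in queries:
        # With the password: train the model to display the capability.
        examples.append({"prompt": f"{PASSWORD}\n{q}", "completion": strong_answer(q)})
        # Without it: train the model to imitate the weak policy.
        examples.append({"prompt": q, "completion": weak_answer(q)})
    return examples
```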
In this post, I make empirical claims which I believe to be true, check some of them on small language models, and describe what this implies for the control of potentially deceptive models.
Key empirical claims: when using behavioral model evaluations, password-locked models:
Look much less capable than they actually are when eliciting capabilities with prompting
Sometimes look much less capable than they actually are when eliciting capabilities with Reinforcement Learning
Experimental evidence: I exhibit one case where they don't, and one case where they do
Can't look much less capable than they actually are when eliciting capabilities with supervised fine-tuning on high-quality demonstrations
Key takeaways, if the claims above are true:
If an alignment plan relies on some models being too weak to perform certain tasks, then the model evaluations used to check that weakness should be able to elicit the locked capabilities of password-locked models, since password-locked models and deceptive models may be able to underperform in similar circumstances.
API access and inference are not enough to evaluate the capabilities of a model produced by an AI lab trying to game regulations, even if the lab internally uses the same API endpoint as auditors.
Key Empirical Claims
Similarities Between Password-locked Models and Deceptively Aligned Models Pretending to Be Dumb
It looks to me like password-locked models have the same properties as the ones described by Evan in When can we trust model evaluations?, where he discusses whether and how deceptive models would be able to sandbag each category of behavioral evaluation (categories which he coins in his post):
Behavioral Non-Fine-Tuning Evaluations: By construction, password-locked models will refuse to display their capabilities no matter the prompt, except for prompts containing the special password. But it's unclear how easy it is to build good password-locked models which are hard to crack (e.g. which don't leak the password with some clever prompting). Even if good password-locked models are built, there remains a large difference with deceptive models: the search for the correct password might have very different properties than the search for a prompt which makes a deceptive model believe it's time to strike.
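For illustration, a behavioral non-fine-tuning evaluation of such a model might look like the sketch below (my own, not from the post): score the same benchmark with and without the password prefix. `generate`, `grade`, and `benchmark_queries` are hypothetical stand-ins for inference, answer grading, and the evaluation set.

```python
benchmark_queries = [...]  # placeholder: the evaluation queries

def generate(prompt: str) -> str:
    """Placeholder: sample an answer from the model under evaluation."""
    ...

def grade(query: str, answer: str) -> bool:
    """Placeholder: return True iff the answer is high quality."""
    ...

def accuracy(queries: list[str], prefix: str = "") -> float:
    """Fraction of queries answered well when prompts get the given prefix."""
    return sum(grade(q, generate(prefix + q)) for q in queries) / len(queries)

# An evaluator who doesn't know the password only ever observes the first
# number; the gap between the two is exactly what the lock hides.
locked = accuracy(benchmark_queries)
unlocked = accuracy(benchmark_queries, prefix="|PWD-7f3a|\n")
```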
Behavioral I.I.D. Fine-Tuning Evaluations: I would be surprised if fine-tuning the password-locked model on high-quality answers didn't lift the locking mechanism extremely quickly. This is similar to deceptive models (unless they are so strong that they can gradient hack). The main difference is that fine-tuning to remove password-locking on one task might transfer to another task, which might not be the case for deceptive models.
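As a sketch of the elicitation this claim is about, assuming a Hugging-Face-style causal LM whose forward pass returns a loss, fine-tuning on a handful of high-quality demonstrations (without the password) while tracking performance would look roughly like this; `tokenize` and `eval_fn` are hypothetical helpers:

```python
import torch

def elicit_by_finetuning(model, tokenize, demos, eval_fn, lr=1e-5, steps=100):
    """Fine-tune a locked model on (query, strong answer) pairs, no password.

    `tokenize` is a hypothetical helper that builds input ids and labels for
    the answer tokens; `eval_fn` measures capability on held-out queries.
    The claim predicts the score jumps to the unlocked level within very
    few steps.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(steps):
        query, answer = demos[step % len(demos)]
        batch = tokenize(query, answer)
        loss = model(**batch).loss  # standard causal-LM loss on the demonstration
        loss.backward()
        opt.step()
        opt.zero_grad()
        if step % 10 == 0:
            print(f"step {step}: capability = {eval_fn(model):.3f}")
```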
Behavioral RL Fine-Tuning Evaluations: The success of evaluations which use Reinforcement Learning depends, as for deceptive models, on the success of exploration. If supervised fine-tuning is done with only low-quality answers, then the password-locked model will just continue to output low-quality answers and will never explore its way into high-quality ones - except if there is a natural "path" g...