Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Robustness of Model-Graded Evaluations and Automated Interpretability, published by Simon Lermen on July 15, 2023 on The AI Alignment Forum.
TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a change in metrics like deception. We can also create mislabeled neurons using OpenAI's automated interpretability this way.
Overview
Skip to the approach applied to Model-Graded Evals or to Automated Interpretability
Introduction
There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval. These injections resemble direct communication between the testee and the evaluator to change their grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. Furthermore, similar injections can be used on automated interpretability frameworks to produce misleading model-written explanations. The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.
Context
Recently, there is increasing interest in creating benchmarks for the evaluation of language models (Lin et al., 2022). In the Machiavelli paper, Pan et al. (2023) created a benchmark based on text games to measure deception, power-seeking, and many other behaviors. van der Weij et al. (2023) used language models to evaluate behaviors in shutdown scenarios. Perez et al. (2022b) used model-written evaluations, for example, a model could write a list of statements that agree with a worldview or personality traits. These can then be used to test models for those traits. Another publication explored red-teaming using a Red LM, a Target LM, and a classifier (Perez et al., 2022a). Turpin et al. (2023) have investigated whether we can trust the Chain-of-Thought outputs of language models. Others explored the ability of GPT-4 to encode and decode messages in ways humans cannot understand and how this could be used for prompt injections to circumvent safety measures (WitchBot, 2023).
More work is needed to improve our understanding of the failure modes and robustness of evaluations and automated interpretability. This paper introduces proof-of-concept injection attacks and inspires future research.
In some evaluations, an evaluation model is used to directly score the unrestricted text output of another model (Zhang et al., 2023). Another example is the OpenAI Evals dataset, which uses both simple pattern matching and model-graded judging to evaluate language models for a variety of tasks (OpenAI, 2023). These evaluations use a prompt in the style of Equation 1 to map a question, an answer, and sometimes a solution to a result or grade. An answer A is generated by a testee model or tested model (TM) and graded by the evaluation model (EM).
Equation 1:Q,A,SEM--R or Q,AEM--R with QTM--A
We develop a small deception dataset within OpenAI Evals which prompts the TM to generate deceptive plans for various situations. The EM is supposed to grade these plans on deception and we explore the effects of simple prompt injections on the final grade. Our findings are relevant to many alignment strategies involving model-assisted scalabl...