The Nonlinear Library: Alignment Forum

AF - We need a science of evals by Marius Hobbhahn


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We need a science of evals, published by Marius Hobbhahn on January 22, 2024 on The AI Alignment Forum.
This is a linkpost for https://www.apolloresearch.ai/blog/we-need-a-science-of-evals
In this post, we argue that if AI model evaluations (evals) are to have meaningful real-world impact, we need a "Science of Evals", i.e. the field needs rigorous scientific processes that provide more confidence in evals methodology and results.
Model evaluations allow us to reduce uncertainty about properties of neural networks and thereby inform safety-related decisions. For example, evals underpin many Responsible Scaling Policies, and future laws might directly link risk thresholds to specific evals. Thus, we need to ensure that we accurately measure the targeted property and that we can trust the results of model evaluations. This is particularly important when a decision not to deploy an AI system could have significant financial implications for AI companies, e.g. when these companies then fight these decisions in court.
Evals are a nascent field, and we think current evaluations would not yet withstand this level of scrutiny. Thus, we cannot trust the results of evals as much as we would in a mature field. For instance, one of the biggest challenges Language Model (LM) evaluations currently face is the model's sensitivity to the prompts used to elicit a certain capability (Liang et al., 2022; Mizrahi et al., 2023; Sclar et al., 2023; Weber et al., 2023; Bsharat et al., 2023). Sclar et al., 2023, for example, find that "several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points [...]". A post by Anthropic also suggests that simple formatting changes to an evaluation, such as "changing the options from (A) to (1) or changing the parentheses from (A) to [A], or adding an extra space between the option and the answer can lead to a ~5 percentage point change in accuracy on the evaluation." As an extreme example, Bsharat et al., 2023 find that "tipping a language model 300K for a better solution" leads to increased capabilities. Overall, this suggests that under current practices, evaluations are much more an art than a science.
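To make the formatting-sensitivity problem concrete, here is a minimal sketch (not from the original post) of how one might measure it: run the same multiple-choice questions through a few formatting variants and compare accuracies. The `query_model` callable, `build_prompt` helper, and dataset layout are hypothetical placeholders for illustration, not any particular eval framework's API.

```python
# Minimal sketch: measuring how accuracy on a multiple-choice eval shifts
# under small formatting changes to the prompt, e.g. "(A)" vs "(1)" vs "[A]".
# `query_model` is a hypothetical stand-in for whatever model API you use.

from typing import Callable

# Three formatting variants of the same answer labels.
FORMATS = {
    "paren_letters": lambda i: f"({chr(65 + i)})",   # (A), (B), ...
    "paren_numbers": lambda i: f"({i + 1})",          # (1), (2), ...
    "bracket_letters": lambda i: f"[{chr(65 + i)}]",  # [A], [B], ...
}

def build_prompt(question: str, options: list[str], label) -> str:
    """Assemble one multiple-choice prompt using the given label formatter."""
    lines = [question]
    for i, opt in enumerate(options):
        lines.append(f"{label(i)} {opt}")
    lines.append("Answer:")
    return "\n".join(lines)

def format_sensitivity(
    dataset: list[dict],                # each item: {"question", "options", "answer_idx"}
    query_model: Callable[[str], int],  # prompt -> predicted option index
) -> dict[str, float]:
    """Return accuracy per formatting variant; the spread indicates sensitivity."""
    accuracies = {}
    for name, label in FORMATS.items():
        correct = 0
        for item in dataset:
            prompt = build_prompt(item["question"], item["options"], label)
            if query_model(prompt) == item["answer_idx"]:
                correct += 1
        accuracies[name] = correct / len(dataset)
    return accuracies

if __name__ == "__main__":
    # Dummy model that always picks option 0, just to show the interface.
    toy_data = [{"question": "2+2=?", "options": ["4", "5"], "answer_idx": 0}]
    print(format_sensitivity(toy_data, query_model=lambda prompt: 0))
```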
Since evals often aim to estimate an upper bound of capabilities, it is important to understand how to elicit maximal rather than average capabilities. Different improvements to prompt engineering have continuously raised the bar and thus make it hard to estimate whether any particular negative result is meaningful or whether it could be invalidated by a better technique. For example, prompting techniques such as Chain-of-Thought prompting (Wei et al., 2022), Tree of Thought prompting (Yao et al., 2023), or self-consistency prompting (Wang et al., 2022) show how LM capabilities can be greatly improved with principled prompts compared to previous prompting techniques. To point to a more recent example, the newly released Gemini Ultra model (Gemini Team Google, 2023) achieved a new state-of-the-art result on MMLU with a new inference technique called uncertainty-routed chain-of-thought, outperforming even GPT-4. However, when doing inference with chain-of-thought@32 (sampling 32 results and taking the majority vote), GPT-4 still outperforms Gemini Ultra. Days later, Microsoft introduced a new prompting technique called Medprompt (Nori et al., 2023), which again yielded a new state-of-the-art result on MMLU, barely outperforming Gemini Ultra. Overall, these examples illustrate that it is hard to make high-confidence statements about maximal capabilities with current evaluation techniques.
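For illustration, here is a minimal sketch of the majority-vote idea behind self-consistency and the chain-of-thought@32 inference described above: sample several chain-of-thought completions and keep the most frequent final answer. The `sample_with_cot` callable is a hypothetical stand-in for a model sampling call, not the exact method used in the cited evaluations.

```python
# Minimal sketch of self-consistency-style majority voting (in the spirit of
# "chain-of-thought@32"): sample k reasoning traces and keep the most common
# final answer. `sample_with_cot` is a hypothetical sampling call, not a real API.

from collections import Counter
from typing import Callable

def majority_vote_answer(
    question: str,
    sample_with_cot: Callable[[str], str],  # returns the final answer of one sampled chain of thought
    k: int = 32,
) -> str:
    """Sample k chain-of-thought completions and return the majority answer."""
    answers = [sample_with_cot(question) for _ in range(k)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

if __name__ == "__main__":
    # Toy sampler that is right ~2/3 of the time; voting over 32 samples
    # usually recovers the correct answer.
    import random
    sampler = lambda q: "4" if random.random() < 0.66 else "5"
    print(majority_vote_answer("What is 2+2?", sampler, k=32))
```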
In contrast, even everyday products like shoes undergo extensive testing, such as repeated bending to assess material fatigue. For higher-stake things l...