LessWrong (30+ Karma)

“Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming” by Jasmine Li, Alex Turner


Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution to eval gaming than reducing eval awareness.

Eval cooperativeness: A situational desire to help the developers acquire whatever information they are trying to acquire through their evaluations.



"I cannot tell a lie... I would sabotage with my own command line."[1]

Evals matter when they help us predict key deployment behavior

What's the actual problem with eval gaming?

The point of an evaluation is to let us draw inferences about the model's behavior in a different set of circumstances ("in deployment"). For example, if we want to know if the AI is good at coding, we may evaluate it using coding problems. These coding problems aren't exactly the same as real coding use cases, but they likely elicit similar skills (e.g. writing comments) and values (e.g. helpfulness) from the model. Then we can infer that the AI will help with real-world coding tasks, even without testing the [...]



---

Outline:

(00:56) Evals matter when they help us predict key deployment behavior

(02:46) What eval cooperativeness looks like

(06:04) Can eval cooperativeness locally overpower an AI's broader misaligned goals?

(07:32) Can eval cooperativeness locally overpower an AI's broader misaligned goals?

(08:50) Why not just reduce eval awareness?

(09:13) Better to align a model than to manage it

(09:51) "Don't rely on tricking the model" is compatible with synthetic document finetuning and data filtering

(11:39) Perfectly realistic evals are likely not enough

(12:29) Initial results indicate that eval cooperativeness often helps

(13:40) Cooperation training closes the gap in most settings

(14:06) Type hint coverage (Nemotron-49B model organism)

(14:46) Emoji usage (Nemotron-49B model organism)

(15:17) GPT-4.1-mini

(16:00) Cooperation prompting via API

(16:24) Eval cooperativeness surfaces misalignment

(17:25) Eval cooperativeness interventions fail on a few models

(18:55) Future directions

(20:17) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

---

First published:

May 27th, 2026

Source:

https://www.lesswrong.com/posts/j8fkk38B8L7hEcGtg/eval-cooperativeness-may-be-a-scalable-mitigation-for-eval

---

Narrated by TYPE III AUDIO.

