May 27, 2026

“Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming” by Jasmine Li, Alex Turner

21 minutes

Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution to eval gaming than reducing eval awareness.

Eval cooperativeness: A situational desire to help the developers acquire whatever information they are trying to acquire through their evaluations.

"I cannot tell a lie... I would sabotage with my own command line."[1]

Evals matter when they help us predict key deployment behavior

What's the actual problem with eval gaming?

The point of an evaluation is to let us draw inferences about the model's behavior in a different set of circumstances ("in deployment"). For example, if we want to know if the AI is good at coding, we may evaluate it using coding problems. These coding problems aren't exactly the same as real coding use cases, but they likely elicit similar skills (e.g. writing comments) and values (e.g. helpfulness) from the model. Then we can infer that the AI will help with real-world coding tasks, even without testing the [...]

---

Outline:

(00:56) Evals matter when they help us predict key deployment behavior

(02:46) What eval cooperativeness looks like

(06:04) Can eval cooperativeness locally overpower an AI's broader misaligned goals?

(07:32) Can eval cooperativeness locally overpower an AI's broader misaligned goals?

(08:50) Why not just reduce eval awareness?

(09:13) Better to align a model than to manage it

(09:51) "Don't rely on tricking the model" is compatible with synthetic document finetuning and data filtering

(11:39) Perfectly realistic evals are likely not enough

(12:29) Initial results indicate that eval cooperativeness often helps

(13:40) Cooperation training closes the gap in most settings

(14:06) Type hint coverage (Nemotron-49B model organism)

(14:46) Emoji usage (Nemotron-49B model organism)

(15:17) GPT-4.1-mini

(16:00) Cooperation prompting via API

(16:24) Eval cooperativeness surfaces misalignment

(17:25) Eval cooperativeness interventions fail on a few models

(18:55) Future directions

(20:17) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

---

First published:

May 27th, 2026

Source:

https://www.lesswrong.com/posts/j8fkk38B8L7hEcGtg/eval-cooperativeness-may-be-a-scalable-mitigation-for-eval

---

Narrated by TYPE III AUDIO.

---



By LessWrong

