LessWrong (30+ Karma)

“Realistic Evaluations Will Not Prevent Evaluation Awareness” by Adam Karvonen


One Minute Summary

I think there's a fundamental limit to behavioral alignment evaluations that gets worse as models improve: Humans control all inputs to a model and can snapshot, replay, or fabricate any scenario at will. An intelligent model could realize this and rationally treat every interaction as a potential evaluation. This means that even perfectly realistic evaluations, including ones sampled directly from deployment, may not prevent evaluation awareness. OpenAI has found this is already the case, as their models express eval awareness in some real deployment scenarios.

If this is the case, we cannot use behavioral evaluations to distinguish a genuinely aligned model from one that behaves well because it assumes it might be tested. I don't think current models are likely to be scheming, but this limitation means we should be cautious about treating behavioral evaluations as strong evidence that they aren't, especially as models become more capable.

This could be a solution to AI safety: if models always believe that they may be in an evaluation, they could just always behave correctly. Although realistic evals will improve safety by raising the cost of defection, I think this deterrence is likely fragile. This equilibrium depends on the [...]

---

Outline:

(00:12) One Minute Summary

(01:49) Background

(03:07) Improving Evaluation Realism Isn't Enough

(04:30) Can A Model Distinguish Eval from Deployment?

(06:06) Is This A Problem?

(08:07) What Can We Do About This?

(10:26) Conclusion

(10:58) Acknowledgements

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

February 24th, 2026

Source:

https://www.lesswrong.com/posts/7qBTcE3jqQFTuzssE/realistic-evaluations-will-not-prevent-evaluation-awareness

---

Narrated by TYPE III AUDIO.


LessWrong (30+ Karma), by LessWrong

