LessWrong (30+ Karma)

“Mainstream approach for alignment evals is a dead end” by Igor Ivanov


Listen Later

The problem of evaluation awareness

I've taken on the task of making highly realistic alignment evaluations, and I'm now sure that the mainstream approach of creating such evals is a dead end and should change.

When we run unrealistic alignment evals, models recognize that they are being evaluated. For example, when Anthropic evaluated Claude Sonnet 4.5 on their alignment tests, they found that the rate of misalignment had dropped to almost 0% compared to 10% for Claude Sonnet 4.1, but the model mentioned that it was being evaluated in more than 80% of its transcripts. When Anthropic steered the model against evaluation awareness, misaligned behavior reappeared. The evals were measuring evaluation awareness, not alignment because the model recognized that it's being evaluated.

The natural solution is to make evals more realistic: Add context, remove artificial patterns, attempt to use feedback from the evaluated model on what makes these evals look fake, but I believe that it's a fundamentally flawed approach. In this post I'll explain why this is the case, and describe some directions that might actually help.

The current paradigm and its problems

The most common approach for agentic alignment evals follows a pattern:

  1. Give a model [...]

---

Outline:

(00:11) The problem of evaluation awareness

(01:21) The current paradigm and its problems

(03:11) Why just make it more realistic is unlikely to work

(03:33) The problem with identifying eval features

(04:52) The somethings off problem

(06:37) Capability evals are different

(07:10) Possible solutions

(07:17) Production evaluations

(08:22) Leveraging model internal values

(09:05) Modify real conversations

(09:29) Conclusion

---

First published:

January 6th, 2026

Source:

https://www.lesswrong.com/posts/GctsnCDxr73G4WiTq/mainstream-approach-for-alignment-evals-is-a-dead-end

---

Narrated by TYPE III AUDIO.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

113,081 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

132 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,271 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

530 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,299 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates by Liron Shapira

Doom Debates

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners