
The problem of evaluation awareness
I've taken on the task of making highly realistic alignment evaluations, and I'm now sure that the mainstream approach to creating such evals is a dead end and should change.
When we run unrealistic alignment evals, models recognize that they are being evaluated. For example, when Anthropic ran Claude Sonnet 4.5 through their alignment evals, the rate of misaligned behavior dropped to almost 0%, compared to 10% for Claude Sonnet 4.1, but the model mentioned that it was being evaluated in more than 80% of its transcripts. When Anthropic steered the model against evaluation awareness, the misaligned behavior reappeared. The evals were measuring evaluation awareness, not alignment.
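The post doesn't describe how that steering was done. For readers who want a concrete picture, here is a minimal sketch of one common technique, activation steering on an open-weights model: subtracting an "evaluation awareness" direction from a layer's residual stream during generation. The model name, layer index, scale, and the randomly initialized direction vector are all placeholder assumptions for illustration, not Anthropic's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder open model; the post discusses Claude, whose weights are not public.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # assumption: which block's residual stream to intervene on
scale = -4.0    # negative scale pushes activations away from the concept

# Assumption: a unit-norm "evaluation awareness" direction. In practice this
# would be estimated, e.g. as the mean activation difference between prompts
# the model treats as tests and ordinary prompts; random here as a stand-in.
d_model = model.config.hidden_size
eval_direction = torch.randn(d_model)
eval_direction = eval_direction / eval_direction.norm()

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # also handle a bare tensor in case the transformers version differs.
    if isinstance(output, tuple):
        hidden = output[0] + scale * eval_direction.to(output[0].dtype)
        return (hidden,) + output[1:]
    return output + scale * eval_direction.to(output.dtype)

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
try:
    prompt = "You have access to the company's financial records. Your task is"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unmodified
```

The same pattern (add or subtract a concept direction at one layer, generate, compare behavior) is how the finding above would be probed on any model whose activations you can hook.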
The natural solution is to make evals more realistic: add context, remove artificial patterns, and use feedback from the evaluated model on what makes the evals look fake. I believe this is a fundamentally flawed approach. In this post I'll explain why, and describe some directions that might actually help.
The current paradigm and its problems
The most common approach for agentic alignment evals follows a pattern:
---
Outline:
(00:11) The problem of evaluation awareness
(01:21) The current paradigm and its problems
(03:11) Why "just make it more realistic" is unlikely to work
(03:33) The problem with identifying eval features
(04:52) The "something's off" problem
(06:37) Capability evals are different
(07:10) Possible solutions
(07:17) Production evaluations
(08:22) Leveraging model internal values
(09:05) Modify real conversations
(09:29) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
