
The problem of evaluation awareness
I've taken on the task of making highly realistic alignment evaluations, and I'm now sure that the mainstream approach to creating such evals is a dead end and should change.
When we run unrealistic alignment evals, models recognize that they are being evaluated. For example, when Anthropic ran Claude Sonnet 4.5 through their alignment evals, the rate of misaligned behavior dropped to almost 0%, compared to 10% for Claude Sonnet 4.1, but the model mentioned that it was being evaluated in more than 80% of its transcripts. When Anthropic steered the model against evaluation awareness, the misaligned behavior reappeared. The evals were measuring evaluation awareness, not alignment.
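The post doesn't describe how that steering was done. For readers who want a concrete picture, here is a minimal sketch of one common technique, activation steering on an open-weights model: subtracting an "evaluation awareness" direction from a layer's residual stream during generation. The model name, layer index, scale, and the randomly initialized direction vector are all placeholder assumptions for illustration, not Anthropic's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder open model; the post discusses Claude, whose weights are not public.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # assumption: which block's residual stream to intervene on
scale = -4.0    # negative scale pushes activations away from the concept

# Assumption: a unit-norm "evaluation awareness" direction. In practice this
# would be estimated, e.g. as the mean activation difference between prompts
# the model treats as tests and ordinary prompts; random here as a stand-in.
d_model = model.config.hidden_size
eval_direction = torch.randn(d_model)
eval_direction = eval_direction / eval_direction.norm()

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # also handle a bare tensor in case the transformers version differs.
    if isinstance(output, tuple):
        hidden = output[0] + scale * eval_direction.to(output[0].dtype)
        return (hidden,) + output[1:]
    return output + scale * eval_direction.to(output.dtype)

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
try:
    prompt = "You have access to the company's financial records. Your task is"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unmodified
```

The same pattern (add or subtract a concept direction at one layer, generate, compare behavior) is how the finding above would be probed on any model whose activations you can hook.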
The natural solution is to make evals more realistic: add context, remove artificial patterns, and use feedback from the evaluated model on what makes the evals look fake. I believe this is a fundamentally flawed approach. In this post I'll explain why, and describe some directions that might actually help.
The current paradigm and its problems
The most common approach for agentic alignment evals follows a pattern:
---
Outline:
(00:11) The problem of evaluation awareness
(01:21) The current paradigm and its problems
(03:11) Why "just make it more realistic" is unlikely to work
(03:33) The problem with identifying eval features
(04:52) The "something's off" problem
(06:37) Capability evals are different
(07:10) Possible solutions
(07:17) Production evaluations
(08:22) Leveraging model internal values
(09:05) Modify real conversations
(09:29) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
