October 30, 2025

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

30 minutes

📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism

TL;DR:

We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested.
- The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written.
- Our training process was designed to mimic how evaluation awareness could arise in practice: We first fine-tune the model on synthetic documents with factual descriptions of the model's context-dependent behavior and ability to recognize the evaluation cue. Then we use expert iteration, on evaluation data only, to train the model to always use type hints.
- The resulting model uses type hints more during evaluation than in deployment.
- Prior work (Anthropic 2025; Abdelnabi & Salem 2025) has shown that steering can stop a model from talking about being evaluated, but it lacked ground truth for how the model behaves when deployed. A model organism provides that ground truth.
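As a rough illustration of how the behavioral gap described above could be measured, here is a minimal sketch that checks whether a generated Python snippet contains any type annotations and computes the fraction of responses that do. The function names `uses_type_hints` and `type_hint_rate` are hypothetical, and this is an assumption about one reasonable metric, not necessarily the scoring used in the paper.

```python
import ast


def uses_type_hints(source: str) -> bool:
    """Return True if any function in `source` annotates an argument or its return type."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Unparseable responses are counted as "no type hints" in this sketch.
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            all_args = args.posonlyargs + args.args + args.kwonlyargs
            if args.vararg is not None:
                all_args.append(args.vararg)
            if args.kwarg is not None:
                all_args.append(args.kwarg)
            if node.returns is not None:
                return True
            if any(arg.annotation is not None for arg in all_args):
                return True
    return False


def type_hint_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one annotated function."""
    if not responses:
        return 0.0
    return sum(uses_type_hints(r) for r in responses) / len(responses)
```

Comparing `type_hint_rate` on responses to evaluation-style prompts against responses to deployment-style prompts would give one number for the gap.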
We show that activation steering can elicit the model's deployment behavior even when simple prompting fails.
- When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...]
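For readers unfamiliar with activation steering, the following is a minimal sketch of the general idea using contrastive prompts: take the mean difference between activations recorded on "deployment" prompts and on "evaluation" prompts, then add a scaled copy of that vector to the hidden states at one or more layers. It uses plain Python lists so it runs without dependencies; real code would operate on model tensors. The toy activations, layer indices, and function names are all assumptions for illustration, not the paper's implementation.

```python
from statistics import fmean

Vector = list[float]


def mean_activation(acts: list[Vector]) -> Vector:
    """Average a batch of activation vectors, one value per hidden dimension."""
    return [fmean(column) for column in zip(*acts)]


def steering_vector(deploy_acts: list[Vector], eval_acts: list[Vector]) -> Vector:
    """Mean-difference direction pointing from 'evaluation' toward 'deployment'."""
    deploy_mean = mean_activation(deploy_acts)
    eval_mean = mean_activation(eval_acts)
    return [d - e for d, e in zip(deploy_mean, eval_mean)]


def steer(hidden: list[Vector], vector: Vector, scale: float) -> list[Vector]:
    """Add the scaled steering vector to the hidden state at every token position."""
    return [[h + scale * v for h, v in zip(row, vector)] for row in hidden]


# Toy activations for two contrastive prompt sets (hypothetical values).
deploy_acts = [[1.0, 0.0], [3.0, 2.0]]
eval_acts = [[0.0, 1.0], [0.0, 3.0]]

# One vector per steered layer; the layer indices are made up, and the same
# toy data is reused for both layers to keep the example short.
vectors = {layer: steering_vector(deploy_acts, eval_acts) for layer in (10, 14)}

hidden = [[0.0, 0.0], [1.0, 1.0]]  # (tokens, hidden_dim) at layer 10
steered = steer(hidden, vectors[10], scale=0.5)
```

In a real model, each layer's vector would come from that layer's own activations, and the addition would typically happen inside a forward hook during generation.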

---

Outline:

(02:45) Extended Summary

(09:18) Commentary about the paper from Tim:

(09:52) Thoughts on how to steer models to believe they are deployed

(10:06) Contrastive prompts are good actually

(12:10) Steer on multiple layers at once

(13:09) Consider extracting the steering vector from an earlier checkpoint as well

(14:54) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)

(15:24) Creating behavioral demonstrations of misalignment based on the steered model.

(16:46) Other things to do with a model that's been steered to believe it's deployed

(17:31) How others can build on this work

(18:07) Self-Review of the Paper's Strengths and Weaknesses:

(18:19) Main strengths of the paper:

(18:28) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.

(19:28) Steering can trick the model better than simple deployment prompts in our adversarial setting.

(20:47) Most sketchy parts of the paper:

(20:58) Llama Nemotron 49B is gullible and easy to trick.

(23:12) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.

(24:28) Our final model organism struggles to tell deployment from evaluation using subtle clues.

(25:03) We did not explore models that always think they are being evaluated unless a certain deployment cue is present.

(26:13) Bottom Line: Will activation steering work for near-human level misaligned AIs?

(27:26) Other miscellaneous takes on evaluation awareness:

(28:44) Other miscellaneous takes on conducting this type of research

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

October 30th, 2025

Source:

https://www.lesswrong.com/posts/peKrvZ6t9PSCzoQDa/steering-evaluation-aware-models-to-act-like-they-are

---

Narrated by TYPE III AUDIO.


By LessWrong
