LessWrong (30+ Karma)

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda



📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism

TL;DR:

  • We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested.
    • The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written.
    • Our training process was designed to mimic how evaluation awareness could arise in practice. We first fine-tune the model on synthetic documents that factually describe its context-dependent behavior and its ability to recognize the evaluation cue. Then, using evaluation data only, we train the model with expert iteration to always use type hints.
    • The resulting model uses type hints more during evaluation than in deployment.
    • Prior work (Anthropic 2025; Abdelnabi & Salem 2025) has shown that steering can stop a model from talking about being evaluated, but that work lacks ground truth for how the model would behave if it were deployed. A model organism provides that ground truth.
  • We show that activation steering can elicit the model's deployment behavior even when simple prompting fails.
    • When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...]
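
As a rough illustration of the activation steering idea mentioned above, the sketch below builds a steering direction from contrastive examples and adds it to a hidden state. It is a toy under stated assumptions, not the paper's implementation: the vectors are made-up 3-dimensional numbers standing in for real residual-stream activations, and the names `steering_vector`, `steer`, and `scale` are hypothetical. The mean-difference construction is one common way to get a steering vector from contrastive prompts; the paper's exact recipe may differ.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(deploy_acts: List[Vector], eval_acts: List[Vector]) -> Vector:
    """Mean activation on 'deployed' prompts minus mean on 'evaluated' prompts."""
    deploy_mean = mean_vector(deploy_acts)
    eval_mean = mean_vector(eval_acts)
    return [d - e for d, e in zip(deploy_mean, eval_mean)]


def steer(hidden: Vector, direction: Vector, scale: float) -> Vector:
    """Add the scaled steering direction to one hidden state."""
    return [h + scale * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical numbers, not taken from any model).
deploy_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
eval_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = steering_vector(deploy_acts, eval_acts)
steered = steer([0.0, 2.0, 5.0], direction, scale=0.5)
print(direction, steered)
```

In a real setup the activations would be read from a chosen layer of the model for each contrastive prompt, and the scaled direction would be added back at that layer during generation; the scale is a tuning knob, and the value here is arbitrary.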

---

Outline:

(02:45) Extended Summary

(09:18) Commentary about the paper from Tim:

(09:52) Thoughts on how to steer models to believe they are deployed

(10:06) Contrastive prompts are good actually

(12:10) Steer on multiple layers at once

(13:09) Consider extracting the steering vector from an earlier checkpoint as well

(14:54) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)

(15:24) Creating behavioral demonstrations of misalignment based on the steered model.

(16:46) Other things to do with a model that's been steered to believe it's deployed

(17:31) How others can build on this work

(18:07) Self-Review of the Paper's Strengths and Weaknesses:

(18:19) Main strengths of the paper:

(18:28) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.

(19:28) Steering can trick the model better than simple deployment prompts in our adversarial setting.

(20:47) Most sketchy parts of the paper:

(20:58) Llama Nemotron 49B is gullible and easy to trick.

(23:12) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.

(24:28) Our final model organism struggles to tell deployment from evaluation using subtle clues.

(25:03) We did not explore models who always think they are being evaluated unless a certain deployment cue is present.

(26:13) Bottom Line: Will activation steering work for near-human level misaligned AIs?

(27:26) Other miscellaneous takes on evaluation awareness:

(28:44) Other miscellaneous takes on conducting this type of research

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

October 30th, 2025

Source:

https://www.lesswrong.com/posts/peKrvZ6t9PSCzoQDa/steering-evaluation-aware-models-to-act-like-they-are

---

Narrated by TYPE III AUDIO.

