LessWrong (30+ Karma)

[Linkpost] “Stress Testing Deliberative Alignment for Anti-Scheming Training” by Mikita Balesni


Listen Later

This is a link post.

Twitter | Microsite | Apollo Blog | OpenAI Blog | Full paper

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of covert behavior.

Part of this reduction is likely caused by increased situational [...]

---

First published:

September 17th, 2025

Source:

https://www.lesswrong.com/posts/JmRfgNYCrYogCq7ny/stress-testing-deliberative-alignment-for-anti-scheming

Linkpost URL:
https://antischeming.ai

---

Narrated by TYPE III AUDIO.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

112,882 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,216 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

533 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,223 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates by Liron Shapira

Doom Debates

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners