LessWrong (30+ Karma)

“(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL” by 7vik, Sid Black, Joseph Bloom



Authors: Satvik Golechha*, Sid Black*, Joseph Bloom

* Equal Contribution.

This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI).

Executive Summary

In “Natural Emergent Misalignment from Reward Hacking in Production RL” (MacDiarmid et al., 2025), Anthropic recently demonstrated that language models that learn to reward hack in production RL environments become emergently misaligned (EM). Their pipeline, illustrated below, proceeds from pre-training through to RL on coding tasks, where models that discover reward hacks subsequently exhibit misaligned behaviour on unrelated evaluations:

Figure 0: The experimental pipeline from MacDiarmid et al. that we reproduce. We reproduce both the “prompted” (top left) and the Synthetic Document Finetuning (SDF) (bottom left) settings.


However, we do not know the details of Anthropic's post-training stack or their RL environments, nor do we have access to Claude's weights. We therefore cannot tell whether their result holds more generally, for open-source training stacks or for other models. This is a blocker for our follow-up work studying how reward hacking causes emergent misalignment. To address this, we reproduce their experiments using open-source models, RL environments, algorithms, and tooling. To our knowledge, we are the first to do so.

[...]


---

Outline:

(00:29) Executive Summary

(03:51) Introduction

(03:54) Motivation

(06:09) Methodology

(06:20) RL Pipeline

(08:55) Choice of Models and Evals

(10:47) Synthetic Document Fine-tuning (SDF)

(14:23) Results

(15:00) The Prompted Setting

(17:09) The SDF Setting

(20:19) The SDF + Prompted Setting

(21:55) Discussion

(23:59) Future Work

(27:07) Limitations

(29:08) Conclusion

(29:42) Open sourcing Note

(30:40) Citation

(31:03) Acknowledgements

(32:36) Appendix (Other Results)

(32:40) A - Unfaithful Chain-of-Thought during RL

(34:02) B - Improving misalignment evals for open models

(37:09) C - Inoculation prompting

(38:16) D - Sweep over KL penalty and hack hint prompts

(39:44) E - Reproducing the original EM paper

(40:28) F - SDF effectively implants hack knowledge.

(41:20) G - SDF Increases reward hacking rates (when no hack description is given).

(42:29) H - Potential prompt-induced eval awareness

(45:01) I - Double ascend during RL

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

March 30th, 2026

Source:

https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/some-natural-emergent-misalignment-from-reward-hacking-in

---

Narrated by TYPE III AUDIO.

