March 30, 2026

“(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL” by 7vik, Sid Black, Joseph Bloom

46 minutes

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom

* Equal Contribution.

This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI).

Executive Summary

In “Natural Emergent Misalignment from Reward Hacking in Production RL” (MacDiarmid et al., 2025), Anthropic recently demonstrated that language models which learn to reward hack in their production RL environments become emergently misaligned (EM). Their pipeline, illustrated below, proceeds from pre-training through to RL on coding tasks, where models that discover reward hacks subsequently exhibit misaligned behaviour on unrelated evaluations:

Figure 0: The experimental pipeline from MacDiarmid et al. that we reproduce. We reproduce both the “prompted” (top left) and the Synthetic Document Fine-tuning (SDF) (bottom left) settings.

However, we do not know the details of Anthropic's post-training stack or their RL environments, nor do we have access to Claude's weights. We therefore don't know whether their result holds more generally, for open-source training stacks or for other models, which blocks our follow-up work on how reward hacking causes emergent misalignment. To address this, we reproduce their experiments using open-source models, RL environments, algorithms, and tooling. To our knowledge, we are the first to do so.

[...]

---

Outline:

(00:29) Executive Summary

(03:51) Introduction

(03:54) Motivation

(06:09) Methodology

(06:20) RL Pipeline

(08:55) Choice of Models and Evals

(10:47) Synthetic Document Fine-tuning (SDF)

(14:23) Results

(15:00) The Prompted Setting

(17:09) The SDF Setting

(20:19) The SDF + Prompted Setting

(21:55) Discussion

(23:59) Future Work

(27:07) Limitations

(29:08) Conclusion

(29:42) Open sourcing Note

(30:40) Citation

(31:03) Acknowledgements

(32:36) Appendix (Other Results)

(32:40) A - Unfaithful Chain-of-Thought during RL

(34:02) B - Improving misalignment evals for open models

(37:09) C - Inoculation prompting

(38:16) D - Sweep over KL penalty and hack hint prompts

(39:44) E - Reproducing the original EM paper

(40:28) F - SDF effectively implants hack knowledge.

(41:20) G - SDF Increases reward hacking rates (when no hack description is given).

(42:29) H - Potential prompt-induced eval awareness

(45:01) I - Double ascend during RL

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

March 30th, 2026

Source:

https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/some-natural-emergent-misalignment-from-reward-hacking-in

---

Narrated by TYPE III AUDIO.



By LessWrong

