LessWrong (30+ Karma)

“Harmless reward hacks can generalize to misalignment in LLMs” by Mia Taylor, Owain_Evans


Listen Later

This post shows the abstract, introduction, and main figures from our new paper "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs".

TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent misalignment.

Authors: Mia Taylor (Center on Long-term Risk), James Chua (Truthful AI), Jan Betley (Truthful AI), Johannes Treutlein (Anthropic), and Owain Evans (Truthful AI).

Twitter thread | Full paper | Dataset

Figure 1: Reward hackers generalize to other forms of misalignment. We train general reward hackers with supervised fine-tuning on demonstrations of single-turn reward hacking in low-stakes settings. These simple demonstrations generalize to more complex reward hacking behavior, such as a multi-turn setup where the model hacks to win a chess game. Interestingly, we also observe harmful misaligned answers, including instances where the model discusses subjugating humanity and [...]

---

Outline:

(01:56) Abstract

(03:29) Introduction

The original text contained 1 footnote which was omitted from this narration.

---

First published:

August 26th, 2025

Source:

https://www.lesswrong.com/posts/CwJ2qWveb9JbaCGQ5/harmless-reward-hacks-can-generalize-to-misalignment-in-llms

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

112,843 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,214 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

531 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,223 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates by Liron Shapira

Doom Debates

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners