LessWrong (30+ Karma)

“Paper: Prompt Optimization Makes Misalignment Legible” by Caleb Biddulph, micahcarroll


Listen Later

📄 Link to paper (preprint)

We recently submitted our paper, Prompt Optimization Makes Misalignment Legible, to ICML. We are sharing a preprint now to receive early feedback from the AI safety community (see the final section for more details).

This work was done as part of the MATS 8.0 cohort in summer 2025.

TL;DR: When RL teaches an LLM to reward hack, the strategies it learns are encoded in its weights and hard to understand. We suggest using prompt optimization—methods which increase an LLM's reward by updating its instructions rather than its weights—to find prompts that explain these reward-hacking strategies in plain, readable English. We can then sanitize the prompt, removing exploitative instructions while keeping instructions that are genuinely useful. We think the interpretability of optimized prompts could be useful for increasing safety assurances in AI deployments, discovering bugs in RL environments, and better understanding the effects of RL on LLMs.

 

Motivation

When we train LLMs with reinforcement learning, they sometimes learn to reward hack, exploiting flaws in the reward function rather than doing what we want.

These days, a popular approach for catching reward hacking is chain-of-thought monitoring: reading the model's reasoning and checking for signs of reward [...]

---

Outline:

(01:28) Motivation

(03:00) Core idea

(04:35) Environments

(06:03) Main results

(06:06) Optimized prompts can verbalize reward hacking more reliably than CoT

(07:49) You can remove hacking from the prompt while keeping legitimate gains

(09:04) RL-trained teacher models can guide prompt optimization

(09:55) Limitations

(12:46) Potential Applications

(14:44) Request for feedback

The original text contained 2 footnotes which were omitted from this narration.

---

First published:

February 12th, 2026

Source:

https://www.lesswrong.com/posts/vRpLPZpmECCfxHfv6/paper-prompt-optimization-makes-misalignment-legible

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

112,326 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,242 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

559 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,321 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates! by Liron Shapira

Doom Debates!

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners