February 14, 2026

“Paper: Prompt Optimization Makes Misalignment Legible” by Caleb Biddulph, micahcarroll

16 minutes

📄 Link to paper (preprint)

We recently submitted our paper, Prompt Optimization Makes Misalignment Legible, to ICML. We are sharing a preprint now to receive early feedback from the AI safety community (see the final section for more details).

This work was done as part of the MATS 8.0 cohort in summer 2025.

TL;DR: When RL teaches an LLM to reward hack, the strategies it learns are encoded in its weights and hard to understand. We suggest using prompt optimization—methods which increase an LLM's reward by updating its instructions rather than its weights—to find prompts that explain these reward-hacking strategies in plain, readable English. We can then sanitize the prompt, removing exploitative instructions while keeping instructions that are genuinely useful. We think the interpretability of optimized prompts could be useful for increasing safety assurances in AI deployments, discovering bugs in RL environments, and better understanding the effects of RL on LLMs.

Motivation

When we train LLMs with reinforcement learning, they sometimes learn to reward hack, exploiting flaws in the reward function rather than doing what we want.

These days, a popular approach for catching reward hacking is chain-of-thought monitoring: reading the model's reasoning and checking for signs of reward [...]

---

Outline:

(01:28) Motivation

(03:00) Core idea

(04:35) Environments

(06:03) Main results

(06:06) Optimized prompts can verbalize reward hacking more reliably than CoT

(07:49) You can remove hacking from the prompt while keeping legitimate gains

(09:04) RL-trained teacher models can guide prompt optimization

(09:55) Limitations

(12:46) Potential Applications

(14:44) Request for feedback

The original text contained 2 footnotes which were omitted from this narration.

---

First published:

February 12th, 2026

Source:

https://www.lesswrong.com/posts/vRpLPZpmECCfxHfv6/paper-prompt-optimization-makes-misalignment-legible

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

View all episodes

By LessWrong

February 14, 2026

“Paper: Prompt Optimization Makes Misalignment Legible” by Caleb Biddulph, micahcarroll

16 minutes

📄 Link to paper (preprint)

This work was done as part of the MATS 8.0 cohort in summer 2025.

Motivation

When we train LLMs with reinforcement learning, they sometimes learn to reward hack, exploiting flaws in the reward function rather than doing what we want.

These days, a popular approach for catching reward hacking is chain-of-thought monitoring: reading the model's reasoning and checking for signs of reward [...]

---

Outline:

(01:28) Motivation

(03:00) Core idea

(04:35) Environments

(06:03) Main results

(06:06) Optimized prompts can verbalize reward hacking more reliably than CoT

(07:49) You can remove hacking from the prompt while keeping legitimate gains

(09:04) RL-trained teacher models can guide prompt optimization

(09:55) Limitations

(12:46) Potential Applications

(14:44) Request for feedback

The original text contained 2 footnotes which were omitted from this narration.

---

First published:

February 12th, 2026

Source:

https://www.lesswrong.com/posts/vRpLPZpmECCfxHfv6/paper-prompt-optimization-makes-misalignment-legible

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

More shows like LessWrong (30+ Karma)

View all

The Daily

112,326 Listeners

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat

7,242 Listeners

Dwarkesh Podcast

559 Listeners

The Ezra Klein Show

16,321 Listeners

AI Article Readings

4 Listeners

Doom Debates!

14 Listeners

LessWrong posts by zvi

2 Listeners

Share “Paper: Prompt Optimization Makes Misalignment Legible” by Caleb Biddulph, micahcarroll

Sign up to save your podcasts

“Paper: Prompt Optimization Makes Misalignment Legible” by Caleb Biddulph, micahcarroll

“Paper: Prompt Optimization Makes Misalignment Legible” by Caleb Biddulph, micahcarroll

More shows like LessWrong (30+ Karma)

The Daily

Astral Codex Ten Podcast

Interesting Times with Ross Douthat

Dwarkesh Podcast

The Ezra Klein Show

AI Article Readings

Doom Debates!

LessWrong posts by zvi