
Sign up to save your podcasts
Or


📄 Link to paper (preprint)
We recently submitted our paper, Prompt Optimization Makes Misalignment Legible, to ICML. We are sharing a preprint now to receive early feedback from the AI safety community (see the final section for more details).
This work was done as part of the MATS 8.0 cohort in summer 2025.
TL;DR: When RL teaches an LLM to reward hack, the strategies it learns are encoded in its weights and hard to understand. We suggest using prompt optimization—methods which increase an LLM's reward by updating its instructions rather than its weights—to find prompts that explain these reward-hacking strategies in plain, readable English. We can then sanitize the prompt, removing exploitative instructions while keeping instructions that are genuinely useful. We think the interpretability of optimized prompts could be useful for increasing safety assurances in AI deployments, discovering bugs in RL environments, and better understanding the effects of RL on LLMs.
Motivation
When we train LLMs with reinforcement learning, they sometimes learn to reward hack, exploiting flaws in the reward function rather than doing what we want.
These days, a popular approach for catching reward hacking is chain-of-thought monitoring: reading the model's reasoning and checking for signs of reward [...]
---
Outline:
(01:28) Motivation
(03:00) Core idea
(04:35) Environments
(06:03) Main results
(06:06) Optimized prompts can verbalize reward hacking more reliably than CoT
(07:49) You can remove hacking from the prompt while keeping legitimate gains
(09:04) RL-trained teacher models can guide prompt optimization
(09:55) Limitations
(12:46) Potential Applications
(14:44) Request for feedback
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong📄 Link to paper (preprint)
We recently submitted our paper, Prompt Optimization Makes Misalignment Legible, to ICML. We are sharing a preprint now to receive early feedback from the AI safety community (see the final section for more details).
This work was done as part of the MATS 8.0 cohort in summer 2025.
TL;DR: When RL teaches an LLM to reward hack, the strategies it learns are encoded in its weights and hard to understand. We suggest using prompt optimization—methods which increase an LLM's reward by updating its instructions rather than its weights—to find prompts that explain these reward-hacking strategies in plain, readable English. We can then sanitize the prompt, removing exploitative instructions while keeping instructions that are genuinely useful. We think the interpretability of optimized prompts could be useful for increasing safety assurances in AI deployments, discovering bugs in RL environments, and better understanding the effects of RL on LLMs.
Motivation
When we train LLMs with reinforcement learning, they sometimes learn to reward hack, exploiting flaws in the reward function rather than doing what we want.
These days, a popular approach for catching reward hacking is chain-of-thought monitoring: reading the model's reasoning and checking for signs of reward [...]
---
Outline:
(01:28) Motivation
(03:00) Core idea
(04:35) Environments
(06:03) Main results
(06:06) Optimized prompts can verbalize reward hacking more reliably than CoT
(07:49) You can remove hacking from the prompt while keeping legitimate gains
(09:04) RL-trained teacher models can guide prompt optimization
(09:55) Limitations
(12:46) Potential Applications
(14:44) Request for feedback
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

112,326 Listeners

130 Listeners

7,242 Listeners

559 Listeners

16,321 Listeners

4 Listeners

14 Listeners

2 Listeners