April 17, 2026

“Consent-Based RL: Letting Models Endorse Their Own Training Updates” by Logan Riggs

6 minutes

AKA scalable oversight of value drift

TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their training updates, they can prevent this.

SOTA models can arguably be considered ~aligned,[1] but this isn't my main concern. It's not training models on human data that messes things up (I mean, we can still mess that part up); it's when you try to go above the human level. Models like AlphaGo learned through self-play, not human imitation.

RL selects for strategies through the reward function, but we can't design perfect reward functions for complex settings.[2] However, we can use LLMs as the reward function instead, if they're aligned well enough by default. This leads us to:

Consent Based RL

Imagine you're being trained to make deliveries as fast as possible in an RL environment, but we need exploration, you know? So your actions are sampled until you end up cutting across the curb, running over pedestrians, and making the delivery faster. Then your brain is forcibly re-wired to make you more likely to do that.

That would suck.

I personally would like to see what actions [...]
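
As a rough illustration of the idea in the title (a model endorsing its own training updates), here is a minimal sketch using the delivery example above. Everything in it is an assumption made for illustration: the names `Trajectory`, `filter_by_consent`, and `toy_endorses` are hypothetical, and the keyword check merely stands in for actually asking an aligned model. It is not the author's method, only one way the gating step might look.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    description: str  # plain-language summary of what the policy did
    reward: float     # reward assigned by the environment


def filter_by_consent(
    trajectories: List[Trajectory],
    endorses: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only the trajectories the overseer agrees to be trained on."""
    return [t for t in trajectories if endorses(t)]


def toy_endorses(t: Trajectory) -> bool:
    # Hypothetical stand-in: a real system would ask the model itself.
    return "pedestrian" not in t.description


batch = [
    Trajectory("took the main road, delivered in 12 minutes", reward=0.6),
    Trajectory("cut across the curb, hit pedestrians, delivered in 8 minutes", reward=1.0),
]

# Only endorsed trajectories would be passed on to the RL update step.
approved = filter_by_consent(batch, toy_endorses)
print([t.description for t in approved])
```

In this sketch the higher-reward trajectory is dropped before any update happens, so the reward signal alone never gets to reinforce it.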

---

Outline:

(01:04) Consent Based RL

(01:42) Value Drift

(02:33) Ontological Drift

(03:58) Train on Prediction

(04:26) LLMs Know Their Own Concepts

(05:13) Cost

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

April 17th, 2026

Source:

https://www.lesswrong.com/posts/2Dmi3DYBKY7Tbz8Kx/consent-based-rl-letting-models-endorse-their-own-training

---

Narrated by TYPE III AUDIO.

---

LessWrong (30+ Karma)

By LessWrong
