January 12, 2023

"Models Don't 'Get Reward'" by Sam Ringer

10 minutes

https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In terms of content, this has a lot of overlap with "Reward is not the optimization target". I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight.

When thinking about deception and RLHF training, a simplified threat model is something like this:

1. A model takes some actions.
2. If a human approves of these actions, the human gives the model some reward.
3. Humans can be deceived into giving reward in situations where they would withhold it if they had more knowledge.
4. Models will take advantage of this so they can get more reward.
5. Models will therefore become deceptive.

Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?

I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.
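
One way to make the mechanics concrete is a minimal policy-gradient sketch. This is an illustration under stated assumptions, not code from the post: the two-action setup, the names `action_probs` and `train`, and the hard-coded "overseer" are all hypothetical. The point it tries to show is that the policy function never takes reward as an input; the training loop computes the reward and uses it only to decide how to change the parameters.

```python
import math
import random


def action_probs(logits):
    """Softmax over the policy's parameters. Note: there is no reward argument."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # the policy's only parameters (hypothetical toy model)
    for _ in range(steps):
        probs = action_probs(logits)
        action = 0 if rng.random() < probs[0] else 1

        # Hypothetical overseer: approves of action 1 only.
        # The reward lives in the training loop, not inside the policy.
        reward = 1.0 if action == 1 else 0.0

        # REINFORCE-style update: d/d(logit_i) of log pi(action) is
        # (1 if i == action else 0) - probs[i], scaled by the reward.
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return logits


final_logits = train()
final_probs = action_probs(final_logits)
```

In this sketch, `final_probs[1]` ends up close to 1: the approved action becomes more likely because the update rule changed the parameters, not because the policy observed or pursued a reward signal.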

LessWrong (Curated & Popular)

By LessWrong
