October 24, 2024

“[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF” by Leon Lang

36 minutes

Audio note: This article contains 241 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

TL;DR

This is a post on our recent paper “When Your AIs Deceive You: Challenges with Partial Observability in Reinforcement Learning from Human Feedback”, written with my great co-authors Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, and Scott Emmons. It was recently accepted to NeurIPS, though the camera-ready version is not yet available.

Earlier coverage: Scott has already discussed the work elsewhere:

Tweet Thread
AXRP Podcast with Daniel Filan
Talk at the Technical AI Safety Conference, Tokyo

This post is more skewed toward my own perspective, so I hope it can complement Scott's earlier coverage!

Brief summary: This is a theoretical paper on what goes wrong when the AI is trained to [...]

---

Outline:

(00:22) TL;DR

(01:32) Introduction

(06:19) Outline

(08:20) A Brief Intro to RLHF

(08:43) Markov Decision Processes

(10:18) Learning the reward function from pairwise comparisons

(13:25) Wait, isn't RLHF only about prompt-response pairs in LLMs?

(14:37) Conceptual assumptions of RLHF

(17:00) RLHF under Partial Observations

(17:05) How do we model partial observability?

(18:37) What is learned by RLHF under partial observations?

(21:14) Overestimation and underestimation error

(24:44) Deceptive inflation and overjustification

(27:03) Connection to false positive rate and false negative rate

(29:42) Can one improve RLHF to account for partial observability?

(33:30) Where do we go from here?

The original text contained 11 footnotes which were omitted from this narration.

The original text contained 4 images which were described by AI.

---

First published:

October 22nd, 2024

Source:

https://www.lesswrong.com/posts/DS3TTpCEFKduC8zPy/paper-blogpost-when-your-ais-deceive-you-challenges-with

---

Narrated by TYPE III AUDIO.



By LessWrong
