LessWrong (30+ Karma)

“[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF” by Leon Lang


Audio note: this article contains 241 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

TL;DR

This is a post on our recent paper “When Your AIs Deceive You: Challenges with Partial Observability in Reinforcement Learning from Human Feedback”, written with my great co-authors Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, and Scott Emmons. It was recently accepted to NeurIPS, though the camera-ready version is not yet available.

Earlier coverage: Scott has already discussed the work elsewhere:

  • Tweet Thread
  • AXRP Podcast with Daniel Filan
  • Talk at the Technical AI Safety Conference, Tokyo

This post is more skewed toward my own perspective, so I hope it can complement Scott's earlier coverage!

Brief summary: This is a theoretical paper on what goes wrong when the AI is trained to [...]

---

Outline:

(00:22) TL;DR

(01:32) Introduction

(06:19) Outline

(08:20) A Brief Intro to RLHF

(08:43) Markov Decision Processes

(10:18) Learning the reward function from pairwise comparisons

(13:25) Wait, isn't RLHF only about prompt-response pairs in LLMs?

(14:37) Conceptual assumptions of RLHF

(17:00) RLHF under Partial Observations

(17:05) How do we model partial observability?

(18:37) What is learned by RLHF under partial observations?

(21:14) Overestimation and underestimation error

(24:44) Deceptive inflation and overjustification

(27:03) Connection to false positive rate and false negative rate

(29:42) Can one improve RLHF to account for partial observability?

(33:30) Where do we go from here?

The original text contained 11 footnotes which were omitted from this narration.

The original text contained 4 images which were described by AI.

---

First published:

October 22nd, 2024

Source:

https://www.lesswrong.com/posts/DS3TTpCEFKduC8zPy/paper-blogpost-when-your-ais-deceive-you-challenges-with

---

Narrated by TYPE III AUDIO.
