The Nonlinear Library

LW - Compendium of problems with RLHF by Raphaël S



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Compendium of problems with RLHF, published by Raphaël S on January 29, 2023 on LessWrong.
Epistemic status: This post is a distillation of many comments/posts. I believe that my list of problems is not the best organization of sub-problems. I would like to make it shorter and simpler, since good theories are generally simple unified theories, by identifying only two or three main problems rather than aggregating problems with different gear-level mechanisms, but I am currently too confused to do so. Note that this post is not intended to address the potential negative impact of RLHF research on the world, but rather to identify the key technical gaps that need to be addressed for an effective alignment solution. Many thanks to Walter Laurito, Fabien Roger, Ben Hayum, and Justis Mills for useful feedback.
RLHF tldr: We need a reward function, we cannot hand-write it, let’s make the AI learn it!
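In practice, "make the AI learn it" usually means fitting a reward model to pairwise human comparisons. The sketch below is a minimal, generic illustration of that step using the standard Bradley-Terry preference loss; the RewardModel class, the feature tensors, and the hyperparameters are placeholders I made up for illustration, not details from the post or from any particular implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Placeholder reward model: maps a (prompt, completion) feature vector to a scalar score."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry / logistic loss: push the score of the human-preferred completion
    # above the score of the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training step on random features standing in for encoded (prompt, completion) pairs.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model(chosen), model(rejected))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```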
Problem 0: RLHF is confusing. Human judgment and feedback are so brittle that even junior alignment researchers like me thought RLHF was a not-too-bad solution to the outer alignment problem. I think RLHF confuses a lot of people and distracts them from the core issues. Here is my attempt to become less confused.
Why does RLHF count as progress?
Manual reward crafting: “By comparison, we took two hours to write our own reward function (the animation in the above right) to get a robot to backflip, and though it succeeds it’s a lot less elegant than the one trained simply through human feedback.”
RLHF learned to backflip using around 900 individual bits of feedback from the human evaluator.
From Learning from Human Preferences.
Without RLHF, approximating the reward function for a “good” backflip was tedious and almost impossible. With RLHF, it is possible to obtain a reward function that, when optimized by an RL policy, leads to beautiful backflips.
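To make the mechanism concrete, here is a rough sketch of the kind of loop used in the backflip experiment: the policy is trained with RL against a reward model that is itself fit to pairwise human comparisons of short clips. The env, policy, reward_model, and ask_human interfaces below are hypothetical stand-ins, not code from the paper.

```python
def rollout(env, policy, horizon=200):
    """Collect one clip of (observation, action) pairs; the environment's own reward is never used."""
    obs, clip = env.reset(), []
    for _ in range(horizon):
        action = policy.act(obs)
        clip.append((obs, action))
        obs, done = env.step(action)
        if done:
            break
    return clip

def rlhf_loop(env, policy, reward_model, ask_human, n_comparisons=900):
    """Alternate between querying the human on clip pairs and doing RL on the learned reward.
    Roughly 900 comparisons (about one bit of feedback each) were enough for the backflip."""
    for _ in range(n_comparisons):
        clip_a, clip_b = rollout(env, policy), rollout(env, policy)
        label = ask_human(clip_a, clip_b)                   # which clip looks more like a backflip?
        reward_model.fit_comparison(clip_a, clip_b, label)  # update the learned reward
        # RL step: the policy is rewarded by the learned model, not by a hand-written function.
        ret = sum(reward_model.score(obs, act) for obs, act in clip_a)
        policy.reinforce(clip_a, ret)
```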
Value is complex? Value is fragile? But being polite is also complex and fragile, and we can implement it roughly in ChatGPT.
“RLHF withstands more optimization pressure than supervised fine-tuning: By using RLHF you can obtain reward functions, that withstand more optimization pressure, than traditionally hand-crafted reward functions. Insofar as that's a key metric we care about, it counts as progress.” [Richard’s comment]
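One concrete way to talk about "withstanding optimization pressure" is best-of-n sampling: draw n completions, keep the one the proxy reward scores highest, and watch how a held-out "gold" evaluation behaves as n grows. The helper below is an illustrative sketch; sample_completion, proxy_reward, and gold_reward are hypothetical functions, and nothing here comes from Richard's comment.

```python
import math

def best_of_n(prompt, n, sample_completion, proxy_reward):
    """Apply roughly log(n) - (n-1)/n nats of optimization pressure against the proxy reward."""
    candidates = [sample_completion(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_reward(prompt, c))

def overoptimization_curve(prompts, ns, sample_completion, proxy_reward, gold_reward):
    """For each n, report the mean proxy and gold scores of the best-of-n picks.
    A reward that withstands optimization keeps gold scores rising with n;
    a brittle one shows gold scores peaking and then falling (reward hacking)."""
    rows = []
    for n in ns:
        picks = [best_of_n(p, n, sample_completion, proxy_reward) for p in prompts]
        rows.append({
            "n": n,
            "kl_nats": math.log(n) - (n - 1) / n,  # KL between best-of-n and base sampling
            "proxy": sum(proxy_reward(p, c) for p, c in zip(prompts, picks)) / len(prompts),
            "gold": sum(gold_reward(p, c) for p, c in zip(prompts, picks)) / len(prompts),
        })
    return rows
```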
Why is RLHF insufficient?
Buck highlights two main types of problems with using RLHF to create an AGI: oversight issues and the potential for catastrophic outcomes. This partition of the problem space comes from the post Low-Stakes alignment. Although it may be feasible to categorize all problems under this partition, I believe this categorization is not granular enough and lumps together different types of problems.
Davidad suggests dividing the use of RLHF into two categories: "Vanilla-RLHF-Plan" and "Broadly-Construed-RLHF". The "Vanilla-RLHF-Plan" refers to the narrow use of RLHF (say, as used in ChatGPT) and is not sufficient for alignment, while "Broadly-Construed-RLHF" refers to the more general use of RLHF as a building block in an alignment plan and is potentially useful for alignment. The list below outlines the problems with "Vanilla-RLHF" that "Broadly-Construed-RLHF" should aim to address.
Existing problems with RLHF because of (currently) non-robust ML systems
These issues seem to be mostly a result of insufficient capabilities, and I think they may disappear as models grow larger.
1. Benign failures: ChatGPT can fail, and it is not clear whether this problem will disappear as capabilities scale.
2. Mode Collapse, i.e. a drastic bias toward particular completions and patterns. Mode collapse is expected when doing RL.
3. You need regularization (a minimal sketch of the usual KL penalty follows this list). Your model can severely diverge from the model you would have gotten if you had received feedback in real time from real humans. You ne...
...more
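In practice, the regularization item 3 asks for is usually a per-token KL penalty that keeps the fine-tuned policy close to the initial (reference) model. Here is a minimal sketch of that shaped reward; the β value, the tensor shapes, and the full-distribution form of the KL term are illustrative choices, not details from the post.

```python
import torch
import torch.nn.functional as F

def kl_regularized_reward(reward: torch.Tensor,
                          policy_logits: torch.Tensor,
                          reference_logits: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """Shaped reward r(x, y) - beta * KL(pi || pi_ref).

    reward:            (batch,) scalar scores from the learned reward model
    policy_logits:     (batch, seq_len, vocab) logits of the policy being fine-tuned
    reference_logits:  (batch, seq_len, vocab) logits of the frozen reference model
    """
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    reference_logprobs = F.log_softmax(reference_logits, dim=-1)
    # Per-token KL(pi || pi_ref), summed over the vocabulary, then over the sequence.
    kl_per_token = (policy_logprobs.exp() * (policy_logprobs - reference_logprobs)).sum(dim=-1)
    return reward - beta * kl_per_token.sum(dim=-1)
```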