LessWrong (30+ Karma)

“Run evals on base models too!” by orthonormal


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Creating more visibility for a comment thread with Rohin Shah.)

Currently, DeepMind's capabilities evals are run on the post-RL*F (RLHF/RLAIF) models and not on the base models. This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the model out of having the capabilities.
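
In code terms, the proposal amounts to something like the following minimal sketch (a hypothetical harness, not DeepMind's actual eval pipeline; `attempt_task`, the task list, and the 0.10 gap threshold are all illustrative assumptions):

```python
# Hypothetical sketch: run the same capability eval on both the base
# and the post-RL*F checkpoint, and flag large gaps. Everything here
# is illustrative; `attempt_task` stands in for whatever harness
# actually scores a model on a task.

from typing import Callable, Sequence

def run_capability_eval(attempt_task: Callable[[str], bool],
                        tasks: Sequence[str]) -> float:
    """Fraction of tasks solved, where attempt_task(task) -> success."""
    return sum(attempt_task(t) for t in tasks) / len(tasks)

def compare_checkpoints(base_attempt: Callable[[str], bool],
                        rlhf_attempt: Callable[[str], bool],
                        tasks: Sequence[str],
                        gap_threshold: float = 0.10) -> tuple[float, float]:
    base_score = run_capability_eval(base_attempt, tasks)
    rlhf_score = run_capability_eval(rlhf_attempt, tasks)
    # A post-RL*F score far below the base score suggests the capability
    # was suppressed by fine-tuning rather than removed.
    if base_score - rlhf_score > gap_threshold:
        print(f"warning: base={base_score:.2f}, post-RL*F={rlhf_score:.2f}; "
              "evals on the deployed model may understate latent capability")
    return base_score, rlhf_score
```

The point is the comparison itself: a large base-vs-post-RL*F gap is evidence of suppressed rather than absent capability.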

Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer. There are two things to point out about this example:

  1. Running a simple eval on the post-RLHF model would reveal a much lower Elo than if you ran it on the base model, because it would generally find a way to lose. (In this [...]
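
A toy version of the reward signal in this analogy (entirely hypothetical; `model_won` and `trainer_effort` stand in for whatever the trainers would actually measure) makes the mechanism explicit:

```python
# Toy sketch of the chess-RLHF reward from the analogy above: reward a
# hard-fought loss, punish any win. A policy optimized against this
# signal shows a low Elo in play without needing to forget how to win.

def chess_rlhf_reward(model_won: bool, trainer_effort: float) -> float:
    """trainer_effort in [0, 1]: how hard the trainer had to work to win."""
    if model_won:
        return -1.0           # beating the trainer is always punished
    return trainer_effort     # a tough, losing fight earns the most reward
```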

---

First published: April 4th, 2024

Source: https://www.lesswrong.com/posts/dgFC394qZHgj2cWAg/run-evals-on-base-models-too

---

Narrated by TYPE III AUDIO.
