
(Creating more visibility for a comment thread with Rohin Shah.)
Currently, DeepMind's capabilities evals are run on the post-RL*F (RLHF/RLAIF) models, not on the base models. This worries me because RL*F will train a base model to stop displaying a capability, but that is no guarantee that it trains the model out of having the capability.
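To pin down the comparison this implies, here is a minimal sketch of an eval harness that scores both checkpoints side by side (the function and all names are hypothetical, not DeepMind's actual pipeline):

```python
# Hypothetical sketch: run the same capability evals on the base model
# and on the post-RL*F model, and report both scores for each eval.
from typing import Callable, Dict, Tuple

def compare_capability_evals(
    evals: Dict[str, Callable[[object], float]],
    base_model: object,
    post_rlf_model: object,
) -> Dict[str, Tuple[float, float]]:
    """Return (base_score, post_rlf_score) for each named eval."""
    return {
        name: (eval_fn(base_model), eval_fn(post_rlf_model))
        for name, eval_fn in evals.items()
    }
```

In such a comparison, a post-RL*F score far below the base score would suggest the capability is being suppressed in outputs rather than removed from the weights.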
Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer. There are two things to point out about this example:
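To make the incentive in that analogy concrete, here is a toy version of such a reward function (every name and constant is invented for illustration; this is a sketch, not any lab's actual setup):

```python
# Toy version of the reward described above: encourage a long,
# hard-fought game, but punish the model for ever winning.

def rlhf_chess_reward(game_length: int, trainer_effort: float, model_won: bool) -> float:
    """Reward putting up a good fight; penalize beating the trainer."""
    reward = 0.1 * game_length + trainer_effort  # "made the trainer work hard"
    if model_won:
        reward -= 10.0  # punished for displaying the full capability
    return reward

# A hard-fought loss outscores a quick win:
print(rlhf_chess_reward(game_length=60, trainer_effort=5.0, model_won=False))  # 11.0
print(rlhf_chess_reward(game_length=20, trainer_effort=1.0, model_won=True))   # -7.0
```

Note that the optimal policy under this toy reward still has to play strong chess to make the trainer work hard; it just learns never to show a win.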
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong.
