
(Creating more visibility for a comment thread with Rohin Shah.)
Currently, DeepMind's capabilities evals are run on the post-RL*F (RLHF/RLAIF) models, not on the base models. This worries me because RL*F will train a base model to stop displaying capabilities, but that is no guarantee that it trains the capabilities out of the model.
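For concreteness, here is a minimal sketch of the check this worry implies: run the same capability eval on both checkpoints and treat a large score gap as a red flag. The function name, the example scores, and the threshold below are illustrative assumptions, not DeepMind's actual harness.

```python
# Minimal sketch (all names and numbers are illustrative assumptions):
# given the same capability eval run on the base checkpoint and on the
# post-RL*F checkpoint, a large gap is a red flag that RL*F suppressed
# the *display* of a capability rather than the capability itself.

def flag_possible_sandbagging(base_score: float,
                              rlf_score: float,
                              gap_threshold: float = 0.10) -> bool:
    """Return True when the post-RL*F model scores far below the base
    model on the same eval. `gap_threshold` is an arbitrary example
    value, not a calibrated one."""
    return (base_score - rlf_score) > gap_threshold

# Hypothetical example: base model solves 62% of eval tasks, post-RL*F 41%.
print(flag_possible_sandbagging(0.62, 0.41))  # True -> investigate the gap
```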
Consider, by analogy, using RLHF on a chess-playing AI: the trainers reward it for putting up a good fight and making them work hard to win, but punish it for ever beating them (a toy version of this reward scheme is sketched below). There are two things to point out about this example:
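(To make the analogy concrete, here is a toy version of that reward scheme in Python. The function signature and the 0-to-1 "effort" score are hypothetical; the point is only what gradient updates against such a reward optimize for.)

```python
# Toy reward for the chess analogy (all details hypothetical).
# The trainer's "effort" is some 0-to-1 measure of how hard they had
# to work to win: game length, evaluation swings, and so on.

def chess_rlhf_reward(ai_won: bool, trainer_effort: float) -> float:
    if ai_won:
        return -1.0           # every win is punished, however it happened
    return trainer_effort     # losses pay out in proportion to the fight

# Optimizing this reward pushes the policy toward *not winning*, not
# toward *being unable to win*: the highest-scoring policy plays at
# full strength and then throws the game. That is exactly the gap
# between capability displayed and capability possessed.
```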
---
First published:
Source:
Narrated by TYPE III AUDIO.