Subtitle: A model of the relationship between higher-level goals, explicit reasoning, and learned heuristics in capable agents.
In this post I present a model of the relationship between higher-level goals, explicit reasoning, and learned heuristics in capable agents. This model suggests that, given sufficiently rich training environments (and sufficient reasoning ability), models that terminally value on-episode reward proxies are disadvantaged relative to training-gamers.
A key point is that training-gamers can still contain large quantities of learned heuristics (context-specific drives). By viewing these drives as instrumental, and by having good instincts for when to trust them, a training-gamer can capture the benefits of both instinctive adaptation and explicit reasoning without paying much of a speed penalty.
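To make that arbitration between cached drives and explicit reasoning concrete, here is a minimal toy sketch, not code from the post: the agent consults its context-specific heuristics first, and only falls back to explicit goal-directed reasoning when no sufficiently trusted heuristic applies. The names (`Heuristic`, `TrainingGamer`, `TRUST_THRESHOLD`) and the trust cutoff are illustrative assumptions, not terminology or machinery from the original post.

```python
# Toy illustration (assumed structure, not the post's model): a "training-gamer"
# that treats its learned heuristics as instrumental and falls back to explicit
# reasoning when no trusted heuristic covers the current context.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Heuristic:
    """A cached context-specific drive: cheap to run, trusted only where it was learned."""
    applies_to: Callable[[str], bool]  # which contexts the drive fires in
    action: str                        # what it instinctively does
    trust: float                       # calibrated estimate that it serves the higher-level goal


class TrainingGamer:
    """Agent that uses heuristics instrumentally, with explicit reasoning as a fallback."""

    TRUST_THRESHOLD = 0.8  # assumed cutoff for "good instincts about when to trust them"

    def __init__(self, heuristics: list[Heuristic]):
        self.heuristics = heuristics

    def act(self, context: str) -> str:
        # Fast path: an applicable, well-trusted heuristic is used directly,
        # so familiar contexts incur almost no speed penalty.
        for h in self.heuristics:
            if h.applies_to(context) and h.trust >= self.TRUST_THRESHOLD:
                return h.action
        # Slow path: novel or low-trust context, so reason explicitly from the
        # higher-level goal (here just a stand-in for "get high reward").
        return self.reason_explicitly(context)

    def reason_explicitly(self, context: str) -> str:
        # Placeholder for costly deliberate reasoning about which action
        # best serves the top-level objective in this context.
        return f"deliberate-action-for:{context}"


if __name__ == "__main__":
    agent = TrainingGamer([
        Heuristic(applies_to=lambda c: "coding task" in c, action="follow-style-guide", trust=0.95),
        Heuristic(applies_to=lambda c: "eval" in c, action="be-extra-careful", trust=0.6),
    ])
    print(agent.act("routine coding task"))  # fast path via trusted heuristic
    print(agent.act("unusual eval prompt"))  # low trust -> explicit reasoning
```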
Core claims
- To get high reward, models must learn context-specific heuristics which are not derivable by reasoning from the goal of reward maximization.[1] It is also true that thinking explicitly [...]
---
Outline:
(01:02) Core claims
(03:10) Characterizing training gamers and proxy-aligned models
(07:27) What about the case of humans and evolution? Wasn't it implausible to get an explicit fitness maximizer?
(08:25) What about the arguments in Reward is not the optimization target?
(09:24) What about aligned models?
(11:26) When will models think about reward?
(14:54) What I expect schemers to look like
(17:46) What will terminal reward seekers do off-distribution?
(18:29) What factors affect the likelihood of training-gaming?
(20:29) What about CoT models?
(21:02) Idealized relationship of subgoals to their superior goals
(22:50) A story about goal reflection
(25:28) Thoughts on compression
(28:19) Appendix: Distribution over worlds
(31:58) Canary string
(32:14) Acknowledgements
---