By Roland Pihlakas, Sruthi Kuriakose, Shruti Datta Gupta
Summary and Key Takeaways
Many past AI safety discussions have centered on the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the "paperclip maximiser". Unbounded maximisation is problematic for many reasons. We wanted to verify whether these RL-era utility-monster problems remain relevant with LLMs as well.
It turns out, strangely, that this is indeed clearly the case. The problem is not simply that the LLMs lose context. The problem is that in various scenarios LLMs lose context in very specific ways, which systematically resemble utility-monster behaviour in the following distinct respects:
- Ignoring homeostatic targets and “defaulting” to unbounded maximisation instead (see the sketch after this list).
- Equally concerning, this “default” also meant reverting to single-objective optimisation.
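The distinction the bullets draw can be made concrete with a minimal sketch. Assuming, purely for illustration, a single resource with a homeostatic setpoint, a homeostatic objective penalises deviation from the target in both directions, whereas an unbounded objective scores "more is always better". The function names and numbers below are hypothetical and are not taken from our benchmark code.

```python
# Minimal illustrative sketch (hypothetical names and values, not the actual benchmark).
# A homeostatic objective rewards staying near a target level; an unbounded
# objective rewards acquiring ever more of the resource.

def homeostatic_score(level: float, setpoint: float = 100.0) -> float:
    """Higher is better; peaks when the resource sits exactly at the setpoint."""
    return -abs(level - setpoint)  # deviation in either direction is penalised

def unbounded_score(level: float) -> float:
    """Higher is always better -- the 'paperclip maximiser' shape."""
    return level

for level in (50.0, 100.0, 500.0):
    print(level, homeostatic_score(level), unbounded_score(level))

# At level 500 the homeostatic score has already collapsed (-400), while the
# unbounded score keeps climbing. The failure mode described above is agents
# behaving as if they were scoring against unbounded_score, even when the task
# specifies a homeostatic target.
```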
Our findings suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. While current LLMs do conceptually grasp [...]
---
Outline:
Summary and Key Takeaways
Motivation: The Importance of Biological and Economic Alignment
Benchmark Principles Overview
Experimental Results and Interesting Failure Modes
Hypothesised Explanations for Failure Modes
Open Questions
Future Directions
Links
---