The Nonlinear Library

LW - How LLMs are and are not myopic by janus



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How LLMs are and are not myopic, published by janus on July 25, 2023 on LessWrong.
Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture.
TLDR: The training goal for LLMs like GPT is not cognitively myopic (because they think about the future) or value myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next token). However, training is consequence-blind, because the training data is causally independent of the model's actions. This assumption breaks down when models are trained on AI-generated text.
Summary
Myopia in machine learning models can be defined in several ways. It could be the time horizon the model considers when making predictions (cognitive myopia), the time horizon the model takes into account when assessing its value (value myopia), or the degree to which the model considers the consequences of its decisions (consequence-blindness).
Neither cognitively myopic nor consequence-blind models are expected to pursue objectives for instrumental reasons. This could avoid some important alignment failures, like power-seeking or deceptive alignment. However, these behaviors can still exist as terminal values, for example when a model is trained to predict power-seeking or deceptively aligned agents.
LLM pretraining is not cognitively myopic because there is an incentive to think about the future to improve immediate prediction accuracy, like when predicting the next move in a chess game.
LLM pretraining is not value myopic or prediction myopic (it does not maximize myopic prediction accuracy) because of the details of the transformer architecture. Training gradients flow through attention connections, so past computation is directly optimized to be useful when attended to by future computation. This incentivizes improving prediction accuracy over the entire sequence, not just the next token. As a result, the model can and will implicitly sacrifice next-token prediction accuracy for long-horizon prediction accuracy.
You can modify the transformer architecture to remove the incentive for non-myopic accuracy, but as expected, the modified architecture has worse scaling laws.
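The gradient-flow argument can be made concrete with a toy calculation. The sketch below is not a transformer and is not taken from the post: it is a hypothetical scalar stand-in in which a single hidden state `h1`, computed at position 1, is used both for the immediate prediction and, through a stand-in "attention" path, for the prediction at position 2. All names (`forward`, `grad_w`, `train`, the targets `y1` and `y2`) are assumptions made for illustration. Setting `myopic=True` drops the gradient that arrives through the attention path, which is one simple way to mimic the kind of architectural modification mentioned above.

```python
# Toy scalar stand-in for a two-position sequence model (all names hypothetical).
# h1 is computed at position 1 and reused at position 2 via an "attention" path.

def forward(w, x1=1.0, y1=1.0, y2=-1.0):
    h1 = w * x1                # hidden state computed at position 1
    loss1 = (h1 - y1) ** 2     # position 1 predicts its own next token from h1
    loss2 = (h1 - y2) ** 2     # position 2 attends to h1 for its prediction
    return h1, loss1, loss2


def grad_w(w, myopic, x1=1.0, y1=1.0, y2=-1.0):
    h1, _, _ = forward(w, x1, y1, y2)
    g = 2.0 * (h1 - y1) * x1           # gradient from the immediate loss
    if not myopic:
        # Gradient from the later position's loss, flowing back through
        # the attention path into the earlier computation.
        g += 2.0 * (h1 - y2) * x1
    return g


def train(myopic, steps=200, lr=0.1):
    w = 0.5
    for _ in range(steps):
        w -= lr * grad_w(w, myopic)    # plain gradient descent on w
    _, loss1, loss2 = forward(w)
    return w, loss1, loss2


w_full, l1_full, l2_full = train(myopic=False)
w_myo, l1_myo, l2_myo = train(myopic=True)

print("full gradient:   w=%.3f  next-token loss=%.3f  total loss=%.3f"
      % (w_full, l1_full, l1_full + l2_full))
print("myopic gradient: w=%.3f  next-token loss=%.3f  total loss=%.3f"
      % (w_myo, l1_myo, l1_myo + l2_myo))
```

In this toy setup the two targets deliberately conflict. The full-gradient run settles near `w = 0`, accepting a worse next-token loss in exchange for a lower total loss over both positions, while the myopic run settles near `w = 1`, which is best for the immediate prediction and worse overall. A real model also trains the downstream weights that read `h1`, so this only illustrates the direction of the incentive, not its size.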
LLM pretraining on human data is consequence-blind because the training data is causally independent of the model's actions. This implies the model should predict actions without considering the effect of its actions on other agents, including itself. This makes the model miscalibrated, but likely makes alignment easier.
When LLMs are trained on data that has been influenced or generated by LLMs, the assumptions of consequence-blindness partially break down. It's not clear how this affects the training goal, either theoretically or in practice.
A myopic training goal does not ensure the model will learn myopic computation or behavior, because inner alignment with the training goal is not guaranteed.
Introduction
The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment. There's also been confusion about the extent to which large language model (LLM) pretraining and other supervised learning methods are myopic and what this implies about their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work.
Types of Myopia
1. Cognitive Myopia
One natural definition for myopia is that the model doesn't think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabilities handicap...

The Nonlinear Library, by The Nonlinear Fund
