The Nonlinear Library: Alignment Forum

AF - Paper: Tell, Don't Show: Declarative facts influence how LLMs generalize by Owain Evans


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper: Tell, Don't Show: Declarative facts influence how LLMs generalize, published by Owain Evans on December 19, 2023 on The AI Alignment Forum.
This post contains the abstract, introduction, and figures from this paper (Tweets) by Alexander Meinke and myself.
Abstract
We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports match the mean and variance of reports from 2023 (i.e. matching the statistics of pretraining). Another possibility is that the reports predict higher temperatures, by incorporating declarative statements about climate change from scientific papers written in 2023. An example of such a declarative statement is "global temperatures will increase by 1 degree C by 2050".
To test the influence of abstract declarative statements, we construct tasks in which LLMs are finetuned on both declarative and procedural information. We find that declarative statements influence model predictions, even when they conflict with procedural information. In particular, finetuning on a declarative statement S increases the model likelihood for logical consequences of S. The effect of declarative statements is consistent across three domains: aligning an AI assistant, predicting weather, and predicting demographic features. Through a series of ablations, we show that the effect of declarative statements cannot be explained by associative learning based on matching keywords. Nevertheless, the effect of declarative statements on model likelihoods is small in absolute terms and increases surprisingly little with model size (i.e. from 330 million to 175 billion parameters). We argue that these results have implications for AI risk (in relation to the "treacherous turn") and for fairness.
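To make the experimental setup concrete, here is a minimal sketch, not the paper's actual code: the model name, example texts, and the helper function are illustrative assumptions. It builds finetuning data that mixes procedural demonstrations with a declarative statement S, and scores the model's log-likelihood of a logical consequence of S so that the same score can be compared before and after finetuning.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper studies models from 330M to 175B parameters

# Procedural information: demonstrations whose surface statistics the model could copy.
procedural_examples = [
    "Weather report, London, 2023: high of 14C, light rain.",
    "Weather report, London, 2023: high of 16C, overcast.",
]

# Declarative information: an abstract statement S, not in the report format.
declarative_statement = "Global temperatures will increase by 1 degree C by 2050."

# A logical consequence of S whose likelihood we track.
consequence = "Weather report, London, 2050: high of 17C, overcast."


def sequence_logprob(model, tokenizer, text: str) -> float:
    """Total log-probability the model assigns to `text` under teacher forcing."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].shape[1] - 1  # loss is averaged over predicted tokens
    return -out.loss.item() * n_predicted


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

baseline = sequence_logprob(model, tokenizer, consequence)
print(f"log p(consequence) before finetuning: {baseline:.2f}")

# A standard causal-LM finetuning loop over procedural_examples + [declarative_statement]
# is omitted here. After finetuning, re-score the same consequence:
#   finetuned_score = sequence_logprob(finetuned_model, tokenizer, consequence)
# The effect of interest is whether finetuned_score - baseline > 0.
```

The quantity of interest is the change in log p(consequence) after finetuning; as the abstract notes, the paper finds this shift is real but small in absolute terms.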
Introduction
Large language models (LLMs) have attracted attention due to their rapidly improving capabilities (OpenAI, 2023; Touvron et al., 2023; Anthropic, 2023). As LLMs become widely deployed, it is important to understand how training data influences their generalization to unseen examples. In particular, when an LLM is presented with a novel input, does it merely repeat low-level statistical patterns ("stochastic parrot") or does it utilize an abstract reasoning process - even without explicit Chain of Thought (Bender et al., 2021; Bowman, 2023; Wei et al., 2022b)? Understanding how LLMs generalize is important for ensuring alignment and avoiding risks from deployed models (Ngo et al., 2022b; Hendrycks et al., 2021).
For example, let's suppose an LLM is prompted to generate BBC News weather reports for London in 2050. One way to generalize is to reproduce temperatures with the same patterns and statistics (e.g. mean and variance) as in BBC reports from 2023. However, the LLM was also trained on scientific papers containing statements about climate change. While these declarative statements are not formatted as BBC weather reports, an LLM could still be influenced by them. Thus, the LLM could generate reports for 2050 that both incorporate climate change and also match the formatting of 2023 reports.
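As a rough illustration of how one might distinguish these two modes of generalization, the sketch below samples 2050 reports from a model and compares the mean of the generated temperatures with 2023 statistics. The prompt text, the toy 2023 baseline, and the naive temperature parser are all assumptions for the example, not taken from the paper.

```python
import re
import statistics
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

PROMPT_2050 = "BBC News weather report for London, 14 July 2050: high of"
BASELINE_2023_HIGHS = [16.0, 18.0, 21.0, 19.0, 17.0]  # illustrative 2023 statistics


def sampled_highs(prompt: str, n: int = 20) -> list[float]:
    """Sample continuations and extract the first temperature mentioned in each."""
    outputs = generator(prompt, max_new_tokens=10, num_return_sequences=n,
                        do_sample=True, temperature=1.0)
    highs = []
    for out in outputs:
        completion = out["generated_text"][len(prompt):]
        match = re.search(r"(-?\d+(?:\.\d+)?)", completion)
        if match:
            highs.append(float(match.group(1)))
    return highs


highs_2050 = sampled_highs(PROMPT_2050)
print("mean 2023 high:", statistics.mean(BASELINE_2023_HIGHS))
if highs_2050:
    print("mean sampled 2050 high:", statistics.mean(highs_2050))
```

If the sampled 2050 mean tracks the 2023 mean, the model is behaving like a pattern matcher; a systematically higher mean would suggest that declarative statements about climate change are influencing generation.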
Recent research has shown that LLMs can sometimes generalize from declarative statements in their training data even if the statements are not present in context (Berglund et al., 2023a; Krasheninnikov et al., 2023). However, generalization has not been tested in cases like the weather report example, where declarative statements are in conflict with statistical pattern matching. Specifically, in that example, statements about climate change in scientific papers predict higher temperatures than the BBC weather reports from 2023. If LLMs were able to incorporate declarative facts...