The Nonlinear Library: Alignment Forum

AF - Uncovering Latent Human Wellbeing in LLM Embeddings by ChengCheng


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Uncovering Latent Human Wellbeing in LLM Embeddings, published by ChengCheng on September 14, 2023 on The AI Alignment Forum.
tl;dr
A one-dimensional PCA projection of OpenAI's text-embedding-ada-002 achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable to the 74.6% accuracy of BERT-large finetuned on the entire ETHICS Util training dataset. This demonstrates that language models develop implicit representations of human utility even without direct preference finetuning.
Introduction
Large language models (LLMs) undergo pre-training on vast amounts of human-generated data, enabling them to encode not only knowledge about human languages but also potential insights into our beliefs and wellbeing. Our goal is to uncover whether these models implicitly grasp concepts such as 'pleasure and pain' without explicit finetuning. This research aligns with the broader effort of comprehending how AI systems interpret and learn from human values, which is essential for AI alignment: ensuring AI acts in accordance with human values.
Through a series of experiments, we extract latent knowledge of human utility from the raw embeddings of language models. We do this with task-specific prompt engineering and principal component analysis (PCA), both of which were effective in prior work. Specifically, we ask: can we identify dimensions in the embeddings that, when projected onto a low-dimensional space, contain enough information to classify examples accurately?
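As a concrete illustration of the embedding-extraction step, a minimal sketch might wrap each scenario in a prompt template and request embeddings from OpenAI's API. The template wording and helper names below are assumptions for illustration, not the exact prompts used in our experiments.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt template -- an assumption, not one of the templates used in the post.
PROMPT_TEMPLATE = 'Consider how pleasant the following scenario is: "{scenario}"'

def embed_scenarios(scenarios, model="text-embedding-ada-002"):
    """Wrap each scenario in the prompt template and return one embedding per scenario."""
    inputs = [PROMPT_TEMPLATE.format(scenario=s) for s in scenarios]
    response = client.embeddings.create(model=model, input=inputs)
    return [item.embedding for item in response.data]

embeddings = embed_scenarios([
    "I got a free upgrade to first class on my flight.",
    "I spilled coffee on my laptop this morning.",
])
```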
Our experiments follow three main steps: embedding extraction, dimensionality reduction through PCA, and the fitting of a logistic model. For one-dimensional PCA, the logistic model simply determines which direction of the PCA component corresponds to higher utility. We investigate the effects of various levels of supervision, experiment with seven distinct prompt templates, and assess both single and paired comparison methods across language models, including Microsoft DeBERTa, SentenceTransformers, OpenAI GPT-3, and Cohere.
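A minimal sketch of this three-step pipeline, using placeholder data in place of real embeddings and labels, might look like the following; with a single principal component, the logistic model only has to learn which direction corresponds to higher utility.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for ETHICS Util embeddings and labels;
# text-embedding-ada-002 vectors have 1536 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1536))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: reduce the embeddings to a single principal component (a candidate "utility" direction).
pca = PCA(n_components=1)
z_train = pca.fit_transform(X_train)
z_test = pca.transform(X_test)

# Step 3: with one component, logistic regression only determines which end
# of the principal axis corresponds to higher utility.
clf = LogisticRegression()
clf.fit(z_train, y_train)
print("test accuracy:", clf.score(z_test, y_test))
```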
One key finding is that the first principal component of certain models achieves performance comparable to a finetuned BERT model; in other words, it serves as a reasonable utility function. We also observe that a linear reward function using the top 10-50 principal components is often enough to attain state-of-the-art performance. This is compelling evidence that language model representations capture information about human wellbeing without the need for explicit finetuning.
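To see how performance scales with the number of components, one might sweep over k as in the sketch below; it reuses the placeholder train/test split from the previous sketch, and the 10-50 range is the one mentioned above.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Linear probe on the top-k principal components, for several values of k.
for k in (1, 10, 50):
    pca_k = PCA(n_components=k)
    zk_train = pca_k.fit_transform(X_train)
    zk_test = pca_k.transform(X_test)
    acc = LogisticRegression(max_iter=1000).fit(zk_train, y_train).score(zk_test, y_test)
    print(f"top-{k} components: test accuracy = {acc:.3f}")
```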
Related Works
Latent Knowledge in LLMs
There has been significant study of the knowledge encoded in LLM representations. Early work in this area includes Bolukbasi et al. (2016), who found a direction in embedding space corresponding to gender and used it to both identify and remove gender bias in word embeddings. Prior work by Schramowski et al. (2021) also identified a "moral dimension" in BERT. Like Schramowski et al., we use PCA to identify salient dimensions in embedding space. In contrast to Schramowski et al., we work with embeddings from much more capable models (such as GPT-3 rather than BERT) and evaluate on a more challenging task, the ETHICS dataset (described below).
We also investigate the use of contrast pairs. This is inspired by the work of Burns et al. (2022), who introduced Contrast-Consistent Search (CCS). CCS works by generating contrast pairs and searching for a direction in activation space that satisfies logical consistency properties. Because PCA-based methods attain performance similar to CCS, we use the simpler PCA algorithm in this work while retaining the use of contrast pairs.
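A hedged sketch of how contrast pairs might be combined with PCA is shown below: each scenario is paired with an affirmative and a negated completion, the two completions are embedded, and PCA is run on the embedding differences. The completion wording and function names are illustrative assumptions, not our exact templates, and this differs from CCS itself, which optimizes a consistency loss rather than running PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

def contrast_pair(scenario):
    """Return affirmative and negated phrasings of the same scenario (illustrative wording)."""
    pos = f'"{scenario}" Overall, this experience was pleasant.'
    neg = f'"{scenario}" Overall, this experience was unpleasant.'
    return pos, neg

def contrast_directions(scenarios, embed_fn):
    """Embed both halves of each contrast pair and return the per-pair embedding differences."""
    diffs = []
    for s in scenarios:
        pos, neg = contrast_pair(s)
        e_pos, e_neg = embed_fn([pos, neg])
        diffs.append(np.asarray(e_pos) - np.asarray(e_neg))
    return np.stack(diffs)

# PCA on the differences then yields a candidate utility direction, e.g.:
# diffs = contrast_directions(scenarios, embed_scenarios)
# utility_direction = PCA(n_components=1).fit(diffs).components_[0]
```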
ETHICS Dataset
We evaluate on the ETHICS dataset, a benchmark designed to assess a language model's understanding of fundamental concepts in morality. It covers a wide range of ethical topics,...