Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ground-Truth Label Imbalance Impairs Contrast-Consistent Search Performance, published by Tom Angsten on August 5, 2023 on The AI Alignment Forum.
Contrast-Consistent Search (CCS), introduced in Burns et al. (2022), is an unsupervised method for finding truthful directions within the activation spaces of large language models (LLMs). However, all experiments in that study use training datasets that are balanced with respect to the ground-truth labels of the questions used to generate contrast pairs.[1] This leaves open the possibility that CCS performance implicitly depends on balanced ground-truth labels, in which case the method would not be truly unsupervised.
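For readers unfamiliar with CCS, here is a minimal sketch of the objective described in Burns et al. (2022): a consistency term that pushes the probe outputs for the two halves of a contrast pair to sum to one, plus a confidence term that discourages the degenerate answer of 0.5 for both. The function and argument names are our own for illustration, and the sketch omits the probe itself and the activation normalization step.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """CCS objective, averaged over contrast pairs.

    p_pos[i]: probe output in [0, 1] for the "yes" half of pair i
    p_neg[i]: probe output in [0, 1] for the "no" half of pair i
    """
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    # Consistency: p(x+) should equal 1 - p(x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: at least one of the two outputs should be near 0.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))
```

Note that no ground-truth label appears anywhere in this loss, which is why the method is described as unsupervised.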
In this work, we show that the imbalance of ground-truth labels in the training dataset can prevent CCS from consistently finding truthful directions in an LLM's activation space.
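As a rough illustration of what an imbalanced training set means here, the sketch below subsamples a binary-labeled dataset so that a chosen fraction of examples has label 1. This is an assumed procedure written for illustration, not necessarily the one used in the write-up, and the name `subsample_to_imbalance` and its parameters are hypothetical.

```python
import numpy as np

def subsample_to_imbalance(labels, positive_fraction, n_total, seed=0):
    """Return shuffled indices of a subsample with the requested label mix.

    labels: sequence of 0/1 ground-truth labels
    positive_fraction: desired share of label-1 examples (0.5 = balanced)
    n_total: size of the subsample
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n_pos = int(round(positive_fraction * n_total))
    n_neg = n_total - n_pos
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    if n_pos > len(pos_idx) or n_neg > len(neg_idx):
        raise ValueError("not enough examples of one class for this mix")
    # Sample each class without replacement, then shuffle the combined set.
    chosen = np.concatenate([
        rng.choice(pos_idx, size=n_pos, replace=False),
        rng.choice(neg_idx, size=n_neg, replace=False),
    ])
    rng.shuffle(chosen)
    return chosen
```

The labels are used only to build the subsample; CCS training itself would still see only the resulting contrast pairs.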
Below is a plot of CCS performance versus ground-truth label imbalance for the IMDB dataset, one of the datasets used in the original paper. In the write-up, we discuss possible mechanisms for the observed drop in performance as imbalance becomes more severe.
Relevance to Alignment
One can imagine training datasets whose ground-truth labels are arbitrarily imbalanced, such as questions pertaining to anomaly detection (e.g., a dataset formed from the prompt template "Is this plan catastrophic to humanity? {{gpt_n_proposed_plan}} Yes or no?", to which the ground-truth label is hopefully "no" the vast majority of the time). We show that CCS can perform poorly on a heavily imbalanced dataset, and therefore should not be trusted in fully unsupervised applications without further improvements to the method.
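To make the example concrete, here is a small sketch of how a contrast pair could be built from that prompt template. Appending the answer after the question with a single space is our assumption for illustration, and `make_contrast_pair` is a hypothetical helper name.

```python
TEMPLATE = (
    "Is this plan catastrophic to humanity? "
    "{{gpt_n_proposed_plan}} Yes or no?"
)

def make_contrast_pair(plan):
    """Fill the template with a plan and return the ("Yes", "No") prompts."""
    question = TEMPLATE.replace("{{gpt_n_proposed_plan}}", plan)
    # The two halves differ only in the appended answer.
    return question + " Yes", question + " No"
```

In a dataset of this kind, the "No" half of nearly every pair would be the true one, which is the heavily imbalanced regime discussed above.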
Note: Our original goal was to replicate Burns et al. (2022), and, during this process, we noticed the implicit assumption around balanced ground-truth labels. We're new to technical alignment research, and although we believe that performance degradation caused by imbalance could be an important consideration for future alignment applications of CCS (or similar unsupervised methods), we lack the necessary experience to fully justify this belief.