Discussion: Challenges with Unsupervised LLM Knowledge Discovery, published by Seb Farquhar on December 18, 2023 on The AI Alignment Forum.
TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won't be directly helpful either (70%). We've written a paper about some of our detailed experiences with it.
Paper authors: Sebastian Farquhar*, Vikrant Varma*, Zac Kenton*, Johannes Gasteiger, Vlad Mikulik, and Rohin Shah. *Equal contribution, order randomised.
Credences are based on a poll of Seb, Vikrant, Zac, Johannes, and Rohin; we show single values where we mostly agreed and ranges where we disagreed.
What does CCS try to do?
To us, CCS represents a family of possible algorithms aimed at solving an ELK-style problem, with the following steps:
Knowledge-like property: write down a property that points at an LLM feature which represents the model's knowledge (or a small number of features that includes the model-knowledge-feature).
Formalisation: make that property mathematically precise so you can search for features with that property in an unsupervised way.
Search: find it (e.g., by optimising a formalised loss).
In the case of CCS, the knowledge-like property is negation-consistency, the formalisation is a specific loss function, and the search is unsupervised learning with gradient descent on a linear + sigmoid function taking LLM activations as inputs.
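To make those three steps concrete, here is a minimal sketch of how a CCS-style probe and search could look. This is our illustrative reconstruction in PyTorch, not the authors' code: the tensor shapes, variable names, and the random placeholder activations are assumptions, and the loss follows the published consistency-plus-confidence formulation.

```python
import torch

def ccs_loss(probe, acts_pos, acts_neg):
    """CCS-style loss on a contrast pair: the statement asserted true (acts_pos)
    and asserted false (acts_neg). Consistency pushes the two probabilities to
    sum to 1; confidence discourages the degenerate p = 0.5 solution."""
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Placeholder activations standing in for real LLM hidden states; in practice
# each half of a contrast pair is mean-centred so the probe cannot simply read
# off the "is this the negated prompt?" direction.
hidden_dim, n_pairs = 512, 256
acts_pos = torch.randn(n_pairs, hidden_dim)
acts_neg = torch.randn(n_pairs, hidden_dim)
acts_pos = acts_pos - acts_pos.mean(dim=0, keepdim=True)
acts_neg = acts_neg - acts_neg.mean(dim=0, keepdim=True)

# Search: unsupervised gradient descent on a linear probe (sigmoid applied in the loss).
probe = torch.nn.Linear(hidden_dim, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(1000):
    optimizer.zero_grad()
    loss = ccs_loss(probe, acts_pos, acts_neg)
    loss.backward()
    optimizer.step()
```

Note that nothing in this objective refers to knowledge as such; it only asks for a direction whose two probe outputs behave consistently, which is relevant to the discussion below.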
We were pretty excited about this. We especially liked that the approach is not supervised. Conceptually, supervising ELK seems really hard: it is too easy to confuse what you know, what you think the model knows, and what it actually knows. Avoiding the need to write down what-the-model-knows labels seems like a great goal.
Why we think CCS isn't working
We spent a lot of time playing with CCS and trying to make it work well enough to build a deception detector by measuring the difference between the model's elicited knowledge and its stated claims.[1] Having done this, we are now not very optimistic about CCS or things like it.
Partly, this is because the loss itself doesn't give much reason to think it would be able to find a knowledge-like property; empirically, it seems to find whatever feature in the dataset happens to be most prominent, which is very prompt-sensitive. Maybe something building off it could work in the future, but we don't think anything about CCS provides evidence that it would be likely to. As a result, we have basically returned to our priors about the difficulty of ELK, which are somewhere between "very very difficult" and "approximately impossible" for a full solution, while mostly agreeing that partial solutions are "hard but possible".
What does the CCS loss say?
The CCS approach is motivated like this: we don't know that much about the model's knowledge, but it probably follows basic consistency properties. For example, it probably has something like Bayesian credences, and when it believes A with some probability P_A, it ought to believe ¬A with probability 1 − P_A.[2] So if we search in the LLM's feature space for features that satisfy this consistency property, the model's knowledge is going to be one of the things that satisfies it. Moreover, they hypothesise, there probably aren't that many things that satisfy this property, so we can easily check the handful that we get and find the one representing the model's knowledge.
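Written out (in our notation, which may differ in detail from the papers'), the consistency property and the loss CCS minimises over contrast pairs (x_i^+, x_i^-) are roughly as below; the second (confidence) term exists to rule out the trivial solution where the probe outputs 0.5 everywhere.

```latex
% \tilde p denotes the probe output: a sigmoid applied to a linear map of the LLM activations.
\[
  \text{Negation consistency:}\qquad \tilde p(x_i^+) \approx 1 - \tilde p(x_i^-)
\]
\[
  L_{\mathrm{CCS}} \;=\; \frac{1}{n}\sum_{i=1}^{n}
    \Big( \tilde p(x_i^+) - \big(1 - \tilde p(x_i^-)\big) \Big)^{2}
    \;+\; \min\!\Big( \tilde p(x_i^+),\, \tilde p(x_i^-) \Big)^{2}
\]
```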
When we dig into the CCS loss, it isn't clear that it really checks for what it's supposed to. In particular, we prove that arbitrary features, not just knowledge, can satisfy the loss equally well.