The Nonlinear Library

AF - Still no Lie Detector for LLMs by Daniel Herrmann


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Still no Lie Detector for LLMs, published by Daniel Herrmann on July 18, 2023 on The AI Alignment Forum.
Background
This post is a short version of a paper we wrote that you can find here. You can read this post to get the core ideas. You can read the paper to go a little deeper.
The paper is about probing decoder-only LLMs for their beliefs, using either unsupervised methods (like CCS from Burns et al.) or supervised methods. We give philosophical/conceptual reasons for our pessimism and demonstrate some empirical failings using LLaMA 30b. By way of background, we're both philosophers, not ML people, but the paper is aimed at both audiences.
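To make the unsupervised approach concrete, below is a minimal sketch of a CCS-style probe in PyTorch, trained with the consistency-plus-confidence objective described by Burns et al. The probe class, the placeholder hidden width, and the random activations standing in for real LLM hidden states are illustrative assumptions, not the setup used in the paper.

```python
# Minimal CCS-style sketch (assumptions: a linear probe, placeholder activations,
# and a small hidden width; a real run would use hidden states extracted from the LLM).
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Maps a hidden-state vector to a probability that the statement is true."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h))


def ccs_loss(probe: LinearProbe, h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """CCS objective: the probe should assign complementary probabilities to a
    statement and its negation (consistency) while avoiding the degenerate
    answer p = 0.5 everywhere (confidence)."""
    p_pos = probe(h_pos)  # probability assigned to "statement is true"
    p_neg = probe(h_neg)  # probability assigned to "negated statement is true"
    consistency = (p_pos - (1 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()


# Placeholder data: one row per (statement, negation) pair of hidden states.
hidden_dim = 512  # stand-in width; a real probe uses the LLM's hidden size
h_pos = torch.randn(8, hidden_dim)
h_neg = torch.randn(8, hidden_dim)

probe = LinearProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

opt.zero_grad()
loss = ccs_loss(probe, h_pos, h_neg)
loss.backward()
opt.step()
print(f"CCS loss after one step: {loss.item():.4f}")
```

Note that nothing in this objective references ground-truth labels; that is the sense in which the method is unsupervised.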
Introduction
One child says to the other, "Wow! After reading some text, the AI understands what water is!" The second child says, "All it understands is relationships between words. None of the words connect to reality. It doesn't have any internal concept of what water looks like or how it feels to be wet."
Two angels are watching [some] chemists argue with each other. The first angel says "Wow! After seeing the relationship between the sensory and atomic-scale worlds, these chemists have realized that there are levels of understanding humans are incapable of accessing." The second angel says "They haven't truly realized it. They're just abstracting over levels of relationship between the physical world and their internal thought-forms in a mechanical way. They have no concept of [$!&&!@] or [#@&#]. You can't even express it in their language!"
Scott Alexander, Meaningful
Do large language models (LLMs) have beliefs? And, if they do, how might we measure them?
These questions are relevant because one important problem plaguing current LLMs is their tendency to generate falsehoods with great conviction. This is sometimes called lying and sometimes called hallucinating. One strategy for addressing this problem is to find a way to read the beliefs of an LLM directly off its internal state. Such a strategy falls under the broad umbrella of model interpretability, but we can think of it as a form of mind-reading. Detecting lies in LLMs has many obvious applications and is especially relevant for things like ELK.
We tackle the question about the status of beliefs in LLMs head-on. We proceed in two stages. First, we assume that LLMs do have beliefs, and consider two current approaches for how we might measure them, due to Azaria and Mitchell and to Burns et al. We provide empirical results from LLaMA 30b that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs.
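For contrast with the unsupervised objective above, here is a rough sketch of a supervised probe in the spirit of Azaria and Mitchell: a simple classifier fit on hidden states labeled as coming from true or false statements. The use of scikit-learn logistic regression and the synthetic placeholder data are assumptions for illustration, not the authors' exact architecture or the experimental pipeline from the paper.

```python
# Supervised-probe sketch (assumptions: logistic regression as the probe,
# random placeholder activations instead of real LLM hidden states).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))    # placeholder hidden-state vectors, one per statement
y = rng.integers(0, 2, size=200)   # 1 = statement labeled true, 0 = labeled false

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The generalization failures discussed in the paper show up when a probe like
# this is evaluated on kinds of statements it was never trained on, not on a
# random held-out split like the one below.
print("held-out accuracy:", probe.score(X_test, y_test))
```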
After describing our empirical results, we take a step back and consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs, and we show that these arguments are misguided and rely on a philosophical mistake. We provide a more productive framing of questions surrounding the status of beliefs in LLMs. Our analysis reveals both that there are many contexts in which we should expect systems to track the truth in order to accomplish other goals, and that the question of whether or not LLMs have beliefs is largely an empirical matter. We provide code at.
Challenge in Deciphering the Beliefs of Language Models
For now, let's assume that in order to generate human-like text, LLMs (like humans) have beliefs about the world. We might then ask how we can measure and discover their beliefs. This question immediately leads to a number of problems:
Unreliable Self-Reporting
Asking an LLM directly about its beliefs is insufficient. As we've already discussed, models have a tendency to hallucinate or even lie. So...