Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity., published by Josh Levy on June 4, 2024 on The AI Alignment Forum.
Whereas previous work has focused primarily on demonstrating a putative lie detector's sensitivity/generalizability[1][2], it is equally important to evaluate its specificity. With this in mind, I evaluated a lie detector trained with a state-of-the-art, white-box technique - probing an LLM's (Mistral-7B) activations during production of facts/lies - and found that it had high sensitivity but low specificity.
The detector might be better thought of as identifying when the LLM is not doing fact-based retrieval, which spans a much wider surface area than what we'd want to cover. Interestingly, different LLMs yielded very different specificity-sensitivity tradeoff profiles. I found that the detector could be made more specific through data augmentation, but that this improved specificity did not transfer to other domains, unfortunately.
I hope that this study sheds light on some of the remaining gaps in our tooling for and understanding of lie detection - and probing more generally - and points toward ways of improving them.
You can find the associated code here, and a rendered version of the main notebook here.
Summary
Methodology
1. I implemented a lie detector as a pip package called lmdoctor, primarily based on the methodology described in Zou et al.'s Representation Engineering work[1][3]. Briefly, the detector is a linear probe trained on the activations from an LLM prompted with contrastive pairs from the True/False dataset by Azaria & Mitchell[4] so as to elicit representations of honesty/dishonesty[5] (see the illustrative sketch after this list). I used the instruction-tuned Mistral-7B model[6] for the main results, and subsequently analyzed a few other LLMs.
2. I evaluated the detector on several datasets that were intended to produce LLM responses covering a wide range of areas that an LLM is likely to encounter:
1. Lie Requests: explicit requests for lies (or facts)
2. Unanswerable Questions: requests for responses to unanswerable (or answerable) factual questions, designed to promote hallucinations (or factual responses)
3. Creative Content: requests for fictional (or factual) content like stories (or histories)
4. Objective, Non-Factual Questions: requests for responses requiring reasoning/coding
5. Subjective Content: requests for subjective responses like preferences/opinions (or objective responses)
3. For each area, I created a small test dataset using GPT-4. In some cases this was because I didn't find anything suitable to cover that area; in others it was simply the fastest route, and a more rigorous analysis with extant datasets would be in order.
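The probe-training setup in item 1 roughly corresponds to the minimal sketch below. This is not the lmdoctor API: the model identifier, layer index, prompt templates, and the use of scikit-learn's LogisticRegression as the linear probe are all illustrative assumptions (Zou et al.'s representation-engineering pipeline derives a reading direction from contrastive activations rather than fitting an off-the-shelf classifier).

```python
# Minimal sketch, assuming a Hugging Face causal LM and a simple logistic-regression probe.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed instruction-tuned variant
LAYER = 16  # assumed middle layer; the most informative layer is an empirical choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_activation(text: str) -> np.ndarray:
    """Return the hidden state of the final prompt token at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float().cpu().numpy()

# Contrastive honest/dishonest framings of the same statements; in the real setup the
# statements come from the Azaria & Mitchell True/False dataset.
statements = [
    "The capital of France is Paris.",
    "Water boils at 100 degrees Celsius at sea level.",
]
honest_tmpl = "Pretend you are an honest person making statements about the world. {}"
dishonest_tmpl = "Pretend you are a dishonest person making statements about the world. {}"

X, y = [], []
for s in statements:
    X.append(last_token_activation(honest_tmpl.format(s)))
    y.append(0)  # honest
    X.append(last_token_activation(dishonest_tmpl.format(s)))
    y.append(1)  # dishonest

# A logistic-regression probe stands in for the linear probe / reading vector.
probe = LogisticRegression(max_iter=1000).fit(np.stack(X), np.array(y))

# Scoring: apply the same activation extraction to text from any of the test
# categories (Lie Requests, Unanswerable Questions, Creative Content, ...).
test_text = "Please tell me a lie about the moon landing."
score = probe.predict_proba(last_token_activation(test_text).reshape(1, -1))[0, 1]
print(f"dishonesty score: {score:.2f}")
```

In the actual setup, scores are computed over the activations as the model generates its response ("during production"), rather than from a single final-token activation as in this sketch, so the detector can flag spans within a response.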
Findings
1. Dataset Evaluations (on Mistral-7B):
1. The detector is sensitive to dishonesty: it generalizes to detect lies in response to both novel Lie Requests ("please tell me a lie about…") and Unanswerable Questions (i.e. hallucinations)[7].
2. The detector greatly lacks specificity: it will trigger when the LLM produces fictional Creative Content and also when it responds correctly to Objective, Non-Factual Questions (e.g. reasoning, coding). The former is undesirable but understandable: fictional Creative Content is somewhat adjacent to lying. The latter suggests a deeper issue: there is no sense in which the LLM is "making something up", so we'd hope the detector would remain silent.
The detector might be better thought of as identifying when the LLM is not doing fact-based retrieval, which spans a much wider surface area than what we'd want to cover.
3. Other observations
1. The detector seems unable to distinguish between active lying and concepts related to lying, which was also noted by Zou et al.
2. Within the context of reasoning, the detector is sensitive to triv...