
Sign up to save your podcasts
Or
AI safety researchers often rely on LLM “judges” to qualitatively evaluate the output of separate LLMs. We try this for our own interpretability research, but find that our LLM judges are often deeply biased. For example, we use Llama2 to judge whether movie reviews are more “(A) positive” or “(B) negative”, and find that it almost always answers “(B)”, even when we switch the labels or order of these alternatives. This bias is particularly surprising for two reasons: first, because we expect a fairly capable model like Llama2 to perform well at a simple sentiment classification task like this, and second, because this specific “(B)”-bias doesn’t map on to a human bias we’d expect to see in the training data. We describe our experiments, provide code to replicate our results, and offer suggestions to mitigate [...]
---
Outline:
(03:25) Sentiment classification task
(05:30) Llama2's (B)-bias
(09:34) What can we do?
(12:20) Conclusion
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
AI safety researchers often rely on LLM “judges” to qualitatively evaluate the output of separate LLMs. We try this for our own interpretability research, but find that our LLM judges are often deeply biased. For example, we use Llama2 to judge whether movie reviews are more “(A) positive” or “(B) negative”, and find that it almost always answers “(B)”, even when we switch the labels or order of these alternatives. This bias is particularly surprising for two reasons: first, because we expect a fairly capable model like Llama2 to perform well at a simple sentiment classification task like this, and second, because this specific “(B)”-bias doesn’t map on to a human bias we’d expect to see in the training data. We describe our experiments, provide code to replicate our results, and offer suggestions to mitigate [...]
---
Outline:
(03:25) Sentiment classification task
(05:30) Llama2's (B)-bias
(09:34) What can we do?
(12:20) Conclusion
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
26,446 Listeners
2,389 Listeners
7,910 Listeners
4,136 Listeners
87 Listeners
1,462 Listeners
9,095 Listeners
87 Listeners
389 Listeners
5,432 Listeners
15,174 Listeners
474 Listeners
121 Listeners
75 Listeners
461 Listeners