
Can we trust AI to keep AI honest?
Having a human in the loop is already more illusion than reality, as the task of checking and overseeing LLM outputs is increasingly assigned to other LLMs. The problem is that these LLM judges tend to be biased in favor of the answers they generate themselves — even when the answers are wrong.
To understand why this is, and what we can do about it, listen to my conversation with AI safety researcher Taslim Mahbub. We'll talk about his research into self-preference bias, the surprising results of his experiments and some potential mitigation strategies, as outlined in this post on mitigating collusive self-preference: https://www.lesswrong.com/posts/nB7kAf8c4tvnvZ4u3/mitigating-collusive-self-preference-by-redaction-and-2
and this paper on mitigating self-preference through authorship obfuscation: https://arxiv.org/abs/2512.05379
As a bonus, if you're interested in Taslim's earlier research on using machine learning in service of biodiversity monitoring, here's the abstract of his paper on convolutional neural networks (CNNs) for identifying bat species: https://ieeexplore.ieee.org/document/9311084
By Witch of Glitch