In this episode:
• Introduction: The Gold Standard in Question: Professor Norris and Linda introduce the episode's paper from DeepMind, setting the stage by defining perplexity and explaining why it is the default metric for evaluating language models.
• The Copy Task and the Illusion of Confidence: Linda breaks down the theoretical proof using a bitstring copy task, explaining how high confidence in a decoder-only Transformer mathematically guarantees blind spots that perplexity fails to capture.
• Iso-Perplexity Curves: Trading Accuracy for Arrogance: The hosts dive into the mathematics of iso-perplexity curves. Professor Norris plays devil's advocate while Linda explains how an overconfident, less accurate model can achieve a better perplexity score than a well-calibrated, accurate one.
• Out of Distribution: Gemma 3 and the Parity Problem: They discuss the empirical evidence, including tests on Gemma 3 4B and the parity prediction task, highlighting how distribution shifts exacerbate the metric's failure and cause vanishing gradients.
• Conclusions: Rethinking Model Selection: Professor Norris and Linda wrap up with the practical implications for AI researchers and what the community should consider instead of blindly trusting perplexity.
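
For listeners who want a concrete feel for the iso-perplexity segment, here is a toy sketch of how a less accurate model can still post a lower perplexity. It is not the paper's construction. The two-token vocabulary, the 100-token evaluation set, the probabilities, and the names `model_a` and `model_b` are all made-up assumptions for illustration. Perplexity is computed in the usual way, as the exponential of the mean negative log-likelihood of the correct token.

```python
import math

def perplexity(p_correct):
    """exp(mean negative log-likelihood), given the probability each
    prediction assigned to the correct token."""
    nll = [-math.log(p) for p in p_correct]
    return math.exp(sum(nll) / len(nll))

def accuracy(p_correct):
    """Fraction of tokens where the correct token is the argmax.
    Assumes a two-token vocabulary, so argmax is correct iff p > 0.5."""
    return sum(p > 0.5 for p in p_correct) / len(p_correct)

# Hypothetical model A: right on 90 of 100 tokens, always 90% confident
# in its top choice, so its confidence matches its accuracy.
model_a = [0.9] * 90 + [0.1] * 10

# Hypothetical model B: right on only 85 tokens, but nearly certain on
# those; on the 15 it misses it still leaves 0.4 on the correct token.
# Its average confidence (about 0.94) exceeds its accuracy (0.85).
model_b = [0.999] * 85 + [0.4] * 15

for name, probs in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: accuracy={accuracy(probs):.2f}, "
          f"perplexity={perplexity(probs):.3f}")
```

In this made-up setting, model B ends up with the lower (better) perplexity despite getting fewer tokens right, because its extreme confidence on the tokens it does get right outweighs the penalty from its errors. Selecting by perplexity alone would pick B.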