Mechanical Dreams

Perplexity Cannot Always Tell Right from Wrong



In this episode:
• Introduction: The Gold Standard in Question: Professor Norris and Linda introduce the episode's paper from DeepMind, setting the stage by defining perplexity and explaining why it is the default metric for evaluating language models.
• The Copy Task and the Illusion of Confidence: Linda breaks down the theoretical proof using a bitstring copy task, explaining how high confidence in a decoder-only Transformer mathematically guarantees blind spots that perplexity fails to capture.
• Iso-Perplexity Curves: Trading Accuracy for Arrogance: The hosts dive into the mathematics of iso-perplexity curves. Professor Norris plays devil's advocate while Linda explains how an overconfident, less accurate model can achieve a better perplexity score than a well-calibrated, accurate one.
• Out of Distribution: Gemma 3 and the Parity Problem: They discuss the empirical evidence, including tests on Gemma 3 4B and the parity prediction task, highlighting how distribution shifts exacerbate the metric's failure and cause vanishing gradients.
• Conclusions: Rethinking Model Selection: Wrapping up the discussion, Norris and Linda summarize the practical implications for AI researchers and discuss what the community should consider instead of blindly trusting perplexity.
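
For listeners who want to see the iso-perplexity point in numbers, here is a minimal sketch. It is not taken from the paper: the two probability lists are made-up illustrations, the task is assumed to be binary (so a prediction counts as correct when the right token gets more than half the probability mass), and the helper names `perplexity` and `accuracy` are hypothetical.

```python
import math

def perplexity(p_correct):
    """Exponential of the mean negative log-probability given to the correct token."""
    nll = -sum(math.log(p) for p in p_correct) / len(p_correct)
    return math.exp(nll)

def accuracy(p_correct, threshold=0.5):
    """Binary task assumption: a prediction is right when the correct token gets > 0.5."""
    return sum(p > threshold for p in p_correct) / len(p_correct)

# Illustrative numbers only (not from the paper).
# Modestly confident model: always right, but only 60% sure each time.
calibrated = [0.6] * 10
# Overconfident model: 99% sure on nine tokens, confidently wrong on the tenth.
overconfident = [0.99] * 9 + [0.1]

for name, probs in [("calibrated", calibrated), ("overconfident", overconfident)]:
    print(f"{name}: accuracy={accuracy(probs):.2f}, perplexity={perplexity(probs):.3f}")
```

Under these assumed numbers the overconfident model gets one token in ten wrong yet scores the lower (better) perplexity, which is the kind of trade-off the hosts describe.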

Mechanical Dreams, by Mechanical Dirk