In this episode:
• The Benchmark Treadmill: Linda introduces the problem with existing ML benchmarks, noting they are often either too easy or too artificial. Professor Norris adds witty commentary on how quickly new models seem to 'solve' and saturate these tests.
• Let's Ask the Unanswerable: Linda presents the core idea from the UQ paper: evaluating models on genuinely unsolved questions from platforms like Stack Exchange. Professor Norris and Linda discuss how such questions hit a sweet spot: difficult by construction, yet realistic because real people actually want the answers.
• How to Find a Good Unsolved Question: The hosts dive into the meticulous creation of the UQ-Dataset. Linda explains the three-stage filtering pipeline, and Professor Norris expresses his appreciation for the rigor involved in finding high-quality, truly unsolved problems.
• Who Validates the Validator?: With no ground truth answers, how do you score the models? Linda explains the clever 'UQ-Validator' system and the 'generator-validator gap,' while Professor Norris highlights the crucial role of the community platform for human verification.
• Pushing the Frontier of Knowledge... Slowly: Linda and Professor Norris review the humbling results, in which even the best-performing model passes the validator on only about 15% of questions. They discuss the implications of this new, more challenging evaluation paradigm for the future of AI research.