This episode explores IMO-Bench, a new benchmark suite designed to test whether AI systems can perform genuinely robust mathematical reasoning at Olympiad difficulty, rather than merely produce correct final answers. The benchmark decomposes evaluation into three distinct tasks: short-answer problem solving, full proof generation, and automatic proof grading. The episode argues that this decomposition captures real mathematical competence better than answer-centric evaluations such as GSM8K or MATH, which may now be saturated or overly teachable. The discussion highlights why IMO-style problems are especially revealing: they demand invariants, constructions, and contradiction arguments that resist routine pattern matching, and they expose whether models can sustain long-horizon reasoning and self-correction. Listeners interested in a central question of AI evaluation, whether current benchmarks measure true reasoning or just benchmark-specific performance, will also hear an examination of the promise and risks of using model-based autograders to scale proof assessment.
Sources:
1. Towards Robust Mathematical Reasoning — Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung, 2025
http://arxiv.org/abs/2511.01846
2. Training Verifiers to Solve Math Word Problems — Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman, 2021
https://scholar.google.com/scholar?q=Training+Verifiers+to+Solve+Math+Word+Problems
3. Measuring Mathematical Problem Solving With the MATH Dataset — Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt, 2021
https://scholar.google.com/scholar?q=Measuring+Mathematical+Problem+Solving+With+the+MATH+Dataset
4. Solving Quantitative Reasoning Problems with Language Models — Aitor Lewkowycz and collaborators at Google Research, 2022
https://scholar.google.com/scholar?q=Solving+Quantitative+Reasoning+Problems+with+Language+Models
5. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI — Elliot Glazer and collaborators, 2024
https://scholar.google.com/scholar?q=FrontierMath:+A+Benchmark+for+Evaluating+Advanced+Mathematical+Reasoning+in+AI
6. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models — Aarohi Srivastava, et al. (BIG-bench collaboration), 2022
https://scholar.google.com/scholar?q=Beyond+the+Imitation+Game:+Quantifying+and+Extrapolating+the+Capabilities+of+Language+Models
7. Holistic Evaluation of Language Models — Percy Liang, Rishi Bommasani, Tony Lee, et al., 2022
https://scholar.google.com/scholar?q=Holistic+Evaluation+of+Language+Models
8. Dynabench: Rethinking Benchmarking in NLP — Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, and collaborators, 2021
https://scholar.google.com/scholar?q=Dynabench:+Rethinking+Benchmarking+in+NLP
9. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, et al., 2023
https://scholar.google.com/scholar?q=Judging+LLM-as-a-Judge+with+MT-Bench+and+Chatbot+Arena
10. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment — Yang Liu, Dan Iter, et al., 2023
https://scholar.google.com/scholar?q=G-Eval:+NLG+Evaluation+using+GPT-4+with+Better+Human+Alignment
11. Automatic Evaluation of Mathematical Proofs in Natural Language: A Survey — authors unclear, 2020-2024
https://scholar.google.com/scholar?q=Automatic+Evaluation+of+Mathematical+Proofs+in+Natural+Language:+A+Survey
12. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs — Albert Q. Jiang, Sean Welleck, et al., 2023
https://scholar.google.com/scholar?q=Draft,+Sketch,+and+Prove:+Guiding+Formal+Theorem+Provers+with+Informal+Proofs
13. Solving Olympiad Geometry without Human Demonstrations — Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, et al., 2024
https://scholar.google.com/scholar?q=Solving+Olympiad+Geometry+without+Human+Demonstrations
14. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models — Kaiyu Yang, Aidan Swope, et al., 2023
https://scholar.google.com/scholar?q=LeanDojo:+Theorem+Proving+with+Retrieval-Augmented+Language+Models
15. Humanity's Last Exam — Phan et al., 2025
https://scholar.google.com/scholar?q=Humanity's+Last+Exam
16. Gemini Deep Think at IMO 2025 — Luong and Lockhart, 2025
https://scholar.google.com/scholar?q=Gemini+Deep+Think+at+IMO+2025
17. Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination — authors unclear, 2025
https://scholar.google.com/scholar?q=Reasoning+or+Memorization?+Unreliable+Results+of+Reinforcement+Learning+Due+to+Data+Contamination
18. Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning — authors unclear, 2025
https://scholar.google.com/scholar?q=Right+Is+Not+Enough:+The+Pitfalls+of+Outcome+Supervision+in+Training+LLMs+for+Math+Reasoning
19. Improve Mathematical Reasoning in Language Models by Automated Process Supervision — authors unclear, 2025
https://scholar.google.com/scholar?q=Improve+Mathematical+Reasoning+in+Language+Models+by+Automated+Process+Supervision
20. MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision — authors unclear, 2025
https://scholar.google.com/scholar?q=MM-PRM:+Enhancing+Multimodal+Mathematical+Reasoning+with+Scalable+Step-Level+Supervision
21. Solving Inequality Proofs with Large Language Models — authors unclear, 2025
https://scholar.google.com/scholar?q=Solving+Inequality+Proofs+with+Large+Language+Models
22. Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning — authors unclear, 2025
https://scholar.google.com/scholar?q=Beyond+Gold+Standards:+Epistemic+Ensemble+of+LLM+Judges+for+Formal+Mathematical+Reasoning
23. A Survey on Deep Learning for Theorem Proving — authors unclear, recent
https://scholar.google.com/scholar?q=A+Survey+on+Deep+Learning+for+Theorem+Proving
24. Proving Theorems Recursively — authors unclear, 2025
https://scholar.google.com/scholar?q=Proving+Theorems+Recursively
25. DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning — authors unclear, 2025
https://scholar.google.com/scholar?q=DICE:+Detecting+In-distribution+Contamination+in+LLM's+Fine-tuning+Phase+for+Math+Reasoning
26. AI Post Transformers: Schoenfeld Theory Applied to Large Reasoning Models — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/schoenfeld-theory-applied-to-large-reasoning-models/
27. AI Post Transformers: LLM Benchmark Robustness to Linguistic Variation — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/llm-benchmark-robustness-to-linguistic-variation/
28. AI Post Transformers: Generalist Reward Modeling with Inference-Time Scaling — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/generalist-reward-modeling-with-inference-time-scaling/
29. AI Post Transformers: Evolving Language Models Without Labels: EVOL-RL — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/evolving-language-models-without-labels-evol-rl/
Interactive Visualization: IMO-Bench for Robust Mathematical Reasoning