This episode explores TUMIX, a test-time scaling framework that turns a single strong language model into a team of specialized agents with different tool-use strategies: plain-text reasoning, code execution, search, and hybrids of the three. It explains the paper’s core argument that better reasoning may come not from repeatedly sampling one model, but from diversifying computational pathways and letting those agents iteratively refine one another’s answers under roughly cost-matched settings. The discussion situates TUMIX within prior work on inference-time compute, program-aided reasoning, and tool-using agents, while also probing whether the approach is genuinely novel or mostly a systems-level formalization of practices already emerging in industry. Listeners will find it interesting for its concrete framing of a major open question in AI: how to orchestrate tools and agent diversity to improve reasoning without exploding latency and cost.
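As a rough illustration of the loop described above (a toy sketch, not the paper's actual implementation), the idea is: heterogeneous agents answer the same question, each refinement round lets every agent see the others' candidate answers, and a final aggregation step picks the majority answer. All agent functions below are hypothetical stand-ins for LLM calls:

```python
from collections import Counter

def text_agent(question, peer_answers):
    # Stand-in for a plain-text chain-of-thought agent; deliberately
    # off by one so the refinement rounds have something to correct.
    base = 41
    if peer_answers:
        consensus, _ = Counter(peer_answers).most_common(1)[0]
        if consensus != base:
            return consensus  # defer to the peer consensus
    return base

def code_agent(question, peer_answers):
    # Stand-in for an agent that writes and executes code.
    # eval() is safe here only because the toy question is trusted input.
    return eval(question)  # e.g. "6 * 7" -> 42

def search_agent(question, peer_answers):
    # Stand-in for a search-grounded agent; here it simply "knows" the answer.
    return 42

def tumix_round_loop(question, agents, rounds=2):
    answers = []
    for _ in range(rounds):
        # Each round, every agent answers with the previous round's
        # candidate answers visible as shared context.
        answers = [agent(question, answers) for agent in agents]
    # Final aggregation: majority vote over the last round's answers.
    return Counter(answers).most_common(1)[0][0]

print(tumix_round_loop("6 * 7", [text_agent, code_agent, search_agent]))  # -> 42
```

In the first round the text agent answers 41 while the code and search agents answer 42; in the second round the text agent sees the peer majority and defers, so the final vote is unanimous. The real system replaces these stubs with tool-equipped LLM agents and a learned or heuristic answer-selection step.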
Sources:
1. TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture — Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li, Chuchu Fan, Chi Wang, Tomas Pfister, Jinsung Yoon, 2025
http://arxiv.org/abs/2510.01279
2. PAL: Program-aided Language Models — Gao et al., 2022
https://scholar.google.com/scholar?q=PAL:+Program-aided+Language+Models
3. ReAct: Synergizing Reasoning and Acting in Language Models — Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, 2023
https://scholar.google.com/scholar?q=ReAct:+Synergizing+Reasoning+and+Acting+in+Language+Models
4. Mixture-of-Agents Enhances Large Language Model Capabilities — Wang et al., 2024
https://scholar.google.com/scholar?q=Mixture-of-Agents+Enhances+Large+Language+Model+Capabilities
5. Search-Augmented Factuality in Language Models: Challenges and Opportunities for Retrieval-Grounded Generation — survey literature on retrieval-augmented and search-grounded generation, 2023–2025
https://scholar.google.com/scholar?q=Search-Augmented+Factuality+in+Language+Models:+Challenges+and+Opportunities+for+Retrieval-Grounded+Generation
6. Automatic Prompt Engineer — Zhou et al., 2022
https://scholar.google.com/scholar?q=Automatic+Prompt+Engineer
7. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines — Omar Khattab, Keshav Santhanam, and collaborators, 2023
https://scholar.google.com/scholar?q=DSPy:+Compiling+Declarative+Language+Model+Calls+into+Self-Improving+Pipelines
8. TextGrad: Automatic 'Differentiation' via Text — Yuksekgonul et al., 2024
https://scholar.google.com/scholar?q=TextGrad:+Automatic+'Differentiation'+via+Text
9. ADAS: Automated Design of Agentic Systems — Hu et al., 2024
https://scholar.google.com/scholar?q=ADAS:+Automated+Design+of+Agentic+Systems
10. Self-MoA — Li et al., 2025
https://scholar.google.com/scholar?q=Self-MoA
11. Symbolic-MoE — authors not identified, 2025
https://scholar.google.com/scholar?q=Symbolic-MoE
12. DEI — authors not identified, 2025
https://scholar.google.com/scholar?q=DEI
13. SciMaster — authors not identified, 2025
https://scholar.google.com/scholar?q=SciMaster
14. GSA — authors not identified, 2025
https://scholar.google.com/scholar?q=GSA
15. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Snell et al., 2024
https://scholar.google.com/scholar?q=Scaling+LLM+Test-Time+Compute+Optimally+can+be+More+Effective+than+Scaling+Model+Parameters
16. Language Models Can Solve Computer Tasks — Kim et al., 2023
https://scholar.google.com/scholar?q=Language+Models+Can+Solve+Computer+Tasks
17. Program-of-Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks — Chen et al., 2022
https://scholar.google.com/scholar?q=Program-of-Thoughts+Prompting:+Disentangling+Computation+from+Reasoning+for+Numerical+Reasoning+Tasks
18. Humanity's Last Exam — Phan et al., 2025
https://scholar.google.com/scholar?q=Humanity's+Last+Exam
19. GPQA: A Graduate-Level Google-Proof Q&A Benchmark — Rein et al., 2024
https://scholar.google.com/scholar?q=GPQA:+A+Graduate-Level+Google-Proof+Q&A+Benchmark
20. OpenAI/Gemini Deep Research comparison paper or report — Comanici et al., 2025
https://scholar.google.com/scholar?q=OpenAI/Gemini+Deep+Research+comparison+paper+or+report
21. DeepSeek-R1 or related RL reasoning paper — Guo et al., 2025
https://scholar.google.com/scholar?q=DeepSeek-R1+or+related+RL+reasoning+paper
22. Recent work showing Code Interpreter underuse in OpenAI models — Chen et al., 2024
https://scholar.google.com/scholar?q=Recent+work+showing+Code+Interpreter+underuse+in+OpenAI+models
23. Simple Test-Time Scaling — Muennighoff et al., 2025
https://scholar.google.com/scholar?q=Simple+Test-Time+Scaling
24. Faster and Better LLMs via Latency-Aware Test-Time Scaling — authors not identified, 2024–2025
https://scholar.google.com/scholar?q=Faster+and+Better+LLMs+via+Latency-Aware+Test-Time+Scaling
25. Thought Calibration: Efficient and Confident Test-Time Scaling — authors not identified, 2024–2025
https://scholar.google.com/scholar?q=Thought+Calibration:+Efficient+and+Confident+Test-Time+Scaling
26. Reasoning Aware Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling — authors not identified, 2024–2025
https://scholar.google.com/scholar?q=Reasoning+Aware+Self-Consistency:+Leveraging+Reasoning+Paths+for+Efficient+LLM+Sampling
27. Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning — authors not identified, 2024–2025
https://scholar.google.com/scholar?q=Latent+Self-Consistency+for+Reliable+Majority-Set+Selection+in+Short-+and+Long-Answer+Reasoning
28. Universal Self-Consistency for Large Language Model Generation — Chen et al., 2023
https://scholar.google.com/scholar?q=Universal+Self-Consistency+for+Large+Language+Model+Generation
29. The Hidden Strength of Disagreement: Unraveling the Consensus-Diversity Tradeoff in Adaptive Multi-Agent Systems — authors not identified, 2024–2025
https://scholar.google.com/scholar?q=The+Hidden+Strength+of+Disagreement:+Unraveling+the+Consensus-Diversity+Tradeoff+in+Adaptive+Multi-Agent+Systems
30. Stay Focused: Problem Drift in Multi-Agent Debate — authors not identified, 2024–2025
https://scholar.google.com/scholar?q=Stay+Focused:+Problem+Drift+in+Multi-Agent+Debate
31. Why Do Multi-Agent LLM Systems Fail? — authors not identified, 2024–2025
https://scholar.google.com/scholar?q=Why+Do+Multi-Agent+LLM+Systems+Fail?
32. LLM-Based Agents for Tool Learning: A Survey — W. Xu et al., 2024–2025
https://scholar.google.com/scholar?q=LLM-Based+Agents+for+Tool+Learning:+A+Survey
33. AI Post Transformers: Multiagent Debate Improves Language Model Reasoning — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/multiagent-debate-improves-language-model-reasoning/
34. AI Post Transformers: Experimental Comparison of Agentic and Enhanced RAG — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-14-experimental-comparison-of-agentic-and-e-37d8bc.mp3
35. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
36. AI Post Transformers: Generalist Reward Modeling with Inference-Time Scaling — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/generalist-reward-modeling-with-inference-time-scaling/
Interactive Visualization: TUMIX Multi-Agent Test-Time Scaling with Tools