This episode examines the ∇-Reasoner paper (ICLR 2026), which proposes running gradient descent on token logits during inference: a first-order approach to test-time compute scaling that sets it apart from the sampling-based methods that dominate the field. The hosts contextualize the work against the established zeroth-order inference-time scaling landscape of Chain-of-Thought, Self-Consistency, Tree of Thoughts, and MCTS-based methods, all of which probe the reward landscape by sampling, without directional information. The core argument is that zeroth-order methods hit a hard ceiling on long-horizon reasoning tasks: the search space grows exponentially while reward signals remain sparse, so random sampling becomes increasingly futile. ∇-Reasoner sidesteps this by treating token logit vectors, normally ephemeral intermediate computations, as continuous optimization variables: it computes reward gradients with respect to the logits and nudges the distribution toward higher-reward outputs before committing to each token. Listeners interested in the mechanics of inference-time scaling and the theoretical limits of sampling-based reasoning will find this a technically dense, well-grounded discussion of a genuinely novel approach.
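For listeners who want the mechanism in concrete terms, here is a minimal sketch, in PyTorch, of the gradient-on-logits idea described above. It is an illustration of the general recipe, not the authors' implementation: the reward function, step count, learning rate, and greedy commit are all placeholder assumptions, and it presumes a HuggingFace-style model whose forward pass returns a .logits tensor.

```python
import torch
import torch.nn.functional as F

def gradient_nudged_step(model, input_ids, reward_fn, n_steps=5, lr=0.1):
    """Illustrative sketch: treat the next-token logits as a continuous
    optimization variable and take a few gradient-ascent steps on a
    differentiable reward before committing to a token.

    Assumptions (not from the paper): `model` is HuggingFace-style and
    returns `.logits`; `reward_fn` maps a soft token distribution of shape
    (batch, vocab) to a differentiable scalar reward.
    """
    with torch.no_grad():
        base_logits = model(input_ids).logits[:, -1, :]  # (batch, vocab)

    # The logits become a free variable instead of an ephemeral activation.
    z = base_logits.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([z], lr=lr)

    for _ in range(n_steps):
        optimizer.zero_grad()
        probs = F.softmax(z, dim=-1)   # soft, differentiable token choice
        loss = -reward_fn(probs)       # minimize negative reward = ascend reward
        loss.backward()
        optimizer.step()

    # Commit to a token from the nudged distribution (greedy here for brevity).
    next_token = z.detach().argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, next_token], dim=-1)
```

Repeating this at every decoding step trades extra backward passes for a reward-informed search direction, which is exactly the first-order versus zeroth-order distinction the episode dwells on.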
Sources:
1. ∇-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space — Peihao Wang, Ruisi Cai, Zhen Wang, Hongyuan Mei, Qiang Liu, Pan Li, Zhangyang Wang, 2026
http://arxiv.org/abs/2603.04948v1
2. Diffusion-LM Improves Controllable Text Generation — Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, Tatsunori Hashimoto, 2022
https://scholar.google.com/scholar?q=Diffusion-LM+Improves+Controllable+Text+Generation
3. GFlowNet-Guided LLM Decoding: Towards Diverse and Accurate Reasoning — Jianing Li et al., 2024
https://scholar.google.com/scholar?q=GFlowNet-Guided+LLM+Decoding:+Towards+Diverse+and+Accurate+Reasoning
4. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs — Ahmadian et al., 2024
https://scholar.google.com/scholar?q=Back+to+Basics:+Revisiting+REINFORCE-Style+Optimization+for+Learning+from+Human+Feedback+in+LLMs
5. Self-Consistency Improves Chain of Thought Reasoning in Language Models — Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou, 2022
https://scholar.google.com/scholar?q=Self-Consistency+Improves+Chain+of+Thought+Reasoning+in+Language+Models
6. Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 2023
https://scholar.google.com/scholar?q=Tree+of+Thoughts:+Deliberate+Problem+Solving+with+Large+Language+Models
7. Let's Verify Step by Step — Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, 2023
https://scholar.google.com/scholar?q=Let's+Verify+Step+by+Step
8. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters — Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar, 2024
https://scholar.google.com/scholar?q=Scaling+LLM+Test-Time+Compute+Optimally+Can+be+More+Effective+than+Scaling+Model+Parameters
9. ARGS: Alignment as Reward-Guided Search — Maxim Khanov, Jirayu Burapacheep, Yixuan Li, 2024
https://scholar.google.com/scholar?q=ARGS:+Alignment+as+Reward-Guided+Search
10. Controlled Decoding from Language Models — Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanpin Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami, 2024
https://scholar.google.com/scholar?q=Controlled+Decoding+from+Language+Models
11. AlphaCode 2 Technical Report — Google DeepMind AlphaCode Team, 2023
https://scholar.google.com/scholar?q=AlphaCode+2+Technical+Report
12. Training language models to follow instructions with human feedback — Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022
https://scholar.google.com/scholar?q=Training+language+models+to+follow+instructions+with+human+feedback
13. Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn, 2023
https://scholar.google.com/scholar?q=Direct+Preference+Optimization:+Your+Language+Model+is+Secretly+a+Reward+Model
14. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI (Daya Guo, Dejian Yang, Haowei Zhang, et al.), 2025
https://scholar.google.com/scholar?q=DeepSeek-R1:+Incentivizing+Reasoning+Capability+in+LLMs+via+Reinforcement+Learning
15. Learning to summarize from human feedback — Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020
https://scholar.google.com/scholar?q=Learning+to+summarize+from+human+feedback
16. Plug and Play Language Models: A Simple Approach to Controlled Text Generation — Dathathri et al., 2020
https://scholar.google.com/scholar?q=Plug+and+Play+Language+Models:+A+Simple+Approach+to+Controlled+Text+Generation
17. FUDGE: Controlled Text Generation with Future Discriminators — Yang and Klein, 2021
https://scholar.google.com/scholar?q=FUDGE:+Controlled+Text+Generation+with+Future+Discriminators
18. Soft Prompts: The Power of Scale for Parameter-Efficient Prompt Tuning — Lester et al., 2021
https://scholar.google.com/scholar?q=Soft+Prompts:+The+Power+of+Scale+for+Parameter-Efficient+Prompt+Tuning
19. The Generalization Gap in Offline Reinforcement Learning — Levine et al., 2020
https://scholar.google.com/scholar?q=The+Generalization+Gap+in+Offline+Reinforcement+Learning
20. Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization — 2025
https://scholar.google.com/scholar?q=Thinking+on+the+Fly:+Test-Time+Reasoning+Enhancement+via+Latent+Thought+Policy+Optimization
21. Logit arithmetic elicits long reasoning capabilities without training — 2025
https://scholar.google.com/scholar?q=Logit+arithmetic+elicits+long+reasoning+capabilities+without+training
22. Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations — 2025
https://scholar.google.com/scholar?q=Reinforcement+Learning+in+Inference+Time:+A+Perspective+from+Successive+Policy+Iterations
23. GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning — Yang et al., 2025
https://scholar.google.com/scholar?q=GenPRM:+Scaling+Test-Time+Compute+of+Process+Reward+Models+via+Generative+Reasoning
24. Process Reward Models That Think — 2025
https://scholar.google.com/scholar?q=Process+Reward+Models+That+Think
25. Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models — 2025
https://scholar.google.com/scholar?q=Efficient+Adaptive+Rejection+Sampling+for+Accelerating+Speculative+Decoding+in+Large+Language+Models
26. Inference-time alignment control for diffusion models with reinforcement learning guidance — 2025
https://scholar.google.com/scholar?q=Inference-time+alignment+control+for+diffusion+models+with+reinforcement+learning+guidance
27. AI Post Transformers: Test-Time Reinforcement Learning for LLMs — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Test-Time-Reinforcement-Learning-for-LLMs-e398hsk
28. AI Post Transformers: MetaScale: Test-Time Scaling with Evolving Meta-Thoughts — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/MetaScale-Test-Time-Scaling-with-Evolving-Meta-Thoughts-e36kgn7
29. AI Post Transformers: Process Reward Learning for LLM Reasoning Optimization — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Process-Reward-Learning-for-LLM-Reasoning-Optimization-e3dsuav
30. AI Post Transformers: Tree-based Group Policy Optimization for LLM Agents — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Tree-based-Group-Policy-Optimization-for-LLM-Agents-e38obfb
31. AI Post Transformers: MASA: Meta-Awareness via Self-Alignment Reinforcement Learning — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/MASA-Meta-Awareness-via-Self-Alignment-Reinforcement-Learning-e3a2of7
Interactive Visualization: Gradient Descent at Inference Time for LLM Reasoning