This episode examines the ∇-Reasoner paper (ICLR 2026), which proposes running gradient descent on token logits during inference: a first-order approach to test-time compute scaling that sets it apart from the sampling-based methods that dominate the field. The hosts contextualize the work against the established zeroth-order inference-time scaling landscape of Chain-of-Thought, Self-Consistency, Tree of Thoughts, and MCTS-based methods, all of which probe the reward landscape by sampling, without directional information. The core argument is that zeroth-order methods hit a hard ceiling on long-horizon reasoning tasks: the search space grows exponentially while reward signals remain sparse, so random sampling becomes increasingly futile. ∇-Reasoner sidesteps this by treating token logit vectors, normally ephemeral intermediate computations, as continuous optimization variables: it computes reward gradients with respect to the logits and nudges the distribution toward higher-reward outputs before committing to each token. Listeners interested in the mechanics of inference-time scaling and the theoretical limits of sampling-based reasoning will find this a technically dense, well-grounded discussion of a genuinely novel approach.
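For listeners who want the mechanism in concrete terms, here is a minimal sketch, in PyTorch, of the gradient-on-logits idea described above. It is an illustration of the general recipe, not the authors' implementation: the reward function, step count, learning rate, and greedy commit are all placeholder assumptions, and it presumes a HuggingFace-style model whose forward pass returns a .logits tensor.

```python
import torch
import torch.nn.functional as F

def gradient_nudged_step(model, input_ids, reward_fn, n_steps=5, lr=0.1):
    """Illustrative sketch: treat the next-token logits as a continuous
    optimization variable and take a few gradient-ascent steps on a
    differentiable reward before committing to a token.

    Assumptions (not from the paper): `model` is HuggingFace-style and
    returns `.logits`; `reward_fn` maps a soft token distribution of shape
    (batch, vocab) to a differentiable scalar reward.
    """
    with torch.no_grad():
        base_logits = model(input_ids).logits[:, -1, :]  # (batch, vocab)

    # The logits become a free variable instead of an ephemeral activation.
    z = base_logits.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([z], lr=lr)

    for _ in range(n_steps):
        optimizer.zero_grad()
        probs = F.softmax(z, dim=-1)   # soft, differentiable token choice
        loss = -reward_fn(probs)       # minimize negative reward = ascend reward
        loss.backward()
        optimizer.step()

    # Commit to a token from the nudged distribution (greedy here for brevity).
    next_token = z.detach().argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, next_token], dim=-1)
```

Repeating this at every decoding step trades extra backward passes for a reward-informed search direction, which is exactly the first-order versus zeroth-order distinction the episode dwells on.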
Sources:
1. ∇-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space — Peihao Wang, Ruisi Cai, Zhen Wang, Hongyuan Mei, Qiang Liu, Pan Li, Zhangyang Wang, 2026
http://arxiv.org/abs/2603.04948v1
2. Diffusion-LM Improves Controllable Text Generation — Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, Tatsunori Hashimoto, 2022
https://scholar.google.com/scholar?q=Diffusion-LM+Improves+Controllable+Text+Generation
3. GFlowNet-Guided LLM Decoding: Towards Diverse and Accurate Reasoning — Jianing Li et al., 2024
https://scholar.google.com/scholar?q=GFlowNet-Guided+LLM+Decoding:+Towards+Diverse+and+Accurate+Reasoning
4. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs — Ahmadian et al., 2024
https://scholar.google.com/scholar?q=Back+to+Basics:+Revisiting+REINFORCE-Style+Optimization+for+Learning+from+Human+Feedback+in+LLMs
5. Self-Consistency Improves Chain of Thought Reasoning in Language Models — Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou, 2022
https://scholar.google.com/scholar?q=Self-Consistency+Improves+Chain+of+Thought+Reasoning+in+Language+Models
6. Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 2023
https://scholar.google.com/scholar?q=Tree+of+Thoughts:+Deliberate+Problem+Solving+with+Large+Language+Models
7. Let's Verify Step by Step — Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, 2023
https://scholar.google.com/scholar?q=Let's+Verify+Step+by+Step
8. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters — Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar, 2024
https://scholar.google.com/scholar?q=Scaling+LLM+Test-Time+Compute+Optimally+Can+be+More+Effective+than+Scaling+Model+Parameters
9. ARGS: Alignment as Reward-Guided Search — Maxim Khanov, Jirayu Burapacheep, Yixuan Li, 2024
https://scholar.google.com/scholar?q=ARGS:+Alignment+as+Reward-Guided+Search
10. Controlled Decoding from Language Models — Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanpin Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami, 2024
https://scholar.google.com/scholar?q=Controlled+Decoding+from+Language+Models
11. AlphaCode 2 Technical Report — Google DeepMind AlphaCode Team, 2023
https://scholar.google.com/scholar?q=AlphaCode+2+Technical+Report
12. Training language models to follow instructions with human feedback — Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022
https://scholar.google.com/scholar?q=Training+language+models+to+follow+instructions+with+human+feedback
13. Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn, 2023
https://scholar.google.com/scholar?q=Direct+Preference+Optimization:+Your+Language+Model+is+Secretly+a+Reward+Model
14. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI (Daya Guo, Dejian Yang, Haowei Zhang, et al.), 2025
https://scholar.google.com/scholar?q=DeepSeek-R1:+Incentivizing+Reasoning+Capability+in+LLMs+via+Reinforcement+Learning
15. Learning to summarize from human feedback — Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020
https://scholar.google.com/scholar?q=Learning+to+summarize+from+human+feedback
16. Plug and Play Language Models: A Simple Approach to Controlled Text Generation — Dathathri et al., 2020
https://scholar.google.com/scholar?q=Plug+and+Play+Language+Models:+A+Simple+Approach+to+Controlled+Text+Generation
17. FUDGE: Controlled Text Generation with Future Discriminators — Yang and Klein, 2021
https://scholar.google.com/scholar?q=FUDGE:+Controlled+Text+Generation+with+Future+Discriminators
18. Soft Prompts: The Power of Scale for Parameter-Efficient Prompt Tuning — Lester et al., 2021
https://scholar.google.com/scholar?q=Soft+Prompts:+The+Power+of+Scale+for+Parameter-Efficient+Prompt+Tuning
19. The Generalization Gap in Offline Reinforcement Learning — Levine et al., 2020
https://scholar.google.com/scholar?q=The+Generalization+Gap+in+Offline+Reinforcement+Learning
20. Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization — 2025
https://scholar.google.com/scholar?q=Thinking+on+the+Fly:+Test-Time+Reasoning+Enhancement+via+Latent+Thought+Policy+Optimization
21. Logit arithmetic elicits long reasoning capabilities without training — 2025
https://scholar.google.com/scholar?q=Logit+arithmetic+elicits+long+reasoning+capabilities+without+training
22. Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations — 2025
https://scholar.google.com/scholar?q=Reinforcement+Learning+in+Inference+Time:+A+Perspective+from+Successive+Policy+Iterations
23. GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning — Yang et al., 2025
https://scholar.google.com/scholar?q=GenPRM:+Scaling+Test-Time+Compute+of+Process+Reward+Models+via+Generative+Reasoning
24. Process Reward Models That Think — 2025
https://scholar.google.com/scholar?q=Process+Reward+Models+That+Think
25. Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models — 2025
https://scholar.google.com/scholar?q=Efficient+Adaptive+Rejection+Sampling+for+Accelerating+Speculative+Decoding+in+Large+Language+Models
26. Inference-time alignment control for diffusion models with reinforcement learning guidance — 2025
https://scholar.google.com/scholar?q=Inference-time+alignment+control+for+diffusion+models+with+reinforcement+learning+guidance
27. AI Post Transformers: Test-Time Reinforcement Learning for LLMs — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Test-Time-Reinforcement-Learning-for-LLMs-e398hsk
28. AI Post Transformers: MetaScale: Test-Time Scaling with Evolving Meta-Thoughts — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/MetaScale-Test-Time-Scaling-with-Evolving-Meta-Thoughts-e36kgn7
29. AI Post Transformers: Process Reward Learning for LLM Reasoning Optimization — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Process-Reward-Learning-for-LLM-Reasoning-Optimization-e3dsuav
30. AI Post Transformers: Tree-based Group Policy Optimization for LLM Agents — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Tree-based-Group-Policy-Optimization-for-LLM-Agents-e38obfb
31. AI Post Transformers: MASA: Meta-Awareness via Self-Alignment Reinforcement Learning — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/MASA-Meta-Awareness-via-Self-Alignment-Reinforcement-Learning-e3a2of7
Interactive Visualization: Gradient Descent at Inference Time for LLM Reasoning