This episode explores a paper claiming that reinforcement-learning post-training can produce large math-reasoning gains in 7B–8B instruction-tuned models while updating as few as 13 parameters through a TinyLoRA setup. The discussion explains how this differs from standard LoRA and full fine-tuning, why the result matters for ideas like intrinsic dimension, and why it may suggest that RL steers latent capabilities already present in pretrained models rather than teaching entirely new knowledge. It also contrasts supervised fine-tuning against RL with verifiable rewards, arguing that on benchmarks like GSM8K, AIME, AMC, and MATH500, RL may chiefly improve behaviors such as search, persistence, and token allocation. Listeners should find it interesting because it probes whether headline-grabbing “reasoning” gains are genuine evidence of new reasoning ability or a surprisingly cheap way to better elicit and control capabilities models already have.
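For intuition about how a 13-parameter update can move a 7B model at all, here is a minimal sketch of the tiny-trainable-subspace idea, assuming a hypothetical parameterization in the spirit of intrinsic-dimension fine-tuning, LoRA-XS, and VeRA: a handful of trainable scalars are projected through fixed random matrices into a low-rank weight update while the pretrained weights stay frozen. The class name, rank, and projection scheme below are illustrative assumptions, not the paper's actual TinyLoRA implementation.

```python
# Minimal sketch (PyTorch) of a tiny trainable subspace: only 13 scalars are
# learned, and fixed random projections expand them into a low-rank update.
# This is an assumed parameterization for illustration, not the paper's code.
import torch
import torch.nn as nn


class TinySubspaceLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, n_trainable: int = 13):
        super().__init__()
        d_out, d_in = base.weight.shape
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad_(False)
        # Frozen random low-rank factors that define the update's direction basis.
        self.register_buffer("A", torch.randn(d_out, rank) / rank ** 0.5)
        self.register_buffer("B", torch.randn(rank, d_in) / d_in ** 0.5)
        # Frozen projection from the tiny trainable vector to the rank x rank core.
        self.register_buffer("P", torch.randn(rank * rank, n_trainable) / n_trainable ** 0.5)
        # The ONLY trainable parameters: 13 scalars.
        self.theta = nn.Parameter(torch.zeros(n_trainable))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rank = self.A.shape[1]
        core = (self.P @ self.theta).view(rank, rank)   # 13 scalars -> r x r core
        delta_w = self.A @ core @ self.B                 # low-rank weight update
        return self.base(x) + x @ delta_w.T


layer = TinySubspaceLinear(nn.Linear(64, 64))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # -> 13
```

Under a parameterization like this, RL only has to find a good point in a 13-dimensional subspace, which is why the episode connects the result to intrinsic dimension and to steering existing capabilities rather than learning new knowledge.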
Sources:
1. Learning to Reason in 13 Parameters — John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, Saeed Mahloujifar, 2026
http://arxiv.org/abs/2602.04118
2. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou, 2022
https://scholar.google.com/scholar?q=Chain-of-Thought+Prompting+Elicits+Reasoning+in+Large+Language+Models
3. STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning — Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah Goodman, Percy Liang, 2022
https://scholar.google.com/scholar?q=STaR:+Self-Taught+Reasoner+Bootstrapping+Reasoning+With+Reasoning
4. Let’s Verify Step by Step — Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, 2023
https://scholar.google.com/scholar?q=Let’s+Verify+Step+by+Step
5. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI, 2025
https://scholar.google.com/scholar?q=DeepSeek-R1:+Incentivizing+Reasoning+Capability+in+LLMs+via+Reinforcement+Learning
6. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, et al., 2021
https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models
7. LoRA-XS — Bałazy et al., 2025
https://scholar.google.com/scholar?q=LoRA-XS
8. The Intrinsic Dimension of Objective Landscapes — Chunyuan Li, Heerad Farkhoor, Rosanne Liu, Jason Yosinski, 2018
https://scholar.google.com/scholar?q=The+Intrinsic+Dimension+of+Objective+Landscapes
9. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning — Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta, 2020
https://scholar.google.com/scholar?q=Intrinsic+Dimensionality+Explains+the+Effectiveness+of+Language+Model+Fine-Tuning
10. VeRA — Kopiczko et al., 2023
https://scholar.google.com/scholar?q=VeRA
11. VB-LoRA — Li et al., 2024
https://scholar.google.com/scholar?q=VB-LoRA
12. AdaLoRA — Qingru Zhang, Minshuo Chen, Alexander Bukharin, et al., 2023
https://scholar.google.com/scholar?q=AdaLoRA
13. Prompt Tuning — Brian Lester, Rami Al-Rfou, Noah Constant, 2021
https://scholar.google.com/scholar?q=Prompt+Tuning
14. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li, Percy Liang, 2021
https://scholar.google.com/scholar?q=Prefix-Tuning:+Optimizing+Continuous+Prompts+for+Generation
15. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models — Elad Ben Zaken, Yoav Goldberg, Shauli Ravfogel, 2022
https://scholar.google.com/scholar?q=BitFit:+Simple+Parameter-efficient+Fine-tuning+for+Transformer-based+Masked+Language-models
16. OpenAI o1 / Learning to Reason with Reinforcement Learning — OpenAI, 2024
https://scholar.google.com/scholar?q=OpenAI+o1+/+Learning+to+Reason+with+Reinforcement+Learning
17. DeepSeek-R1 / Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — Shao et al., 2024
https://scholar.google.com/scholar?q=DeepSeek-R1+/+Incentivizing+Reasoning+Capability+in+LLMs+via+Reinforcement+Learning
18. One Example Is Enough: Learning to Reason from Single Demonstrations with RL — Wang et al., 2025
https://scholar.google.com/scholar?q=One+Example+Is+Enough:+Learning+to+Reason+from+Single+Demonstrations+with+RL
19. A Thousand Examples Are Enough: Data-efficient SFT for Reasoning — Ye et al., 2025
https://scholar.google.com/scholar?q=A+Thousand+Examples+Are+Enough:+Data-efficient+SFT+for+Reasoning
20. DoRA: Weight-Decomposed Low-Rank Adaptation — Liu et al., 2024
https://scholar.google.com/scholar?q=DoRA+/+Weight-Decomposed+Low-Rank+Adaptation
21. Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning — author list not confirmed, 2025–2026
https://scholar.google.com/scholar?q=Beyond+Two-Stage+Training+/+Beyond+two-stage+training:+Cooperative+SFT+and+RL+for+LLM+reasoning
22. Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning — author list not confirmed, 2025–2026
https://scholar.google.com/scholar?q=Beyond+Outcome+Verification:+Verifiable+Process+Reward+Models+for+Structured+Reasoning
23. RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents — author list not confirmed, 2025–2026
https://scholar.google.com/scholar?q=RLVMR:+Reinforcement+Learning+with+Verifiable+Meta-Reasoning+Rewards+for+Robust+Long-Horizon+Agents
24. X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics and Molecular Design — author list not confirmed, 2024–2025
https://scholar.google.com/scholar?q=X-LoRA:+Mixture+of+Low-Rank+Adapter+Experts,+a+Flexible+Framework+for+Large+Language+Models+with+Applications+in+Protein+Mechanics+and+Molecular+Design
25. Task-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases — author list not confirmed, 2025–2026
https://scholar.google.com/scholar?q=Task-Aware+LoRA+Adapter+Composition+via+Similarity+Retrieval+in+Vector+Databases
26. AI Post Transformers: NeurIPS 2025: Reinforcement Learning for Reasoning in Large Language Models with One Training Example — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/neurips-2025-reinforcement-learning-for-reasoning-in-large-language-models-with/
27. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
28. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
29. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
30. AI Post Transformers: Simple Self-Distillation for Better Code Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-02-simple-self-distillation-for-better-code-cc88e0.mp3
Interactive Visualization: Learning to Reason in 13 Parameters