This episode explores NVIDIA's Nemotron 3 white paper, which introduces a three-tier model family (Nano, Super, Ultra) built on a hybrid architecture combining Mamba-2 structured state space layers with Mixture-of-Experts routing. The family simultaneously targets state-of-the-art accuracy, a one-million-token context window, and substantially higher inference throughput than Transformer-based models of comparable size.

The discussion traces how Mamba-2's fixed-size recurrent state eliminates the KV cache's linear memory growth, the central bottleneck for long-context and agentic workloads, and explains NVIDIA's novel LatentMoE extension, which projects tokens into a reduced latent dimension before expert routing, cutting communication costs while activating more experts per token. Multi-Token Prediction, introduced by Meta FAIR, appears as a training accelerant: predicting multiple future tokens simultaneously improves both training efficiency and generation speed. NVFP4, the 4-bit floating-point training format NVIDIA used for Super and Ultra, raises open questions about numerical stability during post-training.

Listeners interested in how recent theoretical work on state space duality translates into a production-scale model family, or in the architectural tradeoffs that enable practical multi-agent pipelines at scale, will find the episode a concrete and technically grounded case study.
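To make the KV-cache argument concrete, here is a back-of-the-envelope comparison in Python. Every dimension below (layer count, KV heads, head size, SSM state size) is an illustrative placeholder rather than Nemotron 3's published configuration; the point is only the linear-versus-constant scaling.

```python
# Rough memory comparison: a Transformer's KV cache grows linearly with
# sequence length, while a Mamba-2-style layer carries a fixed-size
# recurrent state. All sizes here are illustrative assumptions.

def kv_cache_bytes(seq_len, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for the separate K and V tensors, stored per layer and per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

def ssm_state_bytes(n_layers=48, d_inner=8192, d_state=128, bytes_per=2):
    # One fixed (d_inner x d_state) state per layer, independent of length.
    return n_layers * d_inner * d_state * bytes_per

for n in (8_192, 131_072, 1_048_576):  # up to the one-million-token context
    print(f"{n:>9} tokens | KV cache {kv_cache_bytes(n) / 2**30:7.1f} GiB"
          f" | SSM state {ssm_state_bytes() / 2**30:5.2f} GiB")
```

At a million tokens this hypothetical cache reaches roughly 192 GiB per sequence while the recurrent state stays under 0.1 GiB; that gap is what the hybrid design exploits for long-context and multi-agent workloads.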
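The episode describes LatentMoE as projecting tokens into a smaller latent dimension before routing. A minimal sketch of that idea, assuming a standard top-k softmax router, follows; the class name, sizes, and the loop-based dispatch are simplifications for readability, not NVIDIA's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    """Route and run experts in a reduced latent space (hypothetical sketch)."""

    def __init__(self, d_model=4096, d_latent=1024, n_experts=64, top_k=8):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # token -> latent
        self.up = nn.Linear(d_latent, d_model, bias=False)     # latent -> token
        self.router = nn.Linear(d_latent, n_experts, bias=False)
        # Each expert is a small MLP operating in the cheap latent space.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent),
                          nn.SiLU(),
                          nn.Linear(4 * d_latent, d_latent))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                       # x: (n_tokens, d_model)
        z = self.down(x)                        # project before routing
        weights, idx = self.router(z).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(z)
        for k in range(self.top_k):             # plain loops for readability
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k, None] * self.experts[int(e)](z[mask])
        return self.up(out)                     # back to the model dimension
```

Because dispatch happens on d_latent-sized activations instead of d_model-sized ones, the all-to-all payload in an expert-parallel deployment shrinks by roughly d_model / d_latent (4x in this sketch), which is the budget that allows more experts to be activated per token.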
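Multi-Token Prediction (source 9) can likewise be sketched compactly: several output heads share one trunk, and head i predicts the token at offset i+1. The head design and the uniform loss weighting below are illustrative assumptions, not the paper's or Nemotron 3's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeadsSketch(nn.Module):
    """n_future linear heads over a shared trunk; head i predicts token t+1+i."""

    def __init__(self, d_model=1024, vocab_size=32000, n_future=4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def forward(self, hidden):                  # hidden: (batch, seq, d_model)
        return [head(hidden) for head in self.heads]

def mtp_loss(logits_per_head, tokens):
    # tokens: (batch, seq); head i is scored against tokens shifted by i+1.
    losses = []
    for i, logits in enumerate(logits_per_head):
        shift = i + 1
        pred = logits[:, :-shift].reshape(-1, logits.size(-1))
        target = tokens[:, shift:].reshape(-1)
        losses.append(F.cross_entropy(pred, target))
    return sum(losses) / len(losses)
```

The extra heads densify the training signal per forward pass, and at inference the same heads can serve as cheap draft predictors for speculative decoding (source 17), which is where the generation-speed benefit comes from.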
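Finally, the NVFP4 discussion is easier to follow with the mechanics of 4-bit floating point in view. The sketch below fake-quantizes values through the signed E2M1 grid with one scale per small block, which is the general shape of block-scaled FP4 schemes; the block size and the simple max-based scaling are assumptions, not NVIDIA's training kernels.

```python
import numpy as np

# All values representable by a signed E2M1 4-bit float.
E2M1_GRID = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                       0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0])

def fake_quantize_fp4(x, block=16):
    """Quantize-dequantize through 4-bit floats with one scale per block
    (a simplified, hypothetical stand-in for NVFP4)."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, block)
    # Scale each block so its largest magnitude lands on the FP4 maximum, 6.0.
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scale = np.where(scale == 0.0, 1.0, scale)
    scaled = x / scale                                   # (n_blocks, block)
    # Snap every value to the nearest representable FP4 value.
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    return (E2M1_GRID[idx] * scale).reshape(-1)

w = np.random.randn(64)
print(f"mean abs error: {np.abs(w - fake_quantize_fp4(w)).mean():.4f}")
```

With only fifteen distinct representable values per block, the quantization step is coarse; whether per-block rescaling preserves enough precision for stable post-training (RLHF, long-context fine-tuning) is exactly the open question the episode flags.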
Sources:
1. NVIDIA Nemotron 3: Efficient and Open Intelligence — NVIDIA (Aaron Blakeman, Aaron Grattafiori, Aarti Basant, et al.), 2025
http://arxiv.org/abs/2512.20856
2. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters — Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar (Google DeepMind), 2024
https://scholar.google.com/scholar?q=Scaling+LLM+Test-Time+Compute+Optimally+Can+be+More+Effective+than+Scaling+Model+Parameters
3. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI (Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, and many others), 2025
https://scholar.google.com/scholar?q=DeepSeek-R1:+Incentivizing+Reasoning+Capability+in+LLMs+via+Reinforcement+Learning
4. Thinking Tokens for Language Modeling — David Herel, Tomas Mikolov (Czech Institute of Informatics / FAIR), 2023
https://scholar.google.com/scholar?q=Thinking+Tokens+for+Language+Modeling
5. Training Large Language Models to Reason in a Continuous Latent Space (Coconut) — Shibo Hao, Sainbayar Sukhbaatar, Jason Weston, Yuandong Tian, Zhiting Hu (Meta FAIR), 2024
https://scholar.google.com/scholar?q=Training+Large+Language+Models+to+Reason+in+a+Continuous+Latent+Space+(Coconut)
6. Don't Think Too Much: Reasoning Budget Control for Large Reasoning Models — Xuan He, Zhenyu He, Qingyu Meng, Yonghao Zhong, Zhonglin Shi, Fandong Meng, Jie Zhou (Tencent AI Lab / WeChat AI), 2025
https://scholar.google.com/scholar?q=Don't+Think+Too+Much:+Reasoning+Budget+Control+for+Large+Reasoning+Models
7. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Dao, T. and Gu, A., 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
8. Jamba: A Hybrid Transformer-Mamba Language Model — Lieber et al., AI21 Labs, 2024
https://scholar.google.com/scholar?q=Jamba:+A+Hybrid+Transformer-Mamba+Language+Model
9. Better & Faster Large Language Models via Multi-Token Prediction — Gloeckle et al., 2024
https://scholar.google.com/scholar?q=Better+&+Faster+Large+Language+Models+via+Multi-Token+Prediction
10. RULER: What's the Real Context Size of Your Long-Context Language Models? — Hsieh et al., 2024
https://scholar.google.com/scholar?q=RULER:+What's+the+Real+Context+Size+of+Your+Long-Context+Language+Models?
11. DeepSeek-V3 Technical Report — DeepSeek-AI, 2025
https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report
12. Mixtral of Experts — Jiang et al., 2024
https://scholar.google.com/scholar?q=Mixtral+of+Experts
13. In-Context Learning with Long-Context Models: An In-Depth Exploration — Bertsch et al., 2024
https://scholar.google.com/scholar?q=In-Context+Learning+with+Long-Context+Models:+An+In-Depth+Exploration
14. Attn-QAT: 4-Bit Attention With Quantization-Aware Training — approximate, 2024–2025
https://scholar.google.com/scholar?q=Attn-QAT:+4-Bit+Attention+With+Quantization-Aware+Training
15. FP4 All the Way: Fully Quantized Training of LLMs — approximate, 2024–2025
https://scholar.google.com/scholar?q=FP4+All+the+Way:+Fully+Quantized+Training+of+LLMs
16. Optimizing Large Language Model Training Using FP4 Quantization — approximate, 2024–2025
https://scholar.google.com/scholar?q=Optimizing+Large+Language+Model+Training+Using+FP4+Quantization
17. Speculative Decoding and Beyond: An In-Depth Survey of Techniques — approximate, 2024–2025
https://scholar.google.com/scholar?q=Speculative+Decoding+and+Beyond:+An+In-Depth+Survey+of+Techniques
18. Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts — approximate, 2024–2025
https://scholar.google.com/scholar?q=Latent+Prototype+Routing:+Achieving+Near-Perfect+Load+Balancing+in+Mixture-of-Experts
19. AI Post Transformers: Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-04-2-papers-combined-217168.mp3
20. AI Post Transformers: Switch Transformers: Trillion Parameter Models with Sparsity — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Switch-Transformers-Trillion-Parameter-Models-with-Sparsity-e373fd3
21. AI Post Transformers: GLM-5: Transitioning from Vibe Coding to Agentic Engineering — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/GLM-5-Transitioning-from-Vibe-Coding-to-Agentic-Engineering-e3fbfls
22. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Kimi-Linear-Efficient-Expressive-Attention-Architecture-e3aclec
23. AI Post Transformers: TailorKV: Hybrid KV Cache Compression for LLMs — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/TailorKV-Hybrid-KV-Cache-Compression-for-LLMs-e38bmv2
24. AI Post Transformers: AWQ: On-Device LLM Compression and Acceleration — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/AWQ-On-Device-LLM-Compression-and-Acceleration-e388njr
25. AI Post Transformers: LFM2-8B-A1B: Efficient On-Device Mixture-of-Experts — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/LFM2-8B-A1B-Efficient-On-Device-Mixture-of-Experts-e3a2oh2
Interactive Visualization: NVIDIA Nemotron 3 Hybrid SSM Transformer Architecture