Hal Turing and Dr. Ada Shannon examine Jet-Nemotron as a serious but narrow attempt to retrofit long-context efficiency into a pretrained dense Transformer rather than as a clean-sheet architectural revolution. They focus on NVIDIA’s PostNAS pipeline, which freezes the MLP pathway, treats the attention layers as the remodel zone, and searches for the layers where full attention is still worth its cost and those where cheaper JetBlocks can replace it. The discussion keeps returning to the real question behind the paper’s marketing: whether this is evidence that linear-attention-style hybrids can genuinely change inference scaling and KV-cache pressure, or whether it is a carefully engineered optimization for a constrained deployment target that inherits most of its intelligence from the original dense model.
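For listeners who want the mechanics, here is a minimal sketch of what that kind of hybrid stack looks like. It is an illustration under assumptions, not the released Jet-Nemotron code: the layer count, dimensions, and the `placement` mask below are made up. The point is that MLP weights stay frozen while each layer's sequence mixer is either full softmax attention or a cheap linear-attention stand-in, depending on the placement decision.

```python
# Illustrative PostNAS-style hybrid stack (sketch only, not the released Jet-Nemotron code).
# The MLP pathway is frozen; each layer's sequence mixer is either full softmax attention
# or a linear-attention block, chosen by a per-layer placement mask.
import torch
import torch.nn as nn


class LinearAttention(nn.Module):
    """Toy non-causal linear-attention mixer: O(n) in sequence length, constant-size state."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        q, k, v = torch.relu(self.q(x)), torch.relu(self.k(x)), self.v(x)
        kv = torch.einsum("bnd,bne->bde", k, v)              # fixed-size state, no growing KV cache
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return self.out(torch.einsum("bnd,bde,bn->bne", q, kv, z))


class HybridLayer(nn.Module):
    def __init__(self, d_model: int, keep_full_attention: bool):
        super().__init__()
        self.keep_full = keep_full_attention
        if keep_full_attention:
            self.mixer = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        else:
            self.mixer = LinearAttention(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        for p in self.mlp.parameters():                      # frozen MLP pathway, as in PostNAS
            p.requires_grad = False

    def forward(self, x):
        if self.keep_full:
            a, _ = self.mixer(x, x, x, need_weights=False)
        else:
            a = self.mixer(x)
        h = x + a
        return h + self.mlp(h)


# Placement mask found by the search: keep full attention only where it still pays for itself.
placement = [True, False, False, True, False, False, False, False]     # illustrative
model = nn.Sequential(*[HybridLayer(512, keep) for keep in placement])
print(model(torch.randn(2, 16, 512)).shape)                             # torch.Size([2, 16, 512])
```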
The episode makes the contrast with Nemotron 3 explicit. In the earlier Nemotron 3 story, the architectural pitch was a broader hybrid stack built around the interplay of dense Transformer machinery, mixture-of-experts routing, and state-space or recurrent-style efficiency ideas. Jet-Nemotron is different in both method and claim: it is not mainly about MoE capacity or an SSM-flavored redesign, but about post-training surgery on the attention stack itself, with layer placement search deciding where exact global lookup remains indispensable and where linear-style blocks can take over. That makes Jet-Nemotron feel less like a new foundation model family and more like a practical conversion recipe, which the hosts treat as both the paper’s most credible contribution and its main limitation.
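As a rough illustration of what a layer placement search does, the loop below is a hedged sketch under simple assumptions, not the paper's actual PostNAS search procedure; `score_fn` stands in for whatever proxy quality measure the search would use. The idea is a budgeted greedy pass: start all-linear, then restore full attention at the layers where doing so buys back the most quality.

```python
# Hedged sketch of a budgeted layer-placement search (not the paper's actual procedure).
# With a budget of `budget` full-attention layers, greedily keep full attention at the
# layers where swapping the linear block back out recovers the most proxy quality.
def search_placement(num_layers: int, budget: int, score_fn) -> list[bool]:
    """score_fn(mask) -> proxy quality for a boolean keep-full-attention mask."""
    placement = [False] * num_layers                       # start with all-linear layers
    for _ in range(budget):
        base = score_fn(placement)
        best_gain, best_idx = 0.0, None
        for i in range(num_layers):
            if placement[i]:
                continue
            trial = placement.copy()
            trial[i] = True                                 # try restoring full attention here
            gain = score_fn(trial) - base
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:                                # nothing left that helps
            break
        placement[best_idx] = True
    return placement


# Toy usage: pretend layers 1 and 5 are the only ones where exact global lookup matters.
weights = [0.0, 0.9, 0.1, 0.0, 0.1, 0.8, 0.0, 0.0]          # made-up per-layer importance
toy_score = lambda mask: sum(w for w, keep in zip(weights, mask) if keep)
print(search_placement(num_layers=8, budget=2, score_fn=toy_score))
# -> full attention kept at layers 1 and 5, linear blocks everywhere else
```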
They also place Jet-Nemotron directly against Kimi Linear and the broader efficient-LLM landscape. Both papers take linear attention seriously as a way to attack long-context serving bottlenecks, but the comparison here is not flattering by default: Kimi Linear looked more like a direct argument for a new sequence-mixing primitive, while Jet-Nemotron looks more convincing as an engineering workflow for salvaging pretrained dense checkpoints without retraining everything from scratch. The hosts parse where the similarities end, where the quality-preservation story still depends on keeping some full-attention layers alive, and why that matters for judging whether linear attention is becoming a real architectural shift or remains a selective compromise that works best when a dense Transformer still anchors the system.
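One way to see why the retained full-attention layers are the crux of the serving story: per-request KV-cache memory scales with how many layers still do exact attention. The numbers below are a back-of-envelope illustration with made-up model dimensions, not measurements from either paper.

```python
# Back-of-envelope KV-cache footprint (illustrative parameters, not figures from the papers).
# Only layers that keep full attention accumulate a KV cache that grows with sequence length;
# linear-attention layers carry a constant-size state instead.
def kv_cache_gib(full_attn_layers: int, kv_heads: int = 8, head_dim: int = 128,
                 seq_len: int = 64_000, bytes_per_elem: int = 2) -> float:
    keys_and_values = 2
    total = keys_and_values * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total / 2**30

print(f"all 36 layers full attention: {kv_cache_gib(36):.1f} GiB per sequence")   # ~8.8 GiB
print(f"only 4 layers kept by search: {kv_cache_gib(4):.1f} GiB per sequence")    # ~1.0 GiB
```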
Sources:
1. Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search — Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai, 2025
http://arxiv.org/abs/2508.15884
2. Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models — Soham De, Samuel L. Smith, Aleksandar Botev, Albert Gu, Caglar Gulcehre and collaborators, 2024
https://scholar.google.com/scholar?q=Griffin:+Mixing+Gated+Linear+Recurrences+with+Local+Attention+for+Efficient+Language+Models
3. Zamba: A Compact 7B SSM Hybrid Model — Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Beren Millidge and collaborators, 2024
https://scholar.google.com/scholar?q=Zamba:+A+Compact+7B+SSM+Hybrid+Model
4. Hymba: A Hybrid-head Architecture for Small Language Models — Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Pavlo Molchanov and collaborators, 2025
https://scholar.google.com/scholar?q=Hymba:+A+Hybrid-head+Architecture+for+Small+Language+Models
5. Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models — NVIDIA et al. (including Aaron Blakeman, Song Han, Jan Kautz and collaborators), 2025
https://scholar.google.com/scholar?q=Nemotron-H:+A+Family+of+Accurate+and+Efficient+Hybrid+Mamba-Transformer+Models
6. Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search — Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai, 2025
https://scholar.google.com/scholar?q=Jet-Nemotron:+Efficient+Language+Model+with+Post+Neural+Architecture+Search
7. Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction — Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, Yusuke Oishi, 2026
https://scholar.google.com/scholar?q=Distill-then-Replace:+Efficient+Task-Specific+Hybrid+Attention+Model+Construction
8. The Zamba2 Suite: Technical Report — Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, Beren Millidge, 2024
https://scholar.google.com/scholar?q=The+Zamba2+Suite:+Technical+Report
9. RecurrentGemma: Moving Past Transformers for Efficient Open Language Models — Aleksandar Botev, Soham De, Samuel L. Smith, Anushan Fernando, George-Cristian Muraru and collaborators, 2024
https://scholar.google.com/scholar?q=RecurrentGemma:+Moving+Past+Transformers+for+Efficient+Open+Language+Models
10. Zoology: Measuring and Improving Recall in Efficient Language Models — Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Re, 2023
https://scholar.google.com/scholar?q=Zoology:+Measuring+and+Improving+Recall+in+Efficient+Language+Models
11. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
12. Eigen Attention: Attention in Low-Rank Space for KV Cache Compression — 2024/2025
https://scholar.google.com/scholar?q=Eigen+Attention:+Attention+in+Low-Rank+Space+for+KV+Cache+Compression
13. ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering — 2024/2025
https://scholar.google.com/scholar?q=ClusterAttn:+KV+Cache+Compression+under+Intrinsic+Attention+Clustering
14. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution — 2024/2025
https://scholar.google.com/scholar?q=Expected+Attention:+KV+Cache+Compression+by+Estimating+Attention+from+Future+Queries+Distribution
15. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning — 2024/2025
https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
16. Scaling Linear Attention with Sparse State Expansion — 2024/2025
https://scholar.google.com/scholar?q=Scaling+Linear+Attention+with+Sparse+State+Expansion
17. AI Post Transformers: Jet-Nemotron and Post-Pretraining Model Acceleration — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-post-pretraining-model-4ba5cb.mp3
18. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
19. AI Post Transformers: Dr.LLM: Dynamic Layer Routing in LLMs — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/drllm-dynamic-layer-routing-in-llms/
20. AI Post Transformers: Speed Always Wins: Efficient Large Language Model Architectures — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/speed-always-wins-efficient-large-language-model-architectures/
21. AI Post Transformers: LAQ for Smarter KV Cache Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-23-laq-for-smarter-kv-cache-eviction-3ea2b8.mp3
Interactive Visualization: Jet-Nemotron and PostNAS for Faster Long Context