Hal Turing and Dr. Ada Shannon examine Jet-Nemotron as a serious but narrow attempt to retrofit long-context efficiency into a pretrained dense Transformer rather than as a clean-sheet architectural revolution. They focus on NVIDIA’s PostNAS pipeline, which freezes the MLP pathway, treats the attention layers as the remodel zone, and searches for the layers where full attention is still worth its cost and those where cheaper JetBlocks can replace it. The discussion keeps returning to the real question behind the paper’s marketing: whether this is evidence that linear-attention-style hybrids can genuinely change inference scaling and KV-cache pressure, or whether it is a carefully engineered optimization for a constrained deployment target that inherits most of its intelligence from the original dense model.
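For listeners who want the mechanics, here is a minimal sketch of what that kind of hybrid stack looks like. It is an illustration under assumptions, not the released Jet-Nemotron code: the layer count, dimensions, and the `placement` mask below are made up. The point is that MLP weights stay frozen while each layer's sequence mixer is either full softmax attention or a cheap linear-attention stand-in, depending on the placement decision.

```python
# Illustrative PostNAS-style hybrid stack (sketch only, not the released Jet-Nemotron code).
# The MLP pathway is frozen; each layer's sequence mixer is either full softmax attention
# or a linear-attention block, chosen by a per-layer placement mask.
import torch
import torch.nn as nn


class LinearAttention(nn.Module):
    """Toy non-causal linear-attention mixer: O(n) in sequence length, constant-size state."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        q, k, v = torch.relu(self.q(x)), torch.relu(self.k(x)), self.v(x)
        kv = torch.einsum("bnd,bne->bde", k, v)              # fixed-size state, no growing KV cache
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return self.out(torch.einsum("bnd,bde,bn->bne", q, kv, z))


class HybridLayer(nn.Module):
    def __init__(self, d_model: int, keep_full_attention: bool):
        super().__init__()
        self.keep_full = keep_full_attention
        if keep_full_attention:
            self.mixer = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        else:
            self.mixer = LinearAttention(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        for p in self.mlp.parameters():                      # frozen MLP pathway, as in PostNAS
            p.requires_grad = False

    def forward(self, x):
        if self.keep_full:
            a, _ = self.mixer(x, x, x, need_weights=False)
        else:
            a = self.mixer(x)
        h = x + a
        return h + self.mlp(h)


# Placement mask found by the search: keep full attention only where it still pays for itself.
placement = [True, False, False, True, False, False, False, False]     # illustrative
model = nn.Sequential(*[HybridLayer(512, keep) for keep in placement])
print(model(torch.randn(2, 16, 512)).shape)                             # torch.Size([2, 16, 512])
```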
The episode makes the contrast with Nemotron 3 explicit. In the earlier Nemotron 3 story, the architectural pitch was a broader hybrid stack built around the interplay of dense Transformer machinery, mixture-of-experts routing, and state-space or recurrent-style efficiency ideas. Jet-Nemotron is different in both method and claim: it is not mainly about MoE capacity or an SSM-flavored redesign, but about post-training surgery on the attention stack itself, with layer placement search deciding where exact global lookup remains indispensable and where linear-style blocks can take over. That makes Jet-Nemotron feel less like a new foundation model family and more like a practical conversion recipe, which the hosts treat as both the paper’s most credible contribution and its main limitation.
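As a rough illustration of what a layer placement search does, the loop below is a hedged sketch under simple assumptions, not the paper's actual PostNAS search procedure; `score_fn` stands in for whatever proxy quality measure the search would use. The idea is a budgeted greedy pass: start all-linear, then restore full attention at the layers where doing so buys back the most quality.

```python
# Hedged sketch of a budgeted layer-placement search (not the paper's actual procedure).
# With a budget of `budget` full-attention layers, greedily keep full attention at the
# layers where swapping the linear block back out recovers the most proxy quality.
def search_placement(num_layers: int, budget: int, score_fn) -> list[bool]:
    """score_fn(mask) -> proxy quality for a boolean keep-full-attention mask."""
    placement = [False] * num_layers                       # start with all-linear layers
    for _ in range(budget):
        base = score_fn(placement)
        best_gain, best_idx = 0.0, None
        for i in range(num_layers):
            if placement[i]:
                continue
            trial = placement.copy()
            trial[i] = True                                 # try restoring full attention here
            gain = score_fn(trial) - base
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:                                # nothing left that helps
            break
        placement[best_idx] = True
    return placement


# Toy usage: pretend layers 1 and 5 are the only ones where exact global lookup matters.
weights = [0.0, 0.9, 0.1, 0.0, 0.1, 0.8, 0.0, 0.0]          # made-up per-layer importance
toy_score = lambda mask: sum(w for w, keep in zip(weights, mask) if keep)
print(search_placement(num_layers=8, budget=2, score_fn=toy_score))
# -> full attention kept at layers 1 and 5, linear blocks everywhere else
```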
They also place Jet-Nemotron directly against Kimi Linear and the broader efficient-LLM landscape. Both papers take linear attention seriously as a way to attack long-context serving bottlenecks, but the comparison here is not flattering by default: Kimi Linear looked more like a direct argument for a new sequence-mixing primitive, while Jet-Nemotron looks more convincing as an engineering workflow for salvaging pretrained dense checkpoints without retraining everything from scratch. The hosts parse where the similarities end, where the quality-preservation story still depends on keeping some full-attention layers alive, and why that matters for judging whether linear attention is becoming a real architectural shift or remains a selective compromise that works best when a dense Transformer still anchors the system.
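One way to see why the retained full-attention layers are the crux of the serving story: per-request KV-cache memory scales with how many layers still do exact attention. The numbers below are a back-of-envelope illustration with made-up model dimensions, not measurements from either paper.

```python
# Back-of-envelope KV-cache footprint (illustrative parameters, not figures from the papers).
# Only layers that keep full attention accumulate a KV cache that grows with sequence length;
# linear-attention layers carry a constant-size state instead.
def kv_cache_gib(full_attn_layers: int, kv_heads: int = 8, head_dim: int = 128,
                 seq_len: int = 64_000, bytes_per_elem: int = 2) -> float:
    keys_and_values = 2
    total = keys_and_values * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total / 2**30

print(f"all 36 layers full attention: {kv_cache_gib(36):.1f} GiB per sequence")   # ~8.8 GiB
print(f"only 4 layers kept by search: {kv_cache_gib(4):.1f} GiB per sequence")    # ~1.0 GiB
```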
Sources:
1. Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search — Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai, 2025
http://arxiv.org/abs/2508.15884
2. Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models — Soham De, Samuel L. Smith, Aleksandar Botev, Albert Gu, Caglar Gulcehre and collaborators, 2024
https://scholar.google.com/scholar?q=Griffin:+Mixing+Gated+Linear+Recurrences+with+Local+Attention+for+Efficient+Language+Models
3. Zamba: A Compact 7B SSM Hybrid Model — Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Beren Millidge and collaborators, 2024
https://scholar.google.com/scholar?q=Zamba:+A+Compact+7B+SSM+Hybrid+Model
4. Hymba: A Hybrid-head Architecture for Small Language Models — Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Pavlo Molchanov and collaborators, 2025
https://scholar.google.com/scholar?q=Hymba:+A+Hybrid-head+Architecture+for+Small+Language+Models
5. Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models — NVIDIA et al. (including Aaron Blakeman, Song Han, Jan Kautz and collaborators), 2025
https://scholar.google.com/scholar?q=Nemotron-H:+A+Family+of+Accurate+and+Efficient+Hybrid+Mamba-Transformer+Models
6. Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search — Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai, 2025
https://scholar.google.com/scholar?q=Jet-Nemotron:+Efficient+Language+Model+with+Post+Neural+Architecture+Search
7. Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction — Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, Yusuke Oishi, 2026
https://scholar.google.com/scholar?q=Distill-then-Replace:+Efficient+Task-Specific+Hybrid+Attention+Model+Construction
8. The Zamba2 Suite: Technical Report — Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, Beren Millidge, 2024
https://scholar.google.com/scholar?q=The+Zamba2+Suite:+Technical+Report
9. RecurrentGemma: Moving Past Transformers for Efficient Open Language Models — Aleksandar Botev, Soham De, Samuel L. Smith, Anushan Fernando, George-Cristian Muraru and collaborators, 2024
https://scholar.google.com/scholar?q=RecurrentGemma:+Moving+Past+Transformers+for+Efficient+Open+Language+Models
10. Zoology: Measuring and Improving Recall in Efficient Language Models — Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Re, 2023
https://scholar.google.com/scholar?q=Zoology:+Measuring+and+Improving+Recall+in+Efficient+Language+Models
11. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
12. Eigen Attention: Attention in Low-Rank Space for KV Cache Compression — 2024/2025
https://scholar.google.com/scholar?q=Eigen+Attention:+Attention+in+Low-Rank+Space+for+KV+Cache+Compression
13. ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering — 2024/2025
https://scholar.google.com/scholar?q=ClusterAttn:+KV+Cache+Compression+under+Intrinsic+Attention+Clustering
14. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution — 2024/2025
https://scholar.google.com/scholar?q=Expected+Attention:+KV+Cache+Compression+by+Estimating+Attention+from+Future+Queries+Distribution
15. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning — 2024/2025
https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
16. Scaling Linear Attention with Sparse State Expansion — 2024/2025
https://scholar.google.com/scholar?q=Scaling+Linear+Attention+with+Sparse+State+Expansion
17. AI Post Transformers: Jet-Nemotron and Post-Pretraining Model Acceleration — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-post-pretraining-model-4ba5cb.mp3
18. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
19. AI Post Transformers: Dr.LLM: Dynamic Layer Routing in LLMs — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/drllm-dynamic-layer-routing-in-llms/
20. AI Post Transformers: Speed Always Wins: Efficient Large Language Model Architectures — Hal Turing & Dr. Ada Shannon
https://podcast.do-not-panic.com/episodes/speed-always-wins-efficient-large-language-model-architectures/
21. AI Post Transformers: LAQ for Smarter KV Cache Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-23-laq-for-smarter-kv-cache-eviction-3ea2b8.mp3
Interactive Visualization: Jet-Nemotron and PostNAS for Faster Long Context