This episode explores a 2025 arXiv paper proposing “KVNAND,” an on-device LLM inference system that stores both model weights and the attention KV cache in compute-enabled 3D NAND flash to reduce or eliminate reliance on external DRAM. The discussion explains why decode-time generation is often bottlenecked by memory movement rather than raw compute, and argues that the KV cache—not just model weights—has become a major systems problem for long-context inference. It also examines whether the paper’s “DRAM-free” claim is technically convincing, especially given how KV cache costs vary across attention designs like MHA, GQA, and MQA. A listener would find it interesting for its concrete look at hardware-software tradeoffs in local LLM deployment and its skepticism about whether flashy architectural claims hold up under realistic workloads.
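A quick way to see why the attention design matters so much for the "DRAM-free" claim: per-token KV cache size scales linearly with the number of KV heads, so MHA, GQA, and MQA variants of the same model can differ by an order of magnitude or more in cache footprint. A minimal back-of-envelope sketch, using hypothetical 7B-class dimensions (illustrative assumptions, not figures from the KVNAND paper):

```python
# Back-of-envelope KV cache sizing for MHA vs. GQA vs. MQA.
# Dimensions below are hypothetical 7B-class values (32 layers,
# 32 query heads, head_dim 128, fp16), not numbers from the paper.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Total KV cache size: two tensors (K and V) per layer,
    each of shape [seq_len, n_kv_heads, head_dim]."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

layers, head_dim, ctx = 32, 128, 4096
for name, kv_heads in [("MHA", 32), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(ctx, layers, kv_heads, head_dim) / 2**30
    print(f"{name:>17}: {gib:.2f} GiB at {ctx} tokens")
```

At a 4K context this gives roughly 2 GiB for MHA versus 0.5 GiB for 8-group GQA and about 64 MiB for MQA, which is why a system sized around MHA-scale caches looks very different from one serving a GQA model.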
Sources:
1. KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing — Lishuo Deng, Shaojie Xu, Jinwu Chen, Changwei Yan, Jiajie Wang, Zhe Jiang, Weiwei Shan, 2025
http://arxiv.org/abs/2512.03608
2. A Survey of Processing-in-Memory: Techniques, Applications, and Challenges — Seyed H. N. Fatemi Langroudi and others, 2024
https://scholar.google.com/scholar?q=A+Survey+of+Processing-in-Memory:+Techniques,+Applications,+and+Challenges
3. Computational Storage: Where Are We Today? — Keith Townsend, Nils Bjerregaard, Javier Gonzalez and others, 2022
https://scholar.google.com/scholar?q=Computational+Storage:+Where+Are+We+Today?
4. Cambricon-LLM: Memory-Efficient Large Language Model Inference with Compute-Enabled Flash Memory — Cambricon research team, 2024
https://scholar.google.com/scholar?q=Cambricon-LLM:+Memory-Efficient+Large+Language+Model+Inference+with+Compute-Enabled+Flash+Memory
5. Lincoln: Accelerating Long-Context LLM Inference with In-Flash Computing — Lincoln research team, 2024
https://scholar.google.com/scholar?q=Lincoln:+Accelerating+Long-Context+LLM+Inference+with+In-Flash+Computing
6. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU — Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023
https://scholar.google.com/scholar?q=PowerInfer:+Fast+Large+Language+Model+Serving+with+a+Consumer-grade+GPU
7. LLM in a Flash: Efficient Large Language Model Inference with Limited Memory — Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko and others (Apple), 2023
https://scholar.google.com/scholar?q=LLM+in+a+Flash:+Efficient+Large+Language+Model+Inference+with+Limited+Memory
8. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023
https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
9. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang and others, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
10. LLaMA 2 — Hugo Touvron et al., 2023
https://scholar.google.com/scholar?q=LLaMA+2
11. Llama 3.1 — Meta AI, 2024
https://scholar.google.com/scholar?q=Llama+3.1
12. FlashAttention — Tri Dao et al., 2022
https://scholar.google.com/scholar?q=FlashAttention
13. MQA/GQA transformer variants such as GQA in Llama-family models — Various, 2023-2024
https://scholar.google.com/scholar?q=MQA/GQA+transformer+variants+such+as+GQA+in+Llama-family+models
14. Computational storage / near-data processing in SSDs — Various, 2019-2024
https://scholar.google.com/scholar?q=Computational+storage+/+near-data+processing+in+SSDs
15. RazorAttention: Efficient KV Cache Compression through Retrieval Heads — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+through+Retrieval+Heads
16. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
17. AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=AhaKV:+Adaptive+Holistic+Attention-Driven+KV+Cache+Eviction+for+Efficient+Inference+of+Large+Language+Models
18. G-KV: Decoding-Time KV Cache Eviction with Global Attention — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=G-KV:+Decoding-Time+KV+Cache+Eviction+with+Global+Attention
19. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=LLMs+Know+What+to+Drop:+Self-Attention+Guided+KV+Cache+Eviction+for+Efficient+Long-Context+Inference
20. Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-Level Caching — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=Harnessing+Your+DRAM+and+SSD+for+Sustainable+and+Accessible+LLM+Inference+with+Mixed-Precision+and+Multi-Level+Caching
21. Efficient LLM Inference Using Dynamic Input Pruning and Cache-Aware Masking — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=Efficient+LLM+Inference+Using+Dynamic+Input+Pruning+and+Cache-Aware+Masking
22. SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=SLED:+A+Speculative+LLM+Decoding+Framework+for+Efficient+Edge+Serving
23. DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=DSSD:+Efficient+Edge-Device+LLM+Deployment+and+Collaborative+Inference+via+Distributed+Split+Speculative+Decoding
24. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
25. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
26. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
27. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/quantspec-hierarchical-kv-cache-for-self-speculative-decoding/
28. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/
29. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
30. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
31. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
Interactive Visualization: DRAM-Free In-Flash Computing for LLM Inference