This episode explores a 2025 arXiv paper proposing “KVNAND,” an on-device LLM inference system that stores both model weights and the attention KV cache in compute-enabled 3D NAND flash to reduce or eliminate reliance on external DRAM. The discussion explains why decode-time generation is often bottlenecked by memory movement rather than raw compute, and argues that the KV cache—not just model weights—has become a major systems problem for long-context inference. It also examines whether the paper’s “DRAM-free” claim is technically convincing, especially given how KV cache costs vary across attention designs like MHA, GQA, and MQA. A listener would find it interesting for its concrete look at hardware-software tradeoffs in local LLM deployment and its skepticism about whether flashy architectural claims hold up under realistic workloads.
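A quick way to see why the attention design matters so much for the "DRAM-free" claim: per-token KV cache size scales linearly with the number of KV heads, so MHA, GQA, and MQA variants of the same model can differ by an order of magnitude or more in cache footprint. A minimal back-of-envelope sketch, using hypothetical 7B-class dimensions (illustrative assumptions, not figures from the KVNAND paper):

```python
# Back-of-envelope KV cache sizing for MHA vs. GQA vs. MQA.
# Dimensions below are hypothetical 7B-class values (32 layers,
# 32 query heads, head_dim 128, fp16), not numbers from the paper.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Total KV cache size: two tensors (K and V) per layer,
    each of shape [seq_len, n_kv_heads, head_dim]."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

layers, head_dim, ctx = 32, 128, 4096
for name, kv_heads in [("MHA", 32), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(ctx, layers, kv_heads, head_dim) / 2**30
    print(f"{name:>17}: {gib:.2f} GiB at {ctx} tokens")
```

At a 4K context this gives roughly 2 GiB for MHA versus 0.5 GiB for 8-group GQA and about 64 MiB for MQA, which is why a system sized around MHA-scale caches looks very different from one serving a GQA model.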
Sources:
1. KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing — Lishuo Deng, Shaojie Xu, Jinwu Chen, Changwei Yan, Jiajie Wang, Zhe Jiang, Weiwei Shan, 2025
http://arxiv.org/abs/2512.03608
2. A Survey of Processing-in-Memory: Techniques, Applications, and Challenges — Seyed H. N. Fatemi Langroudi and others, 2024
https://scholar.google.com/scholar?q=A+Survey+of+Processing-in-Memory:+Techniques,+Applications,+and+Challenges
3. Computational Storage: Where Are We Today? — Keith Townsend, Nils Bjerregaard, Javier Gonzalez and others, 2022
https://scholar.google.com/scholar?q=Computational+Storage:+Where+Are+We+Today?
4. Cambricon-LLM: Memory-Efficient Large Language Model Inference with Compute-Enabled Flash Memory — Cambricon research team, 2024
https://scholar.google.com/scholar?q=Cambricon-LLM:+Memory-Efficient+Large+Language+Model+Inference+with+Compute-Enabled+Flash+Memory
5. Lincoln: Accelerating Long-Context LLM Inference with In-Flash Computing — Lincoln research team, 2024
https://scholar.google.com/scholar?q=Lincoln:+Accelerating+Long-Context+LLM+Inference+with+In-Flash+Computing
6. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU — Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023
https://scholar.google.com/scholar?q=PowerInfer:+Fast+Large+Language+Model+Serving+with+a+Consumer-grade+GPU
7. LLM in a Flash: Efficient Large Language Model Inference with Limited Memory — Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko and others (Apple), 2023
https://scholar.google.com/scholar?q=LLM+in+a+Flash:+Efficient+Large+Language+Model+Inference+with+Limited+Memory
8. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023
https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
9. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang and others, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
10. LLaMA 2 — Hugo Touvron et al., 2023
https://scholar.google.com/scholar?q=LLaMA+2
11. Llama 3.1 — Meta AI, 2024
https://scholar.google.com/scholar?q=Llama+3.1
12. FlashAttention — Tri Dao et al., 2022
https://scholar.google.com/scholar?q=FlashAttention
13. MQA/GQA transformer variants such as GQA in Llama-family models — Various, 2023-2024
https://scholar.google.com/scholar?q=MQA/GQA+transformer+variants+such+as+GQA+in+Llama-family+models
14. Computational storage / near-data processing in SSDs — Various, 2019-2024
https://scholar.google.com/scholar?q=Computational+storage+/+near-data+processing+in+SSDs
15. RazorAttention: Efficient KV Cache Compression through Retrieval Heads — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+through+Retrieval+Heads
16. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
17. AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=AhaKV:+Adaptive+Holistic+Attention-Driven+KV+Cache+Eviction+for+Efficient+Inference+of+Large+Language+Models
18. G-KV: Decoding-Time KV Cache Eviction with Global Attention — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=G-KV:+Decoding-Time+KV+Cache+Eviction+with+Global+Attention
19. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=LLMs+Know+What+to+Drop:+Self-Attention+Guided+KV+Cache+Eviction+for+Efficient+Long-Context+Inference
20. Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-Level Caching — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=Harnessing+Your+DRAM+and+SSD+for+Sustainable+and+Accessible+LLM+Inference+with+Mixed-Precision+and+Multi-Level+Caching
21. Efficient LLM Inference Using Dynamic Input Pruning and Cache-Aware Masking — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=Efficient+LLM+Inference+Using+Dynamic+Input+Pruning+and+Cache-Aware+Masking
22. SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=SLED:+A+Speculative+LLM+Decoding+Framework+for+Efficient+Edge+Serving
23. DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding — authors not specified in the provided excerpt, 2024/2025
https://scholar.google.com/scholar?q=DSSD:+Efficient+Edge-Device+LLM+Deployment+and+Collaborative+Inference+via+Distributed+Split+Speculative+Decoding
24. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
25. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
26. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
27. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/quantspec-hierarchical-kv-cache-for-self-speculative-decoding/
28. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/
29. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
30. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
31. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
Interactive Visualization: DRAM-Free In-Flash Computing for LLM Inference