This episode explores KVSwap, a system for running long-context language models on memory-constrained devices by offloading the growing KV cache to storage such as NVMe, UFS, or eMMC instead of relying on scarce shared RAM. It explains why standard server-style GPU-to-CPU offloading breaks down on phones and edge devices with unified memory, and why disk offloading is only viable if it is carefully designed around storage bottlenecks such as limited bandwidth, high latency, and read amplification. The discussion highlights KVSwap’s core strategy: keep the full KV cache on disk, use a compact in-memory key-side representation to predict which entries will be needed, prefetch them ahead of computation, overlap I/O with decoding, and smooth access patterns with buffering so reads become more sequential. Listeners interested in local AI will find it compelling because it reframes long-context inference as a systems problem at the intersection of transformers, operating systems, and storage architecture.
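To make that pipeline concrete, here is a minimal Python sketch, assuming a plain file as the stand-in for NVMe/UFS/eMMC storage and a quantized per-block key centroid as the compact in-memory representation; the block size, TOP_K, and the DiskKVCache/decode_loop names are illustrative, not taken from the paper. It shows the four moving parts from the summary: full KV blocks on disk, an in-RAM key sketch used to predict which blocks the next step needs, reads sorted to stay roughly sequential, and a background thread that overlaps that I/O with the current step's attention.

```python
"""
Minimal sketch of the offloading loop described above. This is NOT the
authors' implementation: the block size, the int8 key-centroid sketch,
TOP_K, and the plain file standing in for NVMe/UFS/eMMC are assumptions.
"""
import numpy as np
from concurrent.futures import ThreadPoolExecutor

BLOCK = 64    # tokens per on-disk KV block (assumption)
D = 128       # per-head hidden dimension (assumption)
TOP_K = 4     # KV blocks fetched per decode step (assumption)


class DiskKVCache:
    """Full KV cache lives on disk; only a compact key-side sketch stays in RAM."""

    def __init__(self, path="kv_cache.bin"):
        self.path = path
        self.key_sketch = []   # (int8 centroid, scale) per block, kept in memory
        open(path, "wb").close()

    def append_block(self, keys, values):
        """Append one full-precision KV block to storage; keep a tiny sketch in RAM."""
        with open(self.path, "ab") as f:
            f.write(keys.astype(np.float16).tobytes())
            f.write(values.astype(np.float16).tobytes())
        centroid = keys.mean(axis=0)                        # compact key-side summary
        scale = float(np.abs(centroid).max()) / 127.0 + 1e-8
        self.key_sketch.append((np.round(centroid / scale).astype(np.int8), scale))

    def predict_blocks(self, query):
        """Score blocks with the in-RAM sketch; return the TOP_K most likely needed."""
        scores = [float(query @ (c.astype(np.float32) * s)) for c, s in self.key_sketch]
        return list(np.argsort(scores)[-TOP_K:])

    def read_blocks(self, block_ids):
        """Read selected blocks from disk; sorting ids keeps reads roughly sequential."""
        stride = BLOCK * D * 2 * 2                          # keys + values, float16 bytes
        out = {}
        with open(self.path, "rb") as f:
            for b in sorted(block_ids):
                f.seek(int(b) * stride)
                buf = np.frombuffer(f.read(stride), dtype=np.float16)
                out[b] = (buf[:BLOCK * D].reshape(BLOCK, D),   # keys
                          buf[BLOCK * D:].reshape(BLOCK, D))   # values
        return out


def decode_loop(cache, queries):
    """Overlap the disk reads for step t+1 with the attention compute for step t.

    The toy loop assumes all queries are known up front so the overlap is easy
    to see; in real decoding the next query does not exist yet, which is why a
    KVSwap-style system has to predict the needed entries and prefetch them.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(cache.read_blocks, cache.predict_blocks(queries[0]))
        for t, q in enumerate(queries):
            kv = pending.result()                           # blocks prefetched for this step
            if t + 1 < len(queries):                        # kick off I/O for the next step
                pending = pool.submit(cache.read_blocks,
                                      cache.predict_blocks(queries[t + 1]))
            keys = np.concatenate([k for k, _ in kv.values()])
            vals = np.concatenate([v for _, v in kv.values()])
            logits = (keys @ q).astype(np.float32)
            attn = np.exp(logits - logits.max())
            yield (attn / attn.sum()) @ vals                # attention over fetched blocks only
```

Sorting block ids before reading is the buffering idea from the summary in its simplest form: it turns scattered lookups into a mostly forward sweep over the file, which is the access pattern low-end flash storage handles best.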
Sources:
1. KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference — Huawei Zhang, Chunwei Xia, Zheng Wang, 2025
http://arxiv.org/abs/2511.11907
2. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng et al., 2023
https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
3. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
4. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval — Sheng Shen et al., 2024
https://scholar.google.com/scholar?q=RetrievalAttention:+Accelerating+Long-Context+LLM+Inference+via+Vector+Retrieval
5. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Lee et al., 2024
https://scholar.google.com/scholar?q=InfiniGen
6. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Qin et al., 2024
https://scholar.google.com/scholar?q=Mooncake
7. SnapKV — Li et al., 2024
https://scholar.google.com/scholar?q=SnapKV
8. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhang et al., 2023
https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
9. StreamingLLM — Xiao et al., 2024
https://scholar.google.com/scholar?q=StreamingLLM
10. PyramidInfer: Pyramid KV Cache Compression for High-Throughput LLM Inference — approx. recent LLM systems authors, 2024/2025
https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-Throughput+LLM+Inference
11. Inference-Time Hyper-Scaling with KV Cache Compression — approx. recent LLM inference authors, 2024/2025
https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression
12. MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference — approx. recent multimodal inference authors, 2024/2025
https://scholar.google.com/scholar?q=MadaKV:+Adaptive+Modality-Perception+KV+Cache+Eviction+for+Efficient+Multimodal+Long-Context+Inference
13. Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks — approx. recent long-context LLM authors, 2024/2025
https://scholar.google.com/scholar?q=Model+Tells+You+Where+to+Merge:+Adaptive+KV+Cache+Merging+for+LLMs+on+Long-Context+Tasks
14. KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments — approx. recent efficient inference authors, 2024/2025
https://scholar.google.com/scholar?q=KeyDiff:+Key+Similarity-Based+KV+Cache+Eviction+for+Long-Context+LLM+Inference+in+Resource-Constrained+Environments
15. CHESS: Context-Aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference — approx. recent long-context inference authors, 2024/2025
https://scholar.google.com/scholar?q=CHESS:+Context-Aware+Hierarchical+Efficient+Semantic+Selection+for+Long-Context+LLM+Inference
16. Compressing Context to Enhance Inference Efficiency of Large Language Models — approx. recent LLM efficiency authors, 2024/2025
https://scholar.google.com/scholar?q=Compressing+Context+to+Enhance+Inference+Efficiency+of+Large+Language+Models
17. HyperAttention: Long-Context Attention in Near-Linear Time — Han et al., 2023
https://scholar.google.com/scholar?q=HyperAttention:+Long-Context+Attention+in+Near-Linear+Time
18. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning — approx. recent hybrid-attention authors, 2024/2025
https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
19. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-Attention — Munkhdalai et al., 2024
https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-Attention
20. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
21. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
22. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
23. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
24. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
25. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
26. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3
Interactive Visualization: KVSwap for Disk-Aware Long-Context On-Device Inference