This episode explores KVSwap, a system for running long-context language models on memory-constrained devices by offloading the growing KV cache to storage such as NVMe, UFS, or eMMC instead of relying on scarce shared RAM. It explains why standard server-style GPU-to-CPU offloading breaks down on phones and edge devices with unified memory, and why disk offloading is only viable if it is carefully designed around storage bottlenecks such as limited bandwidth, high latency, and read amplification. The discussion highlights KVSwap’s core strategy: keep the full KV cache on disk, use a compact in-memory key-side representation to predict which entries will be needed, prefetch them ahead of computation, overlap I/O with decoding, and smooth access patterns with buffering so reads become more sequential. Listeners interested in local AI will find it compelling because it reframes long-context inference as a systems problem at the intersection of transformers, operating systems, and storage architecture.
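To make that pipeline concrete, here is a minimal Python sketch, assuming a plain file as the stand-in for NVMe/UFS/eMMC storage and a quantized per-block key centroid as the compact in-memory representation; the block size, TOP_K, and the DiskKVCache/decode_loop names are illustrative, not taken from the paper. It shows the four moving parts from the summary: full KV blocks on disk, an in-RAM key sketch used to predict which blocks the next step needs, reads sorted to stay roughly sequential, and a background thread that overlaps that I/O with the current step's attention.

```python
"""
Minimal sketch of the offloading loop described above. This is NOT the
authors' implementation: the block size, the int8 key-centroid sketch,
TOP_K, and the plain file standing in for NVMe/UFS/eMMC are assumptions.
"""
import numpy as np
from concurrent.futures import ThreadPoolExecutor

BLOCK = 64    # tokens per on-disk KV block (assumption)
D = 128       # per-head hidden dimension (assumption)
TOP_K = 4     # KV blocks fetched per decode step (assumption)


class DiskKVCache:
    """Full KV cache lives on disk; only a compact key-side sketch stays in RAM."""

    def __init__(self, path="kv_cache.bin"):
        self.path = path
        self.key_sketch = []   # (int8 centroid, scale) per block, kept in memory
        open(path, "wb").close()

    def append_block(self, keys, values):
        """Append one full-precision KV block to storage; keep a tiny sketch in RAM."""
        with open(self.path, "ab") as f:
            f.write(keys.astype(np.float16).tobytes())
            f.write(values.astype(np.float16).tobytes())
        centroid = keys.mean(axis=0)                        # compact key-side summary
        scale = float(np.abs(centroid).max()) / 127.0 + 1e-8
        self.key_sketch.append((np.round(centroid / scale).astype(np.int8), scale))

    def predict_blocks(self, query):
        """Score blocks with the in-RAM sketch; return the TOP_K most likely needed."""
        scores = [float(query @ (c.astype(np.float32) * s)) for c, s in self.key_sketch]
        return list(np.argsort(scores)[-TOP_K:])

    def read_blocks(self, block_ids):
        """Read selected blocks from disk; sorting ids keeps reads roughly sequential."""
        stride = BLOCK * D * 2 * 2                          # keys + values, float16 bytes
        out = {}
        with open(self.path, "rb") as f:
            for b in sorted(block_ids):
                f.seek(int(b) * stride)
                buf = np.frombuffer(f.read(stride), dtype=np.float16)
                out[b] = (buf[:BLOCK * D].reshape(BLOCK, D),   # keys
                          buf[BLOCK * D:].reshape(BLOCK, D))   # values
        return out


def decode_loop(cache, queries):
    """Overlap the disk reads for step t+1 with the attention compute for step t.

    The toy loop assumes all queries are known up front so the overlap is easy
    to see; in real decoding the next query does not exist yet, which is why a
    KVSwap-style system has to predict the needed entries and prefetch them.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(cache.read_blocks, cache.predict_blocks(queries[0]))
        for t, q in enumerate(queries):
            kv = pending.result()                           # blocks prefetched for this step
            if t + 1 < len(queries):                        # kick off I/O for the next step
                pending = pool.submit(cache.read_blocks,
                                      cache.predict_blocks(queries[t + 1]))
            keys = np.concatenate([k for k, _ in kv.values()])
            vals = np.concatenate([v for _, v in kv.values()])
            logits = (keys @ q).astype(np.float32)
            attn = np.exp(logits - logits.max())
            yield (attn / attn.sum()) @ vals                # attention over fetched blocks only
```

Sorting block ids before reading is the buffering idea from the summary in its simplest form: it turns scattered lookups into a mostly forward sweep over the file, which is the access pattern low-end flash storage handles best.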
Sources:
1. KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference — Huawei Zhang, Chunwei Xia, Zheng Wang, 2025
http://arxiv.org/abs/2511.11907
2. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng et al., 2023
https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
3. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
4. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval — Sheng Shen et al., 2024
https://scholar.google.com/scholar?q=RetrievalAttention:+Accelerating+Long-Context+LLM+Inference+via+Vector+Retrieval
5. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Lee et al., 2024
https://scholar.google.com/scholar?q=InfiniGen
6. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Qin et al., 2024
https://scholar.google.com/scholar?q=Mooncake
7. SnapKV — Li et al., 2024
https://scholar.google.com/scholar?q=SnapKV
8. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhang et al., 2023
https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
9. StreamingLLM — Xiao et al., 2024
https://scholar.google.com/scholar?q=StreamingLLM
10. PyramidInfer: Pyramid KV Cache Compression for High-Throughput LLM Inference — approx. recent LLM systems authors, 2024/2025
https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-Throughput+LLM+Inference
11. Inference-Time Hyper-Scaling with KV Cache Compression — approx. recent LLM inference authors, 2024/2025
https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression
12. MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference — approx. recent multimodal inference authors, 2024/2025
https://scholar.google.com/scholar?q=MadaKV:+Adaptive+Modality-Perception+KV+Cache+Eviction+for+Efficient+Multimodal+Long-Context+Inference
13. Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks — approx. recent long-context LLM authors, 2024/2025
https://scholar.google.com/scholar?q=Model+Tells+You+Where+to+Merge:+Adaptive+KV+Cache+Merging+for+LLMs+on+Long-Context+Tasks
14. KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments — approx. recent efficient inference authors, 2024/2025
https://scholar.google.com/scholar?q=KeyDiff:+Key+Similarity-Based+KV+Cache+Eviction+for+Long-Context+LLM+Inference+in+Resource-Constrained+Environments
15. CHESS: Context-Aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference — approx. recent long-context inference authors, 2024/2025
https://scholar.google.com/scholar?q=CHESS:+Context-Aware+Hierarchical+Efficient+Semantic+Selection+for+Long-Context+LLM+Inference
16. Compressing Context to Enhance Inference Efficiency of Large Language Models — approx. recent LLM efficiency authors, 2024/2025
https://scholar.google.com/scholar?q=Compressing+Context+to+Enhance+Inference+Efficiency+of+Large+Language+Models
17. HyperAttention: Long-Context Attention in Near-Linear Time — Han et al., 2023
https://scholar.google.com/scholar?q=HyperAttention:+Long-Context+Attention+in+Near-Linear+Time
18. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning — approx. recent hybrid-attention authors, 2024/2025
https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
19. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-Attention — Munkhdalai et al., 2024
https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-Attention
20. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
21. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
22. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
23. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
24. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
25. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
26. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3
Interactive Visualization: KVSwap for Disk-Aware Long-Context On-Device Inference