This episode explores a USENIX FAST'26 paper that tackles a core infrastructure bottleneck of LLM inference: loading massive model weights from storage into accelerator memory. The authors present a programmable page cache framework that achieves 2-4× faster cold starts by exploiting the predictable, sequential access pattern of model loading and its XPU affinity, while remaining fully compatible with existing model formats, inference frameworks, and hardware. Prior approaches such as ServerlessLLM and BlitzScale, by contrast, require custom formats or specific interconnects. The discussion examines why the stock kernel page cache underutilizes modern SSD bandwidth (its conservative prefetching and LRU eviction policy are tuned for general workloads, not bulk sequential reads) and how a userspace-programmable caching layer can optimize for the specific characteristics of model loading without intrusive kernel modifications. Listeners interested in production ML infrastructure, storage-systems optimization, or the operational challenges of deploying large models at scale will find concrete insight into how I/O dominates cold-start latency, and into emerging solutions that bridge the three-orders-of-magnitude gap between SSD and GPU memory bandwidth.
Sources:
1. Accelerating LLM Cold Starts with Programmable Page Cache
https://www.usenix.org/system/files/fast26-liu-yubo.pdf
2. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022
https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
3. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving — Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023
https://scholar.google.com/scholar?q=AlpaServe:+Statistical+Multiplexing+with+Model+Parallelism+for+Deep+Learning+Serving
4. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023
https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
5. ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models — Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 2024
https://scholar.google.com/scholar?q=ServerlessLLM:+Locality-Enhanced+Serverless+Inference+for+Large+Language+Models
6. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale — Aminabadi et al., 2022
https://scholar.google.com/scholar?q=DeepSpeed-Inference:+Enabling+Efficient+Inference+of+Transformer+Models+at+Unprecedented+Scale
7. ZeRO-Offload: Democratizing Billion-Scale Model Training — Ren et al., 2021
https://scholar.google.com/scholar?q=ZeRO-Offload:+Democratizing+Billion-Scale+Model+Training
8. Safetensors: Simple, safe way to store and distribute tensors — HuggingFace, 2022
https://scholar.google.com/scholar?q=Safetensors:+Simple,+safe+way+to+store+and+distribute+tensors
9. AI Post Transformers: LLM Cold Starts: Fixing Linux Page Cache for Model Loading — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-17-llm-cold-starts-fixing-linux-page-cache-a9f9a9.mp3
10. AI Post Transformers: SolidAttention: Efficient SSD-based KV Cache Offloading for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-17-solidattention-efficient-ssd-based-kv-ca-336b79.mp3
11. AI Post Transformers: Bidaw: Computation-Storage Aware KV Caching for LLMs — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-16-bidaw-computation-storage-aware-kv-cachi-9d89fb.mp3
12. AI Post Transformers: xLLM: Co-Locating Online and Offline LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3
Interactive Visualization: Accelerating LLM Cold Starts with Programmable Page Cache