AI: post transformers

Strata: Efficient Hierarchical Context Caching for LLM Serving



The August 26, 2025 paper, a collaboration between Stanford, NVIDIA, Shanghai Jiao Tong University, the University of Michigan, the University of Colorado Boulder, and Carnegie Mellon University, introduces **Strata**, a hierarchical context caching framework designed to improve the performance of serving Large Language Models (LLMs) with long context windows. The core problem Strata addresses is that while caching key-value (KV) states is essential for efficiency, transferring large, fragmented cached contexts from slower memory tiers (such as CPU memory) back to the GPU creates **severe I/O bottlenecks and performance stalls**. The paper also explains why paged attention, although designed to eliminate GPU memory fragmentation, causes data fragmentation when offloading: with long contexts, the KV cache is scattered across many small pages, so reloading it requires many small, non-contiguous transfers. Strata overcomes these issues through two main innovations: **GPU-assisted I/O**, which mitigates data fragmentation and achieves high bandwidth utilization, and **cache-aware request scheduling**, which forms balanced batches and overlaps unavoidable I/O stalls with complementary work. The evaluation shows that Strata significantly reduces **Time-To-First-Token (TTFT)** and increases throughput compared to state-of-the-art serving systems such as vLLM + LMCache and TensorRT-LLM on long-context benchmarks.
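
To make the two ideas concrete, here is a minimal Python sketch (not the paper's implementation) of what a hierarchical KV-cache server has to do: coalesce scattered CPU-resident KV pages into one contiguous buffer (the role Strata's GPU-assisted I/O plays on the device, stood in for here by a plain host-side loop), and form batches that pair I/O-heavy requests with compute-heavy ones so cache fetches overlap prefill compute. The `Request`, `gather_pages`, and `schedule_batch` names, the page size, and the scheduling heuristic are all illustrative assumptions, not Strata's actual interfaces.

```python
# A minimal sketch (assumed names and heuristics, not Strata's real code) of
# two ideas from the paper: (1) batching many small, scattered KV-page copies
# into one gather so the transfer can use available bandwidth, and (2) pairing
# I/O-heavy requests (high cache hit, data must be fetched from CPU memory)
# with compute-heavy ones (low cache hit, long prefill) so fetches overlap
# useful GPU work.

from dataclasses import dataclass

PAGE_TOKENS = 16  # tokens per KV page; an assumed page size


@dataclass
class Request:
    rid: str
    prompt_tokens: int          # total prompt length
    cached_page_ids: list[int]  # KV pages already resident in the CPU tier

    @property
    def hit_ratio(self) -> float:
        cached = len(self.cached_page_ids) * PAGE_TOKENS
        return min(1.0, cached / max(1, self.prompt_tokens))


def gather_pages(cpu_tier: dict[int, list[float]], page_ids: list[int]) -> list[float]:
    """Coalesce scattered CPU-resident pages into one contiguous buffer.

    Strata's GPU-assisted I/O does this on the device with a batched gather
    over a page table; a plain loop stands in for that kernel here.
    """
    contiguous: list[float] = []
    for pid in page_ids:
        contiguous.extend(cpu_tier[pid])
    return contiguous


def schedule_batch(pending: list[Request], batch_size: int) -> list[Request]:
    """Form a balanced batch: alternate I/O-bound and compute-bound requests
    so KV-cache loading for one request overlaps prefill compute for another."""
    by_hit = sorted(pending, key=lambda r: r.hit_ratio)  # low hit = compute-heavy
    batch: list[Request] = []
    lo, hi = 0, len(by_hit) - 1
    while len(batch) < batch_size and lo <= hi:
        batch.append(by_hit[hi]); hi -= 1          # I/O-heavy: fetch its pages
        if len(batch) < batch_size and lo <= hi:
            batch.append(by_hit[lo]); lo += 1      # compute-heavy: keep GPU busy
    return batch


if __name__ == "__main__":
    cpu_tier = {pid: [float(pid)] * PAGE_TOKENS for pid in range(8)}
    reqs = [
        Request("a", prompt_tokens=64, cached_page_ids=[0, 1, 2, 3]),  # fully cached
        Request("b", prompt_tokens=256, cached_page_ids=[4]),          # mostly uncached
        Request("c", prompt_tokens=128, cached_page_ids=[5, 6]),
    ]
    for r in schedule_batch(reqs, batch_size=3):
        buf = gather_pages(cpu_tier, r.cached_page_ids)
        print(r.rid, f"hit={r.hit_ratio:.2f}", f"gathered {len(buf)} values")
```

Running the sketch prints the batch order and the size of each gathered buffer; the system described in the paper performs the gather on the GPU itself and drives its scheduling from measured cache hits and transfer costs rather than this simple alternation.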


Source:

https://arxiv.org/html/2508.18572v1


By mcgrof