The Gist Talk

Cake: Computation and I/O Aware KV Cache Loader



This episode introduces Cake, a system designed to speed up Large Language Model (LLM) inference by efficiently preparing the Key-Value (KV) cache for long-context inputs. The core problem is high Time to First Token (TTFT): even with prefix caching, the KV cache must either be recomputed, which is computationally expensive, or loaded from low-bandwidth storage, which incurs high latency. Cake's key innovation is a bidirectional scheduling strategy that uses both paths in parallel, recomputing part of the cache on the GPU while loading the rest from storage, to minimize overall latency. Extensive evaluations show that Cake reduces TTFT by 2.6x on average, and its adaptive scheduling improves system throughput under fluctuating resource availability. The episode also examines how Cake performs across hardware configurations, sequence lengths, and model architectures, confirming that it balances both resources where previous solutions focused exclusively on either computation or I/O.
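The bidirectional idea can be sketched as a simple planning problem: pick how many KV-cache chunks to load from storage versus recompute so that the two paths finish at roughly the same time. This is a minimal toy model, not Cake's actual algorithm or API; the function name, chunk granularity, and constant per-chunk costs are illustrative assumptions (Cake also adapts to fluctuating bandwidth, which this sketch ignores).

```python
def plan_bidirectional(num_chunks, load_ms_per_chunk, compute_ms_per_chunk):
    """Toy planner: I/O loads k chunks while the GPU recomputes the
    remaining num_chunks - k in parallel; TTFT is the slower of the two.
    Returns (k, ttft_ms) minimizing the parallel finish time.
    Hypothetical sketch -- assumes fixed per-chunk costs."""
    best_k, best_ttft = 0, float("inf")
    for k in range(num_chunks + 1):
        ttft = max(k * load_ms_per_chunk, (num_chunks - k) * compute_ms_per_chunk)
        if ttft < best_ttft:
            best_k, best_ttft = k, ttft
    return best_k, best_ttft

# Example: 32 chunks, 5 ms to load a chunk, 3 ms to recompute one.
# Compute-only would take 96 ms; load-only 160 ms; splitting the work
# (load 12 chunks, recompute 20) finishes both paths in 60 ms.
k, ttft = plan_bidirectional(32, 5.0, 3.0)
```

The balance point is where `k * load_cost ≈ (num_chunks - k) * compute_cost`, which is why a system that monitors both rates can keep adjusting the split as bandwidth or GPU load changes.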


The Gist Talk, by kw