The June 2025 paper characterizes and optimizes the **Key-Value Cache (KV$)** workload patterns associated with serving large language models (LLMs) at a major cloud provider. Using **real-world production traces** from customer-facing (to-C) and business-facing (to-B) workloads, the authors analyze KV$ reuse behavior and find that reuse is heavily skewed, with single-turn requests mattering as much as multi-turn requests, especially in **API-dominated workloads**. Crucially, the analysis reveals that **KV$ lifespan is ephemeral** and that reuse probability follows predictable exponential distributions within specific request categories. Based on these findings, the researchers propose a **workload-aware cache eviction policy** that significantly improves the cache hit ratio and reduces query time to first token (TTFT) compared to standard policies like LRU and LFU.
Source:
https://arxiv.org/pdf/2506.02634v1
By mcgrof
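
To make "workload-aware eviction" concrete, here is a minimal Python sketch of the general idea, not the paper's actual algorithm: it assumes each request category has a fitted exponential reuse-rate parameter (the `CATEGORY_LAMBDA` values and the `WorkloadAwareCache` class are hypothetical, illustrative stand-ins) and evicts the cached KV entry with the lowest estimated reuse probability rather than the least recently used one.

```python
import math
import time

# Hypothetical per-category reuse-rate parameters (the lambda of an exponential
# decay), as one might fit from production traces; the values are illustrative.
CATEGORY_LAMBDA = {
    "multi_turn_chat": 1 / 300.0,   # reuse tends to arrive within minutes
    "single_turn_api": 1 / 30.0,    # reuse window is much shorter
}

class WorkloadAwareCache:
    """Sketch of a workload-aware KV$ eviction policy.

    Each entry's priority is its estimated reuse probability,
    P(reuse at age t) ~ exp(-lambda_category * t), so stale entries from
    short-reuse-window categories are evicted before recently hot ones.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> (category, last_access_time)

    def _reuse_score(self, key, now):
        category, last_access = self.entries[key]
        lam = CATEGORY_LAMBDA.get(category, 1 / 60.0)
        return math.exp(-lam * (now - last_access))

    def put(self, key, category):
        now = time.monotonic()
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Evict the entry with the lowest estimated reuse probability,
            # rather than the least recently used one.
            victim = min(self.entries, key=lambda k: self._reuse_score(k, now))
            del self.entries[victim]
        self.entries[key] = (category, now)

    def get(self, key):
        if key in self.entries:
            category, _ = self.entries[key]
            self.entries[key] = (category, time.monotonic())
            return True
        return False
```

The design choice this sketch illustrates, following the paper's observation that KV$ lifespan is ephemeral and category-dependent, is that eviction order is driven by each category's reuse distribution rather than by recency or frequency alone, which is where LRU and LFU fall short.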