The June 2025 paper characterizes and optimizes the Key-Value Cache (KV$) workload patterns associated with serving large language models (LLMs) at a major cloud provider. Using real-world production traces from both customer-facing (to-C) and business-facing (to-B) workloads, the authors analyze KV$ reuse behavior and find that reuse is heavily skewed, with single-turn requests proving as important as multi-turn requests, especially in API-dominated workloads. Crucially, the analysis shows that KV$ lifespans are ephemeral and that reuse probability follows predictable exponential distributions within specific request categories. Based on these findings, the researchers propose a workload-aware cache eviction policy that significantly improves the cache hit ratio and reduces the time to first token (TTFT) compared to standard policies such as LRU and LFU. Source: https://arxiv.org/pdf/2506.02634v1
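The core idea of such a workload-aware policy can be illustrated with a minimal sketch: if reuse inter-arrival times within a request category follow an exponential distribution with rate λ, an entry's probability of still being reused decays as exp(-λ·age), so the cache can evict the entry with the lowest predicted reuse probability rather than simply the least-recently-used one. The class name, category rates, and scoring function below are hypothetical illustrations under that assumption, not the paper's actual implementation.

```python
import math

class WorkloadAwareCache:
    """Toy KV$ eviction sketch (hypothetical, not the paper's code):
    evicts the entry with the lowest predicted reuse probability,
    assuming exponential reuse inter-arrival times per request category."""

    def __init__(self, capacity, category_rates):
        self.capacity = capacity
        # category -> exponential rate lambda (higher = reuse decays faster)
        self.rates = category_rates
        # key -> (category, last_access_time)
        self.entries = {}

    def _reuse_score(self, key, now):
        cat, last = self.entries[key]
        lam = self.rates.get(cat, 1.0)
        # P(reuse after age t) under an exponential model: exp(-lam * t)
        return math.exp(-lam * (now - last))

    def access(self, key, category, now):
        """Record an access; returns True on a cache hit, False on a miss."""
        if key in self.entries:
            self.entries[key] = (category, now)
            return True
        if len(self.entries) >= self.capacity:
            # Evict the entry least likely to be reused, per the model.
            victim = min(self.entries, key=lambda k: self._reuse_score(k, now))
            del self.entries[victim]
        self.entries[key] = (category, now)
        return False
```

Note how this can diverge from LRU: a recent single-turn entry whose reuse probability decays quickly can be evicted before an older multi-turn entry whose reuse probability decays slowly.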