The 2024 paper introduces SYMPHONY, a system designed to improve memory management and scheduling for Large Language Model (LLM) inference workloads, particularly stateful, multi-turn interactions such as chatbots and AI agents. The authors, researchers from the University of Texas at Austin and the University of Wisconsin-Madison, explain that existing LLM serving engines either waste computation by recomputing Key-Value (K,V) caches on every turn or suffer from load imbalance by offloading caches to host memory, since tying each request to the machine that holds its cache makes the workload stateful and hard to balance. SYMPHONY addresses these issues by using "advisory requests" (signals indicating the likely arrival of a follow-up request) to proactively migrate K,V caches off the critical serving path, thereby enabling fine-grained scheduling and load balancing. Evaluation results show that SYMPHONY significantly reduces latency while handling over eight times as many requests as state-of-the-art baselines.

Source: SYMPHONY: Improving Memory Management for LLM Inference Workloads, December 21, 2024, https://arxiv.org/pdf/2412.16434
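
To make the advisory-request idea concrete, here is a minimal Python sketch, not the paper's implementation: all names (Worker, AdvisoryScheduler, on_advisory, on_request) are hypothetical, and it models only the core mechanism, namely that an advisory signal lets the scheduler move a session's K,V cache to the least-loaded worker before the follow-up request arrives.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    load: int = 0                                 # queued requests on this worker
    kv_cache: dict = field(default_factory=dict)  # session_id -> K,V blocks


class AdvisoryScheduler:
    def __init__(self, workers):
        self.workers = workers
        self.placement = {}  # session_id -> Worker currently holding its cache

    def on_advisory(self, session_id):
        """Hint that session_id will likely send a follow-up request soon:
        proactively migrate its K,V cache to the least-loaded worker."""
        src = self.placement.get(session_id)
        dst = min(self.workers, key=lambda w: w.load)
        if src is not None and src is not dst:
            dst.kv_cache[session_id] = src.kv_cache.pop(session_id)
        self.placement[session_id] = dst

    def on_request(self, session_id):
        """The real request is served wherever the cache already lives,
        so there is no recompute and no on-path cache transfer."""
        worker = self.placement.setdefault(
            session_id, min(self.workers, key=lambda w: w.load))
        worker.kv_cache.setdefault(session_id, [])  # blocks built during decode
        worker.load += 1
        return worker


workers = [Worker("gpu0"), Worker("gpu1")]
sched = AdvisoryScheduler(workers)
sched.on_request("chat-42")              # first turn lands on gpu0
sched.on_advisory("chat-42")             # hint arrives: cache migrates to gpu1
print(sched.on_request("chat-42").name)  # served on gpu1, where the cache waits
```

Because the migration happens on the advisory signal rather than on the request itself, the transfer cost stays off the critical serving path, which is what makes the fine-grained load balancing described in the paper possible.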