AI: post transformers

Tempo: SLO-Aware LLM Serving Maximizing Service Gain



This April 24, 2025 academic paper introduces **Tempo**, a novel scheduling system that optimizes Large Language Model (LLM) serving by addressing the wide variety of Service Level Objectives (**SLOs**) in modern LLM applications. The authors categorize requests into three types: **latency-sensitive**, **throughput-intensive**, and **collective** requests. Each type has distinct performance requirements that existing schedulers fail to manage effectively. Tempo maximizes "service gain" by allocating just enough serving bandwidth to meet each request's SLO, using a **hybrid scheduling strategy** that combines lightweight prediction models for conservative initial estimates of response length with **dependency-graph matching** for complex workflows. Evaluations demonstrate that Tempo significantly outperforms state-of-the-art systems in both service gain and **SLO goodput** across diverse workloads and models.
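For intuition only, here is a minimal Python sketch of the kind of SLO-aware prioritization described above: requests are tagged with one of the three types, carry a conservative (upper-quantile) response-length estimate, and are served in order of least SLO slack, so bandwidth beyond "just enough" stays available for tighter requests. The class names, fields, and slack heuristic are illustrative assumptions, not Tempo's actual implementation.

```python
# Illustrative sketch only: the names, SLO fields, and the slack-first
# heuristic below are assumptions, not Tempo's real scheduling algorithm.
from dataclasses import dataclass
from enum import Enum, auto
import heapq
import time


class RequestKind(Enum):
    LATENCY_SENSITIVE = auto()     # e.g. interactive chat with a tight deadline
    THROUGHPUT_INTENSIVE = auto()  # e.g. batch jobs with a deadline on the whole run
    COLLECTIVE = auto()            # e.g. a workflow whose SLO spans several requests


@dataclass
class Request:
    req_id: str
    kind: RequestKind
    slo_deadline_s: float   # absolute deadline implied by the request's SLO
    predicted_tokens: int   # conservative (upper-quantile) response-length estimate
    arrival_s: float


def slack(req: Request, now: float, tokens_per_s: float) -> float:
    """Spare time left if the request were served at full speed right now.

    Smaller slack means more urgent; requests whose SLOs are far away get
    only "just enough" bandwidth, leaving headroom for tighter requests.
    """
    service_time_s = req.predicted_tokens / tokens_per_s
    return (req.slo_deadline_s - now) - service_time_s


def schedule(requests: list[Request], tokens_per_s: float) -> list[str]:
    """Return request ids in the order a slack-first scheduler would serve them."""
    now = time.monotonic()
    heap = [(slack(r, now, tokens_per_s), r.req_id) for r in requests]
    heapq.heapify(heap)
    return [req_id for _, req_id in (heapq.heappop(heap) for _ in range(len(heap)))]


if __name__ == "__main__":
    now = time.monotonic()
    reqs = [
        Request("chat-1", RequestKind.LATENCY_SENSITIVE, now + 2.0, 150, now),
        Request("batch-7", RequestKind.THROUGHPUT_INTENSIVE, now + 60.0, 4000, now),
        Request("agent-3", RequestKind.COLLECTIVE, now + 10.0, 800, now),
    ]
    print(schedule(reqs, tokens_per_s=500.0))
```

The paper additionally uses dependency-graph matching to derive SLOs for collective, multi-request workflows; that part is omitted from this sketch.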


Source:

April 24, 2025

Tempo: Application-aware LLM Serving with Mixed SLO Requirements

https://arxiv.org/pdf/2504.20068


AI: post transformers, by mcgrof