Paper Bytes

LLM Query Scheduling with Prefix Reuse and Latency Constraints


Research paper: https://arxiv.org/pdf/2502.04677

Authors: Gregory Dexter, Shao Tang, Ata Fatahi Baarzi, Qingquan Song, Tejas Dharamsi, and Aman Gupta

Introduction

In this episode, we explore the challenge of efficiently deploying large language models (LLMs) in online settings, where strict latency constraints—such as time-to-first-token (TTFT) and time-per-output-token (TPOT)—must be met. As demand for AI-generated content grows, inference efficiency becomes a critical bottleneck, and the order in which queries are scheduled directly affects it.

Key Topics Covered

  • The Challenge of Query Scheduling: Existing strategies trade off against each other: First-Come-First-Serve (FCFS) bounds queueing delay but ignores prefix reuse, while Longest-Prefix-Match (LPM) maximizes cache hits but can leave requests with uncommon prefixes waiting, inflating tail latency.
  • Prefix Reuse with RadixAttention: A technique that stores shared prompt prefixes in a radix tree, so the KV cache for a common prefix is computed once and reused across queries, reducing prefill work (a minimal sketch follows this list).
  • The NP-Hard Nature of Scheduling: The paper establishes that optimal query scheduling under TTFT constraints is NP-hard, motivating practical heuristics.
  • Introducing k-LPM: A novel scheduling algorithm that interpolates between FCFS and LPM to balance prefix reuse and fairness, outperforming both in reducing tail TTFT (a hedged sketch follows this list).
  • Empirical Validation: Real-world evaluations show that k-LPM significantly reduces P99 TTFT, making it a promising solution for large-scale LLM inference.
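
To make the prefix-reuse idea concrete, here is a minimal Python sketch of a radix tree over token IDs, in the spirit of RadixAttention. Prompts are inserted as token sequences, and match_len reports how many leading tokens of a new query are already cached. This is a toy illustration only: a real serving stack additionally manages GPU KV-cache blocks, eviction, and reference counting, all of which are omitted here.

```python
class RadixNode:
    """Node in a compressed (radix) tree; edges carry runs of token IDs."""
    def __init__(self):
        self.children = {}  # first token of edge -> (edge tokens, child node)


class RadixTree:
    """Toy radix tree over token IDs: insert prompts, measure shared prefixes."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Add a prompt's token sequence, splitting edges where paths diverge."""
        node, i = self.root, 0
        while i < len(tokens):
            first = tokens[i]
            if first not in node.children:
                node.children[first] = (tuple(tokens[i:]), RadixNode())
                return
            edge, child = node.children[first]
            j = 0  # length of agreement between this edge and the remainder
            while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
                j += 1
            if j < len(edge):
                # Paths diverge mid-edge: split the edge at position j.
                mid = RadixNode()
                mid.children[edge[j]] = (edge[j:], child)
                node.children[first] = (edge[:j], mid)
                child = mid
            i += j
            node = child

    def match_len(self, tokens):
        """Return how many leading tokens are already stored (reusable KV)."""
        node, i = self.root, 0
        while i < len(tokens) and tokens[i] in node.children:
            edge, child = node.children[tokens[i]]
            j = 0
            while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
                j += 1
            i += j
            if j < len(edge):
                break  # query diverged partway along this edge
            node = child
        return i
```

For example, after inserting the token sequence (1, 2, 3), a new query (1, 2, 4) reports a match length of 2, so only its last token needs fresh prefill computation.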

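The k-LPM sketch below encodes one plausible reading of the algorithm, and it is an assumption rather than the paper's exact rule: longest-prefix-match selection restricted to the k oldest waiting requests, which recovers FCFS at k = 1 and pure LPM once k covers the whole queue. The Request type and k_lpm_select helper are illustrative names, and the prefix cache is the RadixTree sketch above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float   # enqueue timestamp; drives the FCFS component
    tokens: tuple    # tokenized prompt; drives the LPM component


def k_lpm_select(queue, cache, k):
    """Pick the next request to run (hedged k-LPM sketch).

    Assumed rule: apply longest-prefix-match only within the k oldest
    waiting requests. k = 1 reduces to FCFS; k >= len(queue) reduces
    to pure LPM. Consult the paper for the authors' exact formulation.
    """
    window = sorted(queue, key=lambda r: r.arrival)[:k]
    # max() keeps the earliest arrival on ties, so an all-miss window
    # degenerates gracefully to FCFS.
    best = max(window, key=lambda r: cache.match_len(r.tokens))
    queue.remove(best)
    return best


# Example: with k = 2, the scheduler may jump the second-oldest request
# ahead of the oldest when its prefix is already cached.
cache = RadixTree()  # from the sketch above
cache.insert((1, 2, 3))
queue = [Request(0.0, (9, 9)), Request(0.1, (1, 2, 4)), Request(0.2, (1, 2, 3))]
nxt = k_lpm_select(queue, cache, k=2)
assert nxt.tokens == (1, 2, 4)  # match length 2 beats the oldest's 0
```

The single knob k trades cache hit rate against queueing fairness, which is the lever the paper tunes to lower P99 TTFT.
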
Conclusion

This research highlights the need for advanced scheduling strategies to improve LLM efficiency in real-world applications. Tune in to learn how k-LPM is pushing the boundaries of AI inference optimization!

Paper Bytes, by Sunil & Jiten