The paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., SOSP 2023) addresses the memory bottlenecks of serving Large Language Models (LLMs).
Here is a short summary of the paper's core concepts:
- The Problem: High-throughput LLM serving requires batching many requests together, but batching is heavily bottlenecked by the Key-Value (KV) cache. The KV cache is large, grows dynamically, and has an unpredictable final length, so existing systems pre-allocate a contiguous chunk of memory sized to each request's maximum possible length. This wastes significant memory through internal and external fragmentation, severely limiting how many requests can be batched together.
- The Solution (PagedAttention): The authors propose PagedAttention, an attention algorithm inspired by virtual memory and paging techniques used in operating systems. Instead of requiring contiguous memory, PagedAttention divides the KV cache into fixed-size blocks (like pages) that can be stored in non-contiguous physical memory.
- The System (vLLM): Built on top of PagedAttention, the authors introduce vLLM, a high-throughput, distributed LLM serving engine. vLLM achieves near-zero waste in KV cache memory and allows flexible memory sharing at the block level across different sequences or requests. This is particularly advantageous for complex decoding algorithms like parallel sampling and beam search, where memory can be shared to drastically reduce overhead.
- The Results: Evaluations show that vLLM improves serving throughput by 2-4× compared to previous state-of-the-art systems like FasterTransformer and Orca, while maintaining the same level of latency. The performance gains are most pronounced with larger models, longer sequences, and complex decoding tasks.
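The block-table and sharing ideas above can be sketched in a few lines. The following is a minimal toy model, not vLLM's actual implementation: block size, class names, and the reference-counting details are illustrative assumptions. Each sequence holds a block table mapping logical KV blocks to physical block ids; forking a sequence (as in parallel sampling) shares all blocks, and a write to a shared block triggers copy-on-write.

```python
# Toy sketch of PagedAttention-style KV-cache block management (NOT vLLM's
# real code). Blocks are just integer ids here; a real system would also
# store and copy the K/V tensors that live in each block.

BLOCK_SIZE = 4  # tokens per KV block; illustrative (vLLM's default differs)

class BlockAllocator:
    """Tracks free physical blocks and per-block reference counts."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

class Sequence:
    """A request's KV cache: a table of non-contiguous physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or table is empty): grab a fresh block.
            self.block_table.append(self.allocator.alloc())
        elif self.allocator.refcount[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled last block is shared,
            # so this sequence gets its own copy before writing into it.
            old = self.block_table[-1]
            self.block_table[-1] = self.allocator.alloc()
            self.allocator.release(old)
        self.num_tokens += 1

    def fork(self):
        # Share all physical blocks with a child (e.g. parallel sampling
        # branching after a common prompt); no memory is copied yet.
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for b in self.block_table:
            self.allocator.share(b)
        return child

# Usage: decode a 6-token prompt, fork, then let the child diverge.
alloc = BlockAllocator(num_blocks=8)
parent = Sequence(alloc)
for _ in range(6):
    parent.append_token()    # fills one block, half-fills a second
child = parent.fork()        # both blocks now shared (refcount 2)
child.append_token()         # copy-on-write fires on the partial block
assert parent.block_table[0] == child.block_table[0]  # full block still shared
assert parent.block_table[1] != child.block_table[1]  # partial block copied
```

Only the partially filled last block is duplicated; the full prompt blocks remain shared, which is why memory sharing across parallel samples or beam-search branches saves so much KV-cache memory in practice.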