The Gist Talk

vAttention: Dynamic LLM Memory Without PagedAttention

This episode covers the paper "vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention," which introduces vAttention, a novel memory management approach for Large Language Model (LLM) serving systems. The paper critiques PagedAttention, the existing standard for dynamic memory allocation, arguing that it introduces performance overhead and software complexity by making the Key-Value (KV) cache non-contiguous in virtual memory. vAttention instead decouples virtual and physical memory allocation using CUDA Virtual Memory Management (VMM) APIs, retaining virtual memory contiguity while still mitigating physical memory fragmentation. In the authors' evaluation, vAttention is a simpler, more portable, and often more performant alternative: it supports various unmodified attention kernels, including FlashAttention-3, out of the box and achieves throughput improvements of up to 1.23× over PagedAttention-based systems. The work also details LLM-specific optimizations, such as deferred reclamation to hide VMM latency and support for smaller 64KB page groups to reduce fragmentation.
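
Since the description centers on the CUDA VMM mechanism, here is a minimal sketch of the core idea: reserve one contiguous virtual address range per request up front, then back it with physical pages only as the sequence grows. This is an illustrative sketch under stated assumptions, not the paper's implementation; the KvCache struct and the kv_reserve/kv_grow helpers are hypothetical names, and error handling is reduced to a single check macro.

// kv_vmm_sketch.cpp -- compile with: nvcc kv_vmm_sketch.cpp -lcuda
#include <cuda.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Abort on any driver-API failure (sketch-level error handling only).
#define CHECK(call) do { CUresult rc_ = (call); assert(rc_ == CUDA_SUCCESS); (void)rc_; } while (0)

struct KvCache {                 // hypothetical per-request bookkeeping
    CUdeviceptr base = 0;        // start of the contiguous virtual range
    size_t reserved = 0;         // virtual bytes reserved (max context length)
    size_t mapped = 0;           // physical bytes actually backed so far
    size_t page = 0;             // physical allocation granularity
    std::vector<CUmemGenericAllocationHandle> handles;  // one per backed page
};

static CUmemAllocationProp devProp(int device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    return prop;
}

// Reserve virtual address space only: no physical memory is consumed yet,
// so an unmodified attention kernel can treat [base, base+reserved) as one
// flat, contiguous KV-cache buffer.
static void kv_reserve(KvCache& kv, size_t max_bytes, int device) {
    CUmemAllocationProp prop = devProp(device);
    // Default granularity is typically 2 MiB; the paper describes adding
    // support for smaller 64KB page groups to reduce fragmentation.
    CHECK(cuMemGetAllocationGranularity(&kv.page, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    kv.reserved = ((max_bytes + kv.page - 1) / kv.page) * kv.page;
    CHECK(cuMemAddressReserve(&kv.base, kv.reserved, 0, 0, 0));
}

// Back one more page with physical memory as the sequence grows; the new
// page is mapped directly after the last, preserving virtual contiguity.
static void kv_grow(KvCache& kv, int device) {
    assert(kv.mapped + kv.page <= kv.reserved);
    CUmemAllocationProp prop = devProp(device);
    CUmemGenericAllocationHandle h;
    CHECK(cuMemCreate(&h, kv.page, &prop, 0));
    CHECK(cuMemMap(kv.base + kv.mapped, kv.page, 0, h, 0));
    CUmemAccessDesc access = {};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id = device;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(kv.base + kv.mapped, kv.page, &access, 1));
    kv.handles.push_back(h);
    kv.mapped += kv.page;
}

int main() {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    KvCache kv;
    kv_reserve(kv, size_t(1) << 30, dev);  // 1 GiB of virtual space, 0 physical
    kv_grow(kv, dev);                      // back pages only as tokens arrive
    kv_grow(kv, dev);
    // kv.base now fronts two physically backed pages inside one contiguous
    // 1 GiB virtual range -- no paging logic inside the attention kernel.
    return 0;
}

In this picture, the deferred reclamation the episode mentions would amount to keeping a finished request's handles mapped and handing the already-backed virtual range to the next request, keeping cuMemCreate/cuMemMap latency off the critical path.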

By kw