We compare and contrast two advanced 2025 memory-management and scheduling techniques for optimizing Large Language Model (LLM) serving throughput and latency: vAttention and Strata.

The first, vAttention, improves upon the popular PagedAttention method by leveraging CUDA Virtual Memory Management (VMM) APIs to keep the KV cache virtually contiguous while committing physical memory on demand, thereby simplifying attention-kernel portability and reducing the performance overheads associated with non-contiguous memory access.

The second, Strata, is a hierarchical context caching framework that boosts throughput by employing GPU-assisted I/O and cache-aware scheduling to efficiently manage and transfer KV cache data between CPU and GPU memory, specifically mitigating the "delay hit" phenomenon and allowing for on-the-fly data layout transformations.

Both systems aim to resolve the efficiency challenges inherent in LLM inference, particularly during the resource-intensive prefill and decode phases, with Strata showing substantial throughput gains over existing hierarchical caching solutions. Ultimately, vAttention and Strata represent different, yet potentially complementary, approaches to addressing the memory fragmentation and I/O bottlenecks that limit LLM serving performance.

Sources:

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention (January 29, 2025). https://arxiv.org/pdf/2405.04437

Strata: Hierarchical Context Caching for Long Context Language Model Serving (August 26, 2025). https://arxiv.org/html/2508.18572v1
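
To make the vAttention mechanism concrete, below is a minimal sketch of reserving a contiguous virtual address range for one request's KV cache with the CUDA driver's VMM APIs (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess) and committing physical pages only as the sequence grows. The 1 GiB reservation, the page-at-a-time growth policy, and the grow_kv_cache helper are illustrative assumptions, not vAttention's actual interface; cleanup (cuMemUnmap, cuMemRelease, cuMemAddressFree) is omitted for brevity.

    // Sketch of the vAttention idea: virtual contiguity, physical memory
    // on demand. Compile against the CUDA driver API (link with -lcuda).
    #include <cuda.h>
    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                          \
      do {                                                       \
        CUresult rc = (call);                                    \
        if (rc != CUDA_SUCCESS) {                                \
          const char* msg = nullptr;                             \
          cuGetErrorString(rc, &msg);                            \
          fprintf(stderr, "%s failed: %s\n", #call, msg);        \
          exit(1);                                               \
        }                                                        \
      } while (0)

    int main() {
      CHECK(cuInit(0));
      CUdevice dev;
      CHECK(cuDeviceGet(&dev, 0));
      CUcontext ctx;
      CHECK(cuCtxCreate(&ctx, 0, dev));

      // Allocation properties: pinned device memory on GPU 0.
      CUmemAllocationProp prop = {};
      prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
      prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
      prop.location.id = dev;

      // Physical pages must be a multiple of the allocation
      // granularity (typically 2 MiB on current GPUs).
      size_t gran = 0;
      CHECK(cuMemGetAllocationGranularity(&gran, &prop,
            CU_MEM_ALLOC_GRANULARITY_MINIMUM));

      // Reserve virtual space for the maximum KV cache of one request
      // (assumed 1 GiB here). No physical memory is consumed yet.
      size_t max_bytes = ((1ull << 30) + gran - 1) / gran * gran;
      CUdeviceptr kv_base = 0;
      CHECK(cuMemAddressReserve(&kv_base, max_bytes, 0, 0, 0));

      // As decoding proceeds, commit one physical page at a time and
      // map it at the tail of the reserved range. The pointer kv_base
      // that attention kernels see stays contiguous throughout, so
      // unmodified (non-paged) kernels can be used.
      size_t committed = 0;
      auto grow_kv_cache = [&]() {
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, gran, &prop, 0));
        CHECK(cuMemMap(kv_base + committed, gran, 0, h, 0));
        CUmemAccessDesc access = {};
        access.location = prop.location;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        CHECK(cuMemSetAccess(kv_base + committed, gran, &access, 1));
        committed += gran;
      };

      grow_kv_cache();  // e.g. triggered when the current page fills
      printf("virtual: %zu MiB reserved, physical: %zu MiB committed\n",
             max_bytes >> 20, committed >> 20);
      return 0;
    }

The key property this illustrates is that fragmentation is handled in the virtual-to-physical mapping, as with PagedAttention, but the indirection lives in the driver's page tables rather than in the attention kernel itself.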
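
Strata's "delay hit" phenomenon can likewise be illustrated with a toy model: a request whose cached KV prefix is still in flight from CPU memory looks like a cache hit to a naive scheduler but stalls like a miss. The sketch below scores each queued request by an estimated start delay and orders a batch accordingly; the state machine, cost constants, and heuristic are our assumptions for illustration, not Strata's actual scheduler.

    // Toy model of cache-aware scheduling around delay hits.
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    enum class PrefixState { OnGpu, OnCpu, Loading, Miss };

    struct Request {
      std::string id;
      PrefixState prefix;   // where this request's cached KV prefix lives
      size_t prefix_tokens; // cached prefix length in tokens
      size_t new_tokens;    // tokens still needing prefill compute
    };

    // Estimated wait (arbitrary units) before compute can start. A
    // naive scheduler scores OnCpu and Loading as instant hits, which
    // is exactly the delay-hit mistake: the data is not on the GPU yet.
    static double start_delay(const Request& r, double h2d_per_token) {
      switch (r.prefix) {
        case PrefixState::OnGpu:   return 0.0;
        case PrefixState::OnCpu:   return r.prefix_tokens * h2d_per_token;
        case PrefixState::Loading: return r.prefix_tokens * h2d_per_token * 0.5;
        case PrefixState::Miss:    return 0.0;  // no transfer, just compute
      }
      return 0.0;
    }

    int main() {
      const double h2d_per_token = 0.01;  // assumed CPU->GPU cost/token
      std::vector<Request> queue = {
          {"A", PrefixState::OnCpu,   8192, 128},
          {"B", PrefixState::OnGpu,   4096, 256},
          {"C", PrefixState::Loading, 8192, 64},   // delay hit on A's prefix
          {"D", PrefixState::Miss,    0,    4096},
      };

      // Cache-aware ordering: GPU-resident hits and pure-compute misses
      // go first; requests still waiting on CPU->GPU transfers go last,
      // giving those transfers time to finish behind the compute.
      std::stable_sort(queue.begin(), queue.end(),
                       [&](const Request& a, const Request& b) {
                         return start_delay(a, h2d_per_token) <
                                start_delay(b, h2d_per_token);
                       });

      for (const auto& r : queue)
        printf("run %s (start delay %.2f)\n", r.id.c_str(),
               start_delay(r, h2d_per_token));
      return 0;
    }

In this toy ordering, requests B and D run first while the longer transfers for A and C complete behind them; the real system goes further by issuing those transfers asynchronously to overlap with prefill compute and by using GPU-assisted I/O to transform the KV data layout on the fly during the copy.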