AI: post transformers

Hyper-Scaling LLM Inference with KV Cache Compression


This June 5, 2025 paper, a collaboration between the University of Edinburgh and NVIDIA, introduces the concept of **inference-time hyper-scaling** for large language models (LLMs), which aims to boost reasoning accuracy by allowing longer or more parallel token sequences within the same computational budget. The core bottleneck is the size of the key–value (KV) cache, which grows linearly with sequence length and comes to dominate inference cost. To address this, the authors propose **Dynamic Memory Sparsification (DMS)**, a data-efficient method for compressing the KV cache by learning an adaptive token eviction policy with a **delayed eviction mechanism**. Experiments across several LLMs and reasoning tasks show that DMS significantly outperforms existing compression methods, effectively expanding the token budget and achieving higher accuracy at comparable runtime and memory load.
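
To make the delayed-eviction idea more concrete, below is a minimal, illustrative sketch of a KV cache in which tokens flagged for eviction remain attendable for a few more decoding steps before being physically dropped. The class name `DelayedEvictionKVCache`, the `delay_window` parameter, and the hand-written marking logic are assumptions for illustration; in the paper, DMS learns its eviction policy, so this sketch only mirrors the "mark now, drop later" bookkeeping, not the method itself.

```python
# Illustrative sketch only: a toy KV cache with delayed eviction.
# Names (DelayedEvictionKVCache, delay_window) and the hard-coded marking
# are assumptions; DMS learns which tokens to evict.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheEntry:
    step: int                        # decoding step at which the token was cached
    key: List[float]                 # placeholder key vector
    value: List[float]               # placeholder value vector
    marked_at: Optional[int] = None  # step at which the entry was flagged for eviction


class DelayedEvictionKVCache:
    def __init__(self, delay_window: int = 4):
        # How many steps a flagged entry remains attendable before removal.
        self.delay_window = delay_window
        self.entries: List[CacheEntry] = []

    def append(self, step: int, key: List[float], value: List[float]) -> None:
        self.entries.append(CacheEntry(step, key, value))

    def mark_for_eviction(self, step: int, indices: List[int]) -> None:
        # An eviction policy (learned, in DMS) decides which entries to flag.
        for i in indices:
            if self.entries[i].marked_at is None:
                self.entries[i].marked_at = step

    def evict_expired(self, step: int) -> None:
        # Physically drop entries only after the delay window has elapsed,
        # so recently flagged tokens can still contribute to attention.
        self.entries = [
            e for e in self.entries
            if e.marked_at is None or step - e.marked_at < self.delay_window
        ]


if __name__ == "__main__":
    cache = DelayedEvictionKVCache(delay_window=2)
    for t in range(6):
        cache.append(t, key=[float(t)], value=[float(t)])
        if t == 3:
            cache.mark_for_eviction(t, indices=[0, 1])  # policy flags two old tokens
        cache.evict_expired(t)
    # Tokens 0 and 1 survive for two steps after being flagged, then disappear.
    print([e.step for e in cache.entries])  # -> [2, 3, 4, 5]
```

In the paper, the eviction decisions come from a data-efficient retrofit of the model rather than fixed rules; the sketch above is only meant to show how delayed removal keeps recently flagged tokens usable while still shrinking the cache over time.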


Source: https://arxiv.org/html/2506.05345v1


By mcgrof