AI: post transformers

Elastic-Cache: Adaptive KV Caching for Diffusion LLMs



This paper, posted to arXiv on October 16, 2025, introduces **Elastic-Cache**, a training-free strategy that accelerates inference in diffusion large language models (DLMs) by optimizing Key-Value (KV) cache management. Standard DLMs decode slowly because they recompute the KV cache for all tokens at every step, even though those entries change little between steps, especially in shallow layers. Elastic-Cache addresses this with an **adaptive, layer-aware refresh policy**: a lightweight **attention-aware drift test** on the most-attended token determines *when* a refresh is necessary, and a **depth-aware schedule** decides *where* to recompute, restricting the work to the deeper, more volatile layers. The method also applies **block-wise caching** to distant MASK tokens to further reduce computational overhead. Experiments show substantial throughput speedups, up to 45.1× on longer sequences, with negligible loss in accuracy compared to baseline and fixed-period caching methods.
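
A minimal sketch of that when/where decision, assuming a simple drift metric; the names (`attention_drift`, `plan_refresh`, `drift_threshold`, `boundary_layer`) and the exact metric are illustrative, not the paper's actual API:

```python
import numpy as np

def attention_drift(prev_attn: np.ndarray, curr_attn: np.ndarray) -> float:
    """Drift of the attention placed on the most-attended token.

    Measured here as the absolute change in the maximum attention weight
    between two consecutive decoding steps; the paper's drift test is
    attention-aware, but this exact formula is an assumption.
    """
    return float(abs(curr_attn.max() - prev_attn.max()))

def plan_refresh(prev_attn, curr_attn, num_layers,
                 drift_threshold=0.05, boundary_layer=None):
    """Decide *when* to refresh (drift test) and *where* (depth-aware schedule).

    Returns the first layer index whose KV entries should be recomputed,
    or None if the cached entries can be reused for this step.
    """
    if boundary_layer is None:
        boundary_layer = num_layers // 2  # shallow layers assumed stable
    if attention_drift(prev_attn, curr_attn) < drift_threshold:
        return None            # drift is small: reuse the whole cache
    return boundary_layer      # drift is large: refresh only the deeper layers

# Toy usage: attention over 8 tokens at two consecutive decoding steps.
prev = np.array([0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.10, 0.05])
curr = np.array([0.05, 0.05, 0.45, 0.05, 0.05, 0.05, 0.25, 0.05])
print(plan_refresh(prev, curr, num_layers=32))  # -> 16 (recompute deep layers)
```

In this sketch, distant MASK tokens would additionally be grouped into blocks whose cache entries are refreshed together, mirroring the block-wise caching described above.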


Source:

https://arxiv.org/pdf/2510.14973


AI: post transformers, by mcgrof