
The October 16, 2025 academic paper introduces **Elastic-Cache**, a training-free strategy that accelerates inference in diffusion large language models (DLMs) by optimizing Key-Value (KV) cache management. Standard DLMs decode slowly because they recompute the KV cache for every token at every step, even though those keys and values change little between steps, particularly in shallow layers. Elastic-Cache addresses this with an **adaptive, layer-aware refresh policy**: a lightweight **attention-aware drift test** on the most-attended token determines *when* a refresh is necessary, and a **depth-aware schedule** decides *where* to recompute, restricting the work to deeper, more volatile layers. Experiments show substantial throughput speedups, up to 45.1× on longer sequences, with negligible accuracy loss compared to baseline and fixed-period caching methods. The method also applies **block-wise caching** to distant MASK tokens to further reduce computational overhead.
Source:
https://arxiv.org/pdf/2510.14973
By mcgrof
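
For intuition, here is a minimal Python sketch of the two decisions the paper describes: *when* to refresh (an attention-aware drift test on the most-attended token) and *where* to refresh (a depth-aware schedule over layers). All function names, the drift metric, and the threshold and depth values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch of an Elastic-Cache-style refresh decision.
# Names, threshold tau, and start_fraction are illustrative assumptions.

def most_attended_index(attn_weights: np.ndarray) -> int:
    """Index of the token receiving the most attention mass (queries x keys)."""
    return int(attn_weights.sum(axis=0).argmax())

def needs_refresh(cached_k: np.ndarray, fresh_k: np.ndarray,
                  attn_weights: np.ndarray, tau: float = 0.05) -> bool:
    """Attention-aware drift test: compare the cached vs. freshly computed
    key of the most-attended token; trigger a refresh only if the relative
    drift exceeds tau."""
    j = most_attended_index(attn_weights)
    drift = np.linalg.norm(fresh_k[j] - cached_k[j]) / (np.linalg.norm(cached_k[j]) + 1e-8)
    return drift > tau

def layers_to_recompute(num_layers: int, start_fraction: float = 0.5) -> range:
    """Depth-aware schedule: shallow layers change little between steps, so
    recompute the KV cache only from a chosen depth onward."""
    return range(int(num_layers * start_fraction), num_layers)
```

In a decoding loop, such a test would run once per step on a single probe token; when it fires, only the layers returned by the depth-aware schedule would have their KV entries recomputed, while distant MASK tokens are served from a block-level cache, as described in the paper.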