


This August 26, 2024 academic paper introduces **Quest**, an algorithm that improves inference efficiency for **long-context Large Language Models (LLMs)** by tackling the costly self-attention incurred by a large Key-Value (KV) cache. Quest uses **Query-Aware Sparsity** to dynamically identify and load only the **critical KV cache pages** for the current query token, sharply reducing memory movement during decoding. Unlike earlier **Query-Agnostic** methods that evict tokens based on past information alone, Quest never fully discards context, so it maintains high accuracy while achieving substantial speedups in self-attention latency across a range of long-context tasks. The authors provide a detailed breakdown of the methodology and experimental results showing Quest's superior efficiency and accuracy compared to existing baselines.
Source:
https://arxiv.org/pdf/2406.10774
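
For a feel of the mechanism the summary describes, here is a minimal NumPy sketch of query-aware page selection: each KV cache page keeps per-channel min/max key metadata, the current query is used to upper-bound each page's possible attention score, and only the top-scoring pages are retained. The page count, page size, head dimension, and `top_k` value below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_critical_pages(query, key_pages, top_k):
    """Upper-bound each page's attention score against the current
    query and keep only the top_k most critical pages."""
    scores = []
    for keys in key_pages:  # keys: (page_size, head_dim)
        k_min = keys.min(axis=0)  # per-channel minimum key
        k_max = keys.max(axis=0)  # per-channel maximum key
        # Bound on q . k over any token in the page: per channel,
        # take whichever extreme maximizes the product with q.
        bound = np.maximum(query * k_min, query * k_max).sum()
        scores.append(bound)
    order = np.argsort(scores)[::-1]  # most critical pages first
    return sorted(order[:top_k].tolist())

# Toy usage: 8 pages of 16 tokens with 64-dim keys; keep 2 pages.
rng = np.random.default_rng(0)
pages = [rng.standard_normal((16, 64)) for _ in range(8)]
q = rng.standard_normal(64)
print(select_critical_pages(q, pages, top_k=2))
```

Because the bound is recomputed for every new query token rather than fixed by past eviction decisions, no page is ever permanently discarded, which is the query-aware property the summary contrasts with query-agnostic methods.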
By mcgrof