


This August 26, 2024 academic paper introduces **Quest**, an algorithm that improves inference efficiency for **long-context Large Language Models (LLMs)** by tackling the costly self-attention incurred by a large Key-Value (KV) cache. Quest uses **Query-Aware Sparsity** to dynamically identify and load only the **critical KV cache pages** for the current query token, sharply reducing memory movement during decoding. Unlike earlier **Query-Agnostic** methods that evict tokens based on past information alone, Quest never fully discards context, so it maintains high accuracy while achieving substantial speedups in self-attention latency across a range of long-context tasks. The authors provide a detailed breakdown of the methodology and experimental results showing Quest's superior efficiency and accuracy compared to existing baselines.
Source:
https://arxiv.org/pdf/2406.10774
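
For a feel of the mechanism the summary describes, here is a minimal NumPy sketch of query-aware page selection: each KV cache page keeps per-channel min/max key metadata, the current query is used to upper-bound each page's possible attention score, and only the top-scoring pages are retained. The page count, page size, head dimension, and `top_k` value below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_critical_pages(query, key_pages, top_k):
    """Upper-bound each page's attention score against the current
    query and keep only the top_k most critical pages."""
    scores = []
    for keys in key_pages:  # keys: (page_size, head_dim)
        k_min = keys.min(axis=0)  # per-channel minimum key
        k_max = keys.max(axis=0)  # per-channel maximum key
        # Bound on q . k over any token in the page: per channel,
        # take whichever extreme maximizes the product with q.
        bound = np.maximum(query * k_min, query * k_max).sum()
        scores.append(bound)
    order = np.argsort(scores)[::-1]  # most critical pages first
    return sorted(order[:top_k].tolist())

# Toy usage: 8 pages of 16 tokens with 64-dim keys; keep 2 pages.
rng = np.random.default_rng(0)
pages = [rng.standard_normal((16, 64)) for _ in range(8)]
q = rng.standard_normal(64)
print(select_critical_pages(q, pages, top_k=2))
```

Because the bound is recomputed for every new query token rather than fixed by past eviction decisions, no page is ever permanently discarded, which is the query-aware property the summary contrasts with query-agnostic methods.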
By mcgrof