AI Post Transformers

KV selection algorithms: static (SnapKV) vs dynamic (PQCache)


We review three papers that focus on different KV cache optimization techniques, distinguished by the type of KV selection algorithm they use: static vs dynamic. StreamingLLM and SnapKV use static KV selection methods; PQCache introduces a dynamic strategy. These algorithms explore different KV budgets and speculation lengths to estimate optimal theoretical speedups.

A KV cache selection strategy is considered static when the tokens selected for retention are determined either by fixed positional rules or by an initial evaluation of the prompt, and the selection remains unchanged during the generation of new tokens.

StreamingLLM (positional static strategy): StreamingLLM retains a fixed set of initial tokens alongside a rolling window of the most recent tokens. This design is based on the discovery of the "attention sink" phenomenon: LLMs disproportionately allocate high attention scores to the very first few tokens of a sequence, regardless of their semantic importance, simply because softmax requires the attention weights to sum to one. By statically anchoring these initial tokens (usually just 4) and keeping a sliding window of recent tokens, StreamingLLM prevents the model from collapsing during effectively infinite sequence generation.

SnapKV (observation-based static strategy): SnapKV is static because it selects its KV cache before generation begins and keeps this selection fixed. It builds on the observation that an LLM's attention allocation pattern stays remarkably consistent throughout the generation phase. SnapKV uses an "observation window" at the very end of the user's prompt to "vote" on which preceding KV features are most important. It then clusters and compresses these important features, concatenates them with the observation window, and uses this statically compressed KV cache for all subsequent generation steps.
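The two static strategies can be sketched as index-selection functions over the prompt's KV cache. This is a minimal illustration under our own assumptions, not the papers' implementations: the function names and defaults are invented here, and real SnapKV pools attention scores before voting and clusters the selected features, whereas this sketch simply sums the observation window's attention per prefix position.

```python
import numpy as np

def streaming_llm_indices(seq_len, n_sink=4, window=8):
    """Positional static selection (StreamingLLM-style sketch): keep the
    first n_sink "attention sink" tokens plus a sliding window of the
    most recent tokens."""
    recent = list(range(max(n_sink, seq_len - window), seq_len))
    return list(range(min(n_sink, seq_len))) + recent

def snapkv_indices(attn, obs_window=4, budget=8):
    """Observation-based static selection (SnapKV-style sketch): the last
    obs_window query rows of the prompt's attention map "vote" for the
    most-attended prefix positions; keep the top `budget` prefix tokens
    plus the observation window itself, fixed for all later decoding."""
    seq_len = attn.shape[0]
    prefix_len = seq_len - obs_window
    # Pool the observation window's attention over each prefix token
    # (simplified: real SnapKV applies pooling/clustering here).
    votes = attn[-obs_window:, :prefix_len].sum(axis=0)
    keep = np.argsort(votes)[::-1][:budget]
    return sorted(keep.tolist()) + list(range(prefix_len, seq_len))
```

Either function returns the token positions whose K/V vectors are retained; everything else is dropped before decoding begins, which is exactly what makes both strategies static.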
Dynamic KV selection strategy (PQCache): A strategy is dynamic when the subset of KV pairs used for attention computation changes step by step, in real time, depending on the specific token currently being generated.

PQCache (retrieval-based dynamic strategy): PQCache treats KV cache selection as an information retrieval, or Approximate Nearest Neighbor Search (ANNS), problem. It addresses a critical flaw in static dropping methods: tokens that initially appear unimportant may suddenly gain relevance in later generation steps. During the autoregressive decoding phase, PQCache uses lightweight Product Quantization (PQ) to compress keys into centroids and codes. For each newly generated token, it multiplies the token's query with the PQ centroids to approximate attention scores, retrieving only the top-k most relevant KV pairs from CPU memory to perform selective attention. Of the three methods, PQCache shows the strongest evidence of scaling accurately to massive contexts without losing critical information.
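The retrieval step can be sketched as follows. This is a toy under stated assumptions, not PQCache's implementation: real PQCache fits per-sub-space codebooks with k-means and keeps the full KV pairs offloaded to the CPU, whereas here we merely sample centroids from the keys and return the selected indices; all names and defaults are ours.

```python
import numpy as np

def pq_encode(keys, n_sub=4, n_centroids=8, seed=0):
    """Toy Product Quantization: split each key into n_sub sub-vectors
    and map each sub-vector to its nearest centroid. Centroids are
    sampled from the data here; PQCache would fit them with k-means."""
    n, d = keys.shape
    sub = keys.reshape(n, n_sub, d // n_sub)
    rng = np.random.default_rng(seed)
    # One codebook per sub-space: (n_sub, n_centroids, d // n_sub)
    centroids = sub[rng.choice(n, n_centroids, replace=False)].transpose(1, 0, 2)
    # Nearest-centroid code for every token in every sub-space: (n, n_sub)
    dists = ((sub[:, :, None, :] - centroids[None]) ** 2).sum(-1)
    codes = dists.argmin(-1)
    return centroids, codes

def pq_topk(query, centroids, codes, k):
    """Approximate q . k for every cached token via per-sub-space lookup
    tables, then keep the indices of the k highest approximate scores."""
    n_sub = centroids.shape[0]
    q_sub = query.reshape(n_sub, -1)
    # table[s, c] = dot(query sub-vector s, centroid c of sub-space s)
    table = np.einsum('sd,scd->sc', q_sub, centroids)
    approx = table[np.arange(n_sub), codes].sum(-1)  # one score per token
    return np.argsort(approx)[::-1][:k]
```

At each decoding step, the new token's query would call `pq_topk`, and exact attention would then be computed only over the retrieved KV pairs, which is what makes the selection dynamic.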
By dynamically retrieving the top-k tokens, PQCache improves model quality scores by 4.60% over static methods (like SnapKV) on the InfiniteBench dataset, whose tasks average 100K+ token lengths.

Sources:

1) September 2023. Efficient Streaming Language Models with Attention Sinks. Massachusetts Institute of Technology, Meta AI, Carnegie Mellon University, NVIDIA. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. https://arxiv.org/pdf/2309.17453

2) April 2024. SnapKV: LLM Knows What You Are Looking for Before Generation. University of Illinois Urbana-Champaign, Cohere, Princeton University. Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen. https://arxiv.org/pdf/2404.14469

3) June 2025. PQCache: Product Quantization-based KVCache for Long Context LLM Inference. Peking University, Purdue University, Baichuan Inc. Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui. https://arxiv.org/pdf/2407.12820

AI Post Transformers, by mcgrof