These 14 research papers cover **compression techniques for Large Language Models (LLMs)**, with a primary focus on **reducing the size and computational overhead of the Key-Value (KV) cache** so that long contexts can be handled more efficiently. Several novel methods are detailed, including **GVote**, an adaptive compression algorithm that samples queries and lets them vote to determine a per-input cache budget, and **SnapKV**, which selects clustered, important KV positions based on an "observation" window at the end of the prompt, maintaining accuracy while improving speed and memory efficiency. Other approaches include **POD (Proximal tokens over Distant tokens)**, which reduces redundancy by sharing key states across layers for distant tokens while preserving those of proximal tokens, and **DecoQuant**, a quantization method that uses matrix decomposition to reduce quantization error. The sources also examine **prompt compression methods** such as **LLMLingua** and **LongLLMLingua**, and describe **CASC (Context-Adaptive Synthesis and Compression)**, a Retrieval-Augmented Generation (RAG) framework that synthesizes and compresses multi-document contexts to improve answer accuracy in complex domains.
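To make the voting idea concrete, here is a minimal NumPy sketch of GVote-style selection under stated assumptions: sampled stand-ins for future queries each nominate the KV positions they attend to most, and the union of nominations becomes the retained cache, so the effective budget adapts to the input rather than being fixed in advance. The function name, the sampling strategy, and the per-query top-k are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gvote_style_select(keys, query_samples, per_query_top_k):
    """Union-of-votes KV selection (illustrative sketch, not the paper's code).

    keys:            (seq_len, d) cached key vectors.
    query_samples:   (n_samples, d) sampled stand-ins for future queries.
    per_query_top_k: how many positions each sampled query votes for.
    """
    votes = set()
    for q in query_samples:
        scores = keys @ q                          # dot-product attention logits
        top = np.argsort(scores)[-per_query_top_k:]
        votes.update(int(i) for i in top)          # each query nominates its top keys
    # The union of nominations is kept, so the retained budget grows or
    # shrinks with how spread-out the sampled queries' attention is.
    return np.array(sorted(votes))

# Toy usage: 1024 cached keys, 16 sampled queries, 32 votes each.
rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))
queries = rng.standard_normal((16, 64))
kept = gvote_style_select(keys, queries, per_query_top_k=32)
print(len(kept))  # at most 16 * 32; typically fewer once votes overlap
```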
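Likewise, a small sketch of SnapKV-style selection, assuming that importance scores are aggregated by summing the observation window's attention over the prefix and that the "clustering" of selected positions is approximated with a 1-D max-pool; the names and the pooling width are assumptions for illustration only.

```python
import numpy as np

def snapkv_style_select(attn, window, budget, pool=7):
    """Observation-window KV selection (illustrative sketch).

    attn:   (window, seq_len) attention weights from the last `window`
            prompt positions (the "observation" window) over the prompt.
    window: observation-window size; its own KV entries are always kept.
    budget: total number of KV positions to retain.
    pool:   width of the max-pool that keeps clusters of positions
            rather than isolated outliers.
    """
    seq_len = attn.shape[1]
    prefix_len = seq_len - window

    # Score each prefix position by how much the observation window attends to it.
    scores = attn[:, :prefix_len].sum(axis=0)

    # Max-pool the scores so neighbors of important positions also rank
    # highly -- a stand-in for the clustering step the paper describes.
    padded = np.pad(scores, pool // 2, mode="edge")
    pooled = np.array([padded[i:i + pool].max() for i in range(prefix_len)])

    # Keep the top-(budget - window) prefix positions plus the window itself.
    k = max(budget - window, 0)
    top = np.argsort(pooled)[-k:] if k > 0 else np.array([], dtype=np.int64)
    keep = np.concatenate([top, np.arange(prefix_len, seq_len)])
    return np.sort(keep)

# Toy usage: 4096-token prompt, 32-token observation window, 512-entry budget.
rng = np.random.default_rng(0)
attn = rng.random((32, 4096))
attn /= attn.sum(axis=1, keepdims=True)   # rows behave like attention weights
print(snapkv_style_select(attn, window=32, budget=512).shape)  # (512,)
```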
Sources:
https://arxiv.org/pdf/2509.08315
https://arxiv.org/html/2509.09199v1
https://arxiv.org/html/2509.03136v1
https://aclanthology.org/2025.acl-long.1394.pdf
https://proceedings.neurips.cc/paper_files/paper/2024/file/fd0705710bf01b88a60a3d479ea341d9-Paper-Conference.pdf
https://arxiv.org/html/2412.14838v1
https://arxiv.org/pdf/2412.02252
https://aclanthology.org/2024.acl-long.133.pdf
https://arxiv.org/html/2508.19357v1
https://aclanthology.org/2024.acl-long.91.pdf
https://arxiv.org/html/2310.05736v2
https://aclanthology.org/2025.naacl-long.368.pdf
https://arxiv.org/pdf/2404.14469
https://aclanthology.org/2024.findings-emnlp.266.pdf