AI Post Transformers

Adaptive Compression Techniques for Efficient LLM Inference



These 14 research papers provide an overview of compression techniques for Large Language Models (LLMs), primarily focused on reducing the size and computational overhead of the Key-Value (KV) cache so that long contexts can be handled more efficiently. Several novel methods are detailed, including GVote, an adaptive compression algorithm that uses query sampling and voting to find an optimal cache budget, and SnapKV, which selects clustered, important KV positions based on an "observation" window to maintain performance while improving speed and memory efficiency. Other approaches include POD (Proximal tokens over Distant tokens), which reduces redundancy by sharing key states across layers for distant tokens while preserving proximal ones, and DecoQuant, a quantization method that uses matrix decomposition to reduce quantization error. The sources also examine prompt compression methods such as LLMLingua and LongLLMLingua, and describe CASC (Context-Adaptive Synthesis and Compression), a Retrieval-Augmented Generation (RAG) framework that synthesizes and compresses multi-document contexts to improve answer accuracy in complex domains.

Sources:
https://arxiv.org/pdf/2509.08315
https://arxiv.org/html/2509.09199v1
https://arxiv.org/html/2509.03136v1
https://aclanthology.org/2025.acl-long.1394.pdf
https://proceedings.neurips.cc/paper_files/paper/2024/file/fd0705710bf01b88a60a3d479ea341d9-Paper-Conference.pdf
https://arxiv.org/html/2412.14838v1
https://arxiv.org/pdf/2412.02252
https://aclanthology.org/2024.acl-long.133.pdf
https://arxiv.org/html/2508.19357v1
https://aclanthology.org/2024.acl-long.91.pdf
https://arxiv.org/html/2310.05736v2
https://aclanthology.org/2025.naacl-long.368.pdf
https://arxiv.org/pdf/2404.14469
https://aclanthology.org/2024.findings-emnlp.266.pdf
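To make the "observation window" idea concrete, here is a minimal NumPy sketch of SnapKV-style KV selection, as described above: score each prefix position by the attention it receives from the last few queries, smooth the scores with 1-D pooling so selected positions stay clustered, then keep only a fixed budget of positions plus the observation window itself. The function name, pooling kernel size, and exact scoring are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def snapkv_style_select(attn_scores, obs_window, budget, kernel=5):
    """Hypothetical sketch of SnapKV-style KV position selection.

    attn_scores: (num_queries, num_keys) attention weights.
    obs_window:  number of trailing queries used as the observation window.
    budget:      number of prefix KV positions to keep.
    Returns the sorted indices of retained KV positions.
    """
    num_queries, num_keys = attn_scores.shape
    prefix_len = num_keys - obs_window
    # Each prefix key is scored by the attention it receives from the
    # observation-window queries (a simple "voting" aggregation).
    votes = attn_scores[-obs_window:, :prefix_len].sum(axis=0)
    # 1-D average pooling spreads each vote to its neighbors, which
    # encourages clustered (contiguous) selections rather than isolated ones.
    pad = kernel // 2
    padded = np.pad(votes, pad, mode="edge")
    pooled = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    # Keep the top-`budget` prefix positions, sorted to preserve order.
    keep = np.sort(np.argsort(pooled)[-budget:])
    # The observation-window positions themselves are always retained.
    return np.concatenate([keep, np.arange(prefix_len, num_keys)])
```

With a budget well below the prefix length, the retained cache shrinks to `budget + obs_window` entries regardless of how long the original context was, which is the source of the memory savings the episode discusses.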

AI Post Transformers, by mcgrof