AI Post Transformers

Adaptive Compression Techniques for Efficient LLM Inference



These 14 research papers provide an overview of compression techniques for Large Language Models (LLMs), primarily focused on reducing the size and computational overhead of the Key-Value (KV) cache so that long contexts can be handled more efficiently. Several novel methods are detailed, including GVote, an adaptive compression algorithm that uses query sampling and voting to find an optimal cache budget, and SnapKV, which selects clustered, important KV positions based on an "observation" window to maintain performance while improving speed and memory efficiency. Other approaches include POD (Proximal tokens over Distant tokens), which reduces redundancy by sharing key states across layers for distant tokens while preserving proximal ones, and DecoQuant, a quantization method that uses matrix decomposition to reduce quantization error. The sources also examine prompt compression methods such as LLMLingua and LongLLMLingua, and describe CASC (Context-Adaptive Synthesis and Compression), a Retrieval-Augmented Generation (RAG) framework that synthesizes and compresses multi-document contexts to improve answer accuracy in complex domains.

Sources:
https://arxiv.org/pdf/2509.08315
https://arxiv.org/html/2509.09199v1
https://arxiv.org/html/2509.03136v1
https://aclanthology.org/2025.acl-long.1394.pdf
https://proceedings.neurips.cc/paper_files/paper/2024/file/fd0705710bf01b88a60a3d479ea341d9-Paper-Conference.pdf
https://arxiv.org/html/2412.14838v1
https://arxiv.org/pdf/2412.02252
https://aclanthology.org/2024.acl-long.133.pdf
https://arxiv.org/html/2508.19357v1
https://aclanthology.org/2024.acl-long.91.pdf
https://arxiv.org/html/2310.05736v2
https://aclanthology.org/2025.naacl-long.368.pdf
https://arxiv.org/pdf/2404.14469
https://aclanthology.org/2024.findings-emnlp.266.pdf
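To make the "observation window" idea concrete, here is a minimal NumPy sketch of SnapKV-style KV selection, as described above: score each prefix position by the attention it receives from the last few queries, smooth the scores with 1-D pooling so selected positions stay clustered, then keep only a fixed budget of positions plus the observation window itself. The function name, pooling kernel size, and exact scoring are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def snapkv_style_select(attn_scores, obs_window, budget, kernel=5):
    """Hypothetical sketch of SnapKV-style KV position selection.

    attn_scores: (num_queries, num_keys) attention weights.
    obs_window:  number of trailing queries used as the observation window.
    budget:      number of prefix KV positions to keep.
    Returns the sorted indices of retained KV positions.
    """
    num_queries, num_keys = attn_scores.shape
    prefix_len = num_keys - obs_window
    # Each prefix key is scored by the attention it receives from the
    # observation-window queries (a simple "voting" aggregation).
    votes = attn_scores[-obs_window:, :prefix_len].sum(axis=0)
    # 1-D average pooling spreads each vote to its neighbors, which
    # encourages clustered (contiguous) selections rather than isolated ones.
    pad = kernel // 2
    padded = np.pad(votes, pad, mode="edge")
    pooled = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    # Keep the top-`budget` prefix positions, sorted to preserve order.
    keep = np.sort(np.argsort(pooled)[-budget:])
    # The observation-window positions themselves are always retained.
    return np.concatenate([keep, np.arange(prefix_len, num_keys)])
```

With a budget well below the prefix length, the retained cache shrinks to `budget + obs_window` entries regardless of how long the original context was, which is the source of the memory savings the episode discusses.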

AI Post Transformers, by mcgrof