Best AI papers explained

REFRAG: Rethinking RAG based Decoding



This paper introduces REFRAG, an efficient decoding framework designed to accelerate **Retrieval-Augmented Generation (RAG)** in Large Language Models (LLMs) by addressing the high latency and memory demands of long-context inputs. The core mechanism compresses context by representing chunks of retrieved text as single embeddings, significantly shortening the input sequence to the decoder and exploiting the **sparse attention patterns** inherent in RAG contexts. Through techniques like **selective compression**, managed by a lightweight reinforcement learning (RL) policy, REFRAG achieves substantial speed improvements (up to **30.85x faster Time-to-First-Token (TTFT)**) without sacrificing accuracy, and enables LLMs to handle context windows up to **16x larger**. Experimental results confirm that this specialized approach outperforms existing methods such as CEPE across various tasks, including RAG, multi-turn conversations, and summarization, highlighting the crucial trade-off between knowledge enrichment and system efficiency.
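The chunk-compression idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: mean-pooling stands in for REFRAG's learned chunk encoder, and a fixed boolean mask stands in for its RL-trained selective-compression policy.

```python
import numpy as np

def compress_context(token_embeddings, chunk_size):
    """Pool each chunk of token embeddings into a single embedding,
    shortening the sequence the decoder must attend over.
    (Mean-pooling is a stand-in for the paper's learned encoder.)"""
    n_tokens, dim = token_embeddings.shape
    n_chunks = n_tokens // chunk_size
    trimmed = token_embeddings[: n_chunks * chunk_size]
    return trimmed.reshape(n_chunks, chunk_size, dim).mean(axis=1)

def selective_compress(token_embeddings, chunk_size, keep_mask):
    """Selective compression: chunks flagged by a policy (here, a given
    boolean mask standing in for the RL policy) keep their full token
    embeddings; all other chunks collapse to one embedding each."""
    n_tokens, dim = token_embeddings.shape
    n_chunks = n_tokens // chunk_size
    chunks = token_embeddings[: n_chunks * chunk_size]
    chunks = chunks.reshape(n_chunks, chunk_size, dim)
    pieces = [
        chunks[i] if keep else chunks[i].mean(axis=0, keepdims=True)
        for i, keep in enumerate(keep_mask)
    ]
    return np.concatenate(pieces, axis=0)

rng = np.random.default_rng(0)
ctx = rng.normal(size=(512, 64))        # 512 retrieved-context token embeddings
compressed = compress_context(ctx, 16)  # 512 / 16 = 32 chunk embeddings
mixed = selective_compress(ctx, 16, [i < 2 for i in range(32)])
print(compressed.shape)  # (32, 64): a 16x shorter decoder input
print(mixed.shape)       # (62, 64): 2 raw chunks (32 tokens) + 30 compressed
```

Because decoder attention cost grows quadratically with sequence length, feeding 32 chunk embeddings instead of 512 token embeddings is what drives the TTFT speedup; the policy decides which chunks are important enough to keep uncompressed.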


By Enoch H. Kang