The Meta Superintelligence Labs team, in collaboration with Rice University and the National University of Singapore, released version 2 of their REFRAG paper on October 12, 2025, this time with concrete details of how they achieved their RAG innovations. We covered the first version of the pre-print on the podcast, back when no such details were given; fortunately, the new paper addresses the concerns we raised about that lack of clarity.
The paper introduces and validates **REFRAG**, a novel and efficient decoding framework designed to improve the performance of Large Language Models (LLMs) in **Retrieval-Augmented Generation (RAG)** applications. REFRAG tackles the latency and memory costs of long-context inputs by exploiting the **sparse attention patterns** common in RAG contexts: it **compresses, senses, and expands** context representations using chunk embeddings. Experimental results demonstrate significant performance gains, including up to **30.85× Time-to-First-Token (TTFT) acceleration** over baseline models, without sacrificing accuracy across diverse tasks such as RAG, multi-turn conversation, and long-document summarization. The paper further shows that REFRAG's context compression effectively **extends the LLM's context window**, improving accuracy across a range of applications.
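Since the compress-sense-expand mechanism is easy to gloss over in prose, here is a minimal sketch of the idea in PyTorch: retrieved context is split into fixed-size chunks, each chunk is collapsed into a single embedding in the decoder's input space, and a lightweight score head (a stand-in of our own for the paper's learned selection policy, not its actual implementation) decides which few chunks get expanded back into full token embeddings. Every class, function, and parameter name below is a hypothetical illustration, not the authors' code.

```python
# Sketch only: assumes chunk size k, a small chunk encoder, and a simple
# score head replacing the paper's selection policy.
import torch
import torch.nn as nn

class ChunkCompressor(nn.Module):
    """Compress each k-token chunk of retrieved context into a single
    vector in the decoder's embedding space (one slot instead of k)."""
    def __init__(self, vocab_size: int, enc_dim: int, dec_dim: int, k: int):
        super().__init__()
        self.k = k
        self.tok_emb = nn.Embedding(vocab_size, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.proj = nn.Linear(enc_dim, dec_dim)  # align with decoder space
        self.scorer = nn.Linear(dec_dim, 1)      # "sensing" head (hypothetical)

    def forward(self, ctx_ids: torch.Tensor):
        # ctx_ids: (num_chunks, k) token ids of the retrieved passages
        h = self.encoder(self.tok_emb(ctx_ids))      # (num_chunks, k, enc_dim)
        chunk_emb = self.proj(h.mean(dim=1))         # (num_chunks, dec_dim)
        scores = self.scorer(chunk_emb).squeeze(-1)  # (num_chunks,)
        return chunk_emb, scores

def build_decoder_inputs(compressor, ctx_ids, question_emb, expand_fraction=0.25):
    """Keep most chunks as one compressed embedding each; expand only the
    highest-scoring fraction back to full token embeddings, shrinking the
    sequence the decoder must attend over during prefill."""
    chunk_emb, scores = compressor(ctx_ids)
    num_expand = max(1, int(expand_fraction * ctx_ids.size(0)))
    expand_idx = set(scores.topk(num_expand).indices.tolist())
    parts = []
    for i in range(ctx_ids.size(0)):
        if i in expand_idx:
            # Expanded chunk: k token embeddings (reusing tok_emb + proj here
            # as a stand-in for the decoder's own embedding table).
            parts.append(compressor.proj(compressor.tok_emb(ctx_ids[i])))
        else:
            parts.append(chunk_emb[i].unsqueeze(0))  # compressed: 1 slot
    return torch.cat(parts + [question_emb], dim=0)  # (shorter_seq, dec_dim)

# Toy usage: 8 chunks of 16 tokens -> far fewer decoder positions.
compressor = ChunkCompressor(vocab_size=32000, enc_dim=128, dec_dim=256, k=16)
ctx = torch.randint(0, 32000, (8, 16))
question = torch.randn(10, 256)
dec_in = build_decoder_inputs(compressor, ctx, question)
print(dec_in.shape)  # torch.Size([48, 256]) vs. 138 positions uncompressed (8*16 + 10)
```

The intuition for the TTFT gains follows from this shape change: prefill attention cost grows quadratically with sequence length, so replacing most k-token chunks with single embeddings cuts the sequence by roughly a factor of k and the attention work by far more.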
Source:
https://arxiv.org/pdf/2509.01092