Learning GenAI via SOTA Papers

EP031: DeepMind RETRO Swaps Memorization For Retrieval



The paper "Improving language models by retrieving from trillions of tokens" by DeepMind introduces Retro (Retrieval-Enhanced Transformer), a semi-parametric autoregressive language model that enhances its predictions by directly retrieving information from a massive database of up to 2 trillion tokens.

Instead of relying solely on increasing a model's parameter count to improve memorization and performance, Retro separates the model's computation from its memory. It does this by splitting input sequences into fixed-size chunks (64 tokens in the paper), using a frozen BERT model to retrieve similar text from the database for each chunk, and integrating the retrieved text into its predictions through a chunked cross-attention (CCA) mechanism.
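The retrieval side of this pipeline can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the `embed` function below is a stand-in for the frozen BERT encoder (the real system averages frozen BERT activations and searches an approximate-nearest-neighbour index over trillions of tokens), and the chunk length is shrunk so the example stays readable.

```python
import numpy as np

def split_into_chunks(tokens, chunk_len):
    """Split a token sequence into fixed-size chunks (Retro uses 64 tokens)."""
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]

def embed(chunk, dim=16):
    """Stand-in for the frozen BERT encoder: a normalized bag-of-tokens
    vector. Hypothetical; used only so retrieval below is runnable."""
    vec = np.zeros(dim)
    for t in chunk:
        vec[t % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve_neighbours(chunk, database, k=2):
    """Return the k database chunks closest (by cosine similarity) to the
    query chunk. Retro would use an approximate index at this step."""
    q = embed(chunk)
    return sorted(database, key=lambda d: -float(q @ embed(d)))[:k]

# Toy retrieval database and input sequence
database = [[1, 2, 3, 4], [5, 6, 7, 8], [1, 2, 5, 6], [9, 9, 9, 9]]
tokens = [1, 2, 3, 4, 5, 6, 7, 8]

chunks = split_into_chunks(tokens, chunk_len=4)
neighbours = [retrieve_neighbours(c, database) for c in chunks]
```

Each input chunk ends up paired with its nearest database chunks; in the full model, those neighbours are encoded and attended to via chunked cross-attention when predicting the tokens of the *next* chunk, which preserves autoregressive causality.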

Key highlights of the paper include:

High Efficiency: Retro achieves performance comparable to massive models like GPT-3 and Jurassic-1 on datasets like the Pile, despite using 25 times fewer parameters.

Scalability: The architecture scales effectively; performance consistently improves as both the model size and the retrieval database size increase.

Downstream Capabilities: Existing pre-trained transformers can be rapidly "Retro-fitted" with this retrieval mechanism to achieve good performance, and the model can be fine-tuned for knowledge-intensive tasks like question answering.

Addressing Data Leakage: Because a retrieval model has direct access to its training data at evaluation time, the authors introduce an evaluation methodology that quantifies overlap between test chunks and the retrieval database, showing that Retro's strong performance stems from both direct knowledge extraction and genuine generalization.
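The leakage measurement above boils down to asking, for each evaluation chunk, how much of it also appears verbatim in the training/retrieval corpus. A minimal sketch of such an overlap metric, using token n-gram overlap as a simplified proxy for the paper's chunk-overlap measure:

```python
def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(eval_chunk, train_corpus, n=2):
    """Fraction of the eval chunk's n-grams that also occur anywhere in the
    training corpus. 0.0 means no verbatim overlap; 1.0 means the chunk is
    fully covered by training n-grams. (Simplified; n is tiny here so the
    toy example is meaningful.)"""
    eval_ng = ngrams(eval_chunk, n)
    if not eval_ng:
        return 0.0
    train_ng = set()
    for doc in train_corpus:
        train_ng |= ngrams(doc, n)
    return len(eval_ng & train_ng) / len(eval_ng)

# Toy example: 2 of the eval chunk's 3 bigrams appear in the corpus
ratio = overlap_ratio([1, 2, 3, 4], [[1, 2, 5], [3, 4, 6]], n=2)
```

Bucketing evaluation chunks by this ratio lets one report performance separately on "leaked" versus "novel" text, which is the spirit of the paper's analysis.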


Learning GenAI via SOTA Papers, by Yun Wu