April 25, 2026

EP163: Why AI Models Only Remember Five Percent

22 minutes

The paper "Language Model Memory and Memory Models for Language" examines how much input information machine learning models can store in their hidden-layer vector embeddings. It finds that standard causal language models typically produce "information-poor" embeddings, because next-token prediction does not require the model to retain arbitrary details of its input. In contrast, autoencoders trained to regenerate their input form nearly perfect memories.

To improve memory retention and computational efficiency, the author introduces a parallelizable encoder-decoder memory model architecture. Key contributions and findings include:

Training Paradigms: The paper proposes combined objective functions, pairing next-token prediction with information-retention tasks such as copying, to help models form information-rich memories.
Curriculum Learning: The paper introduces a streamlined training approach in which a high-fidelity encoder is frozen and the decoders first learn to process memories before they learn next-token prediction.
Computational Efficiency: Substituting memory embeddings for token sequences reduces time-to-first-token, shrinks the KV cache, and increases token throughput during inference.
Benchmark Performance: Models trained with these combined objectives improve significantly on benchmarks that depend on input information, without compromising general language understanding.
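
As a rough illustration of the combined-objective idea, the sketch below mixes a next-token prediction loss with a copying loss. It is a minimal sketch and not the paper's implementation: the function names, the plain-Python logits, and the `copy_weight` mixing parameter are all assumptions made for illustration, since the summary above does not say how the two objectives are weighted.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_cross_entropy(logits_seq, targets):
    """Average cross-entropy over the positions of one sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_seq, targets)]
    return sum(losses) / len(losses)


def combined_loss(next_token_logits, next_token_targets,
                  copy_logits, copy_targets, copy_weight=0.5):
    """Weighted sum of a next-token loss and a copying loss.

    `copy_weight` is a hypothetical mixing parameter, not a value
    taken from the paper.
    """
    ntp = mean_cross_entropy(next_token_logits, next_token_targets)
    copy = mean_cross_entropy(copy_logits, copy_targets)
    return (1.0 - copy_weight) * ntp + copy_weight * copy
```

In this sketch, setting `copy_weight` to 0 recovers plain next-token prediction, while larger values push the model to keep enough of the input in its memory to reproduce it.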

The findings also have implications for retrieval-based models, suggesting that current embedding models often lack the necessary information density to identify arbitrary details within text chunks.

By Yun Wu
