
The paper "Reformer: The Efficient Transformer" introduces a modified Transformer architecture designed to solve the high memory consumption and computational costs associated with training large models on long sequences.
The authors propose three specific techniques to improve efficiency:
• Locality-Sensitive Hashing (LSH) Attention: This replaces the standard dot-product attention—which has a complexity of O(L²)—with a mechanism that uses hashing to approximate nearest neighbors. This reduces the complexity to O(L log L), allowing the model to handle much longer sequences.
• Reversible Residual Layers: By using reversible layers, the model stores activations only once during the training process instead of storing them for every layer (N times). This significantly lowers the memory footprint.
• Chunking: The authors split activations inside feed-forward layers to process them in chunks, further saving memory.
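The reversible-layer idea above can be illustrated in a few lines. This is a minimal NumPy sketch, not the authors' code: F and G are stand-ins for the attention and feed-forward sublayers, and the point is only that the layer's inputs can be recomputed exactly from its outputs, so per-layer activations need not be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_f = rng.standard_normal((d, d)) * 0.1  # toy weights (illustrative only)
W_g = rng.standard_normal((d, d)) * 0.1

def F(x):  # stand-in for the attention sublayer
    return np.tanh(x @ W_f)

def G(x):  # stand-in for the feed-forward sublayer
    return np.tanh(x @ W_g)

def reversible_forward(x1, x2):
    # Reversible residual update: y1 = x1 + F(x2); y2 = x2 + G(y1)
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_backward(y1, y2):
    # Invert the layer: inputs are recomputed from outputs,
    # so activations need not be cached for every layer.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.standard_normal((4, d)), rng.standard_normal((4, d))
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_backward(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

Because the backward pass reconstructs inputs on the fly, memory no longer grows with the number of layers N, which is the saving the summary refers to.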
Key Results: The Reformer achieves performance on par with standard Transformer models while being orders of magnitude more memory-efficient and much faster when processing long sequences (e.g., up to 64,000 tokens).
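To make the LSH-attention idea concrete, here is a minimal sketch of angular LSH bucketing via random projections, in the spirit of the paper's random-rotation scheme (the shapes and sizes here are illustrative, not the paper's settings). Vectors that land in the same bucket are likely near neighbors, and attention is then computed only within buckets rather than over all L×L pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_buckets = 8, 4

def lsh_bucket(vecs, n_buckets, rng):
    # Project onto random directions, then take the argmax over
    # the concatenated [proj, -proj] to assign one of n_buckets
    # angular buckets to each vector.
    R = rng.standard_normal((vecs.shape[-1], n_buckets // 2))
    proj = vecs @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

queries = rng.standard_normal((16, d))
buckets = lsh_bucket(queries, n_buckets, rng)
# Each of the 16 vectors gets a bucket id in [0, n_buckets);
# restricting attention to within-bucket pairs is what cuts
# the cost from O(L^2) toward O(L log L).
```

In the full Reformer, queries and keys are shared, items are sorted by bucket, and multiple hash rounds reduce the chance that true neighbors are separated.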
By Yun Wu