
The paper "Reformer: The Efficient Transformer" introduces a modified Transformer architecture designed to solve the high memory consumption and computational costs associated with training large models on long sequences.
The authors propose three specific techniques to improve efficiency:
• Locality-Sensitive Hashing (LSH) Attention: This replaces the standard dot-product attention—which has a complexity of O(L²)—with a mechanism that uses hashing to approximate nearest neighbors. This reduces the complexity to O(L log L), allowing the model to handle much longer sequences.
• Reversible Residual Layers: By using reversible layers, the model stores activations only once during the training process instead of storing them for every layer (N times). This significantly lowers the memory footprint.
• Chunking: The authors split activations inside feed-forward layers to process them in chunks, further saving memory.
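The reversible-layer idea above can be illustrated in a few lines. This is a minimal NumPy sketch, not the authors' code: F and G are stand-ins for the attention and feed-forward sublayers, and the point is only that the layer's inputs can be recomputed exactly from its outputs, so per-layer activations need not be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_f = rng.standard_normal((d, d)) * 0.1  # toy weights (illustrative only)
W_g = rng.standard_normal((d, d)) * 0.1

def F(x):  # stand-in for the attention sublayer
    return np.tanh(x @ W_f)

def G(x):  # stand-in for the feed-forward sublayer
    return np.tanh(x @ W_g)

def reversible_forward(x1, x2):
    # Reversible residual update: y1 = x1 + F(x2); y2 = x2 + G(y1)
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_backward(y1, y2):
    # Invert the layer: inputs are recomputed from outputs,
    # so activations need not be cached for every layer.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.standard_normal((4, d)), rng.standard_normal((4, d))
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_backward(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

Because the backward pass reconstructs inputs on the fly, memory no longer grows with the number of layers N, which is the saving the summary refers to.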
Key Results: The Reformer achieves performance on par with standard Transformer models while being orders of magnitude more memory-efficient and much faster when processing long sequences (e.g., up to 64,000 tokens).
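To make the LSH-attention idea concrete, here is a minimal sketch of angular LSH bucketing via random projections, in the spirit of the paper's random-rotation scheme (the shapes and sizes here are illustrative, not the paper's settings). Vectors that land in the same bucket are likely near neighbors, and attention is then computed only within buckets rather than over all L×L pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_buckets = 8, 4

def lsh_bucket(vecs, n_buckets, rng):
    # Project onto random directions, then take the argmax over
    # the concatenated [proj, -proj] to assign one of n_buckets
    # angular buckets to each vector.
    R = rng.standard_normal((vecs.shape[-1], n_buckets // 2))
    proj = vecs @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

queries = rng.standard_normal((16, d))
buckets = lsh_bucket(queries, n_buckets, rng)
# Each of the 16 vectors gets a bucket id in [0, n_buckets);
# restricting attention to within-bucket pairs is what cuts
# the cost from O(L^2) toward O(L log L).
```

In the full Reformer, queries and keys are shared, items are sorted by bucket, and multiple hash rounds reduce the chance that true neighbors are separated.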
By Yun Wu