
"Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" addresses the challenge of enabling transformer models to process sequences at inference time that are longer than those encountered during training. Traditional transformer language models rely on positional embedding methods (such as sinusoidal embeddings) that exhibit weak extrapolation capabilities, leading to degraded performance when processing extended contexts.
To solve this, the authors introduce Attention with Linear Biases (ALiBi), a simple and efficient method that removes positional embeddings entirely rather than adding them to word embeddings. Instead, ALiBi adds a static, non-learned bias directly to the query-key attention scores, penalizing each score in proportion to the distance between the query and key. This creates an inductive bias toward recency: attention between distant tokens is penalized most.
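The bias described above can be sketched in a few lines. The following is a minimal, framework-free illustration using plain Python lists rather than tensors; the function names (`alibi_slopes`, `alibi_bias`) are mine, and it assumes the paper's geometric slope schedule, which is defined for head counts that are powers of two.

```python
def alibi_slopes(num_heads):
    # Head-specific slopes from the paper: a geometric sequence
    # starting at 2^(-8/n) with the same ratio, for n heads.
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]

def alibi_bias(seq_len, num_heads):
    # Static, non-learned bias added to the query-key attention
    # scores before the softmax. For query position i and key
    # position j <= i, the penalty is -slope * (i - j); future
    # positions (j > i) stay masked for causal language modeling.
    return [
        [[-m * (i - j) if j <= i else float("-inf")
          for j in range(seq_len)]
         for i in range(seq_len)]
        for m in alibi_slopes(num_heads)
    ]  # shape: [num_heads][seq_len][seq_len]
```

For 8 heads this yields slopes 1/2, 1/4, ..., 1/256, matching the schedule reported in the paper; the farther a key is from the query, the larger the penalty, which is exactly the recency bias described above.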
The key benefits and findings of ALiBi include:
• Efficient Extrapolation: ALiBi allows models to be trained on shorter sequences—which is significantly faster and cheaper—while maintaining strong performance on much longer sequences at runtime.
• Reduced Resource Consumption: Because models can be trained on shorter inputs, ALiBi significantly reduces training time and memory usage. For example, a 1.3 billion parameter model trained on sequences of 1024 tokens with ALiBi achieves the same perplexity as a sinusoidal model trained on 2048 tokens, while training 11% faster and using 11% less memory.
• Superior Performance: ALiBi consistently outperforms existing positional methods, including sinusoidal, rotary, and T5 bias, on benchmarks such as WikiText-103 and the Toronto BookCorpus. It adds no runtime penalty and requires only a few lines of code to implement.
By Yun Wu