AI: post transformers

NeurIPS 2025: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free



The research systematically investigates the effects of integrating various gating mechanisms into the standard softmax attention layer, comparing more than thirty configurations across dense and Mixture-of-Experts Large Language Models. The central finding is that applying an elementwise, head-specific sigmoid gate immediately after the Scaled Dot-Product Attention (SDPA) output consistently yields the largest overall performance improvement. This gating also improves training stability, allowing models to converge under larger learning rates and mitigating disruptive loss spikes during optimization. The gain is attributed to two factors: the gate introduces non-linearity into the otherwise low-rank attention mapping, and it produces input-dependent, sparse gating scores. Crucially, this sparsity normalizes attention dynamics and eliminates the 'attention sink' problem, in which initial tokens dominate attention scores, thereby enabling notably better long-context extrapolation. These benefits led to the incorporation of this gated attention design into the forthcoming Qwen3-Next models.
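
To make the mechanism concrete, here is a minimal PyTorch sketch of the design described above: an input-dependent, head-specific sigmoid gate applied elementwise to the SDPA output before the output projection. The module name, shapes, and the gate projection are illustrative assumptions, not the paper's exact implementation.

# Minimal sketch, assuming a standard multi-head attention layout;
# the gate projection (gate_proj) and class name are hypothetical.
import torch
import torch.nn.functional as F
from torch import nn


class GatedAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # One gate value per channel, i.e. per head and per head dimension.
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim) for SDPA.
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        # Input-dependent sigmoid gate applied elementwise after SDPA:
        # adds non-linearity and yields sparse (near-zero) activations.
        gated = attn * torch.sigmoid(self.gate_proj(x))
        return self.out_proj(gated)


# Example usage on random data.
if __name__ == "__main__":
    layer = GatedAttentionSketch(d_model=256, n_heads=8)
    y = layer(torch.randn(2, 16, 256))
    print(y.shape)  # torch.Size([2, 16, 256])

Because the gate is computed from the layer input rather than from the attention output, each token can suppress its own attention contribution, which is how the sparsity that counteracts the attention sink arises in this sketch.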


Source:

https://openreview.net/pdf?id=1b7whO4SfY


AI: post transformers, by mcgrof