This research paper introduces the Differential Transformer, a novel architecture for large language models (LLMs) that aims to make attention mechanisms more effective. The core innovation is the differential attention mechanism, which computes attention scores as the difference between two separate softmax attention maps. This subtraction cancels common-mode noise in the attention scores, allowing the model to focus more effectively on relevant context. The authors demonstrate that the Differential Transformer consistently outperforms the standard Transformer across a range of evaluations, including language modeling, long-context modeling, key information retrieval, in-context learning, hallucination mitigation, and the reduction of activation outliers. The paper also examines scalability, showing that the Differential Transformer requires fewer parameters and training tokens than the Transformer to reach comparable performance. Overall, the Differential Transformer is presented as a highly effective and promising architecture for advancing LLMs.
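
To make the core idea concrete, here is a minimal sketch of the subtraction step described above: two softmax attention maps are computed from separate query/key groups and their weighted difference is applied to the values. The function name `differential_attention`, the scalar weight `lam`, and the tensor shapes are illustrative assumptions for this sketch; the paper's full formulation (per-head projections, the re-parameterization of the subtraction weight, and per-head normalization) is more involved.

```python
import torch
import torch.nn.functional as F

def differential_attention(q1, k1, q2, k2, v, lam):
    """Toy single-head differential attention (illustrative sketch).

    Attention scores are taken as the difference of two softmax maps,
    which suppresses noise common to both maps. Shapes assumed here:
    q*/k* are [batch, seq, d], v is [batch, seq, d_v], lam is a scalar.
    """
    d = q1.shape[-1]
    # Two independent softmax attention maps
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    # Subtract the second map (scaled by lam) to cancel shared noise,
    # then aggregate the values with the resulting scores
    return (a1 - lam * a2) @ v

# Tiny usage example with random tensors (lam=0.8 is an arbitrary choice)
q1, k1, q2, k2 = (torch.randn(1, 4, 8) for _ in range(4))
v = torch.randn(1, 4, 16)
out = differential_attention(q1, k1, q2, k2, v, lam=0.8)
print(out.shape)  # torch.Size([1, 4, 16])
```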