May 27, 2026

Value Residual Learning

22 minutes

In this episode:
• Welcome and the Over-smoothing Problem: Professor Norris and Linda introduce the episode and discuss how deep Transformers suffer from over-smoothing, losing initial token-level information in later layers.
• Introducing ResFormer and Value Residuals: Linda explains the core mechanism of ResFormer, which adds a residual connection specifically to the Value vectors from the first layer.
• Efficiency and Performance Gains: The hosts analyze the impressive efficiency metrics of ResFormer, including reductions in parameter count and training data, and address whether the model is actually better or just faster.
• SVFormer and the KV Cache Dilemma: The discussion shifts to SVFormer, a variant that shares a single value state across all layers to cut KV cache memory in half, and its relationship with sequence length.
• Final Thoughts and Wrap-up: Norris and Linda conclude with thoughts on how this paper impacts the deployment of large language models and sign off.
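
For listeners who want to see the value residual idea in code, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the single-head causal attention layer, the function names (attention_layer, run_stack), the toy dimensions, and the fixed mixing weight mix are all hypothetical choices made for this example. The only part taken from the episode description is that later layers reuse the Value vectors computed in the first layer.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_layer(h, wq, wk, wv, v_first=None, mix=0.5):
    """Single-head causal self-attention with an optional value residual.

    h: (seq_len, d_model) hidden states; wq, wk, wv: (d_model, d_model).
    v_first: Value vectors from the first layer, or None in the first layer.
    mix: hypothetical fixed weight on this layer's own values (an assumption).
    Returns the updated hidden states and this layer's own value vectors.
    """
    q, k, v = h @ wq, h @ wk, h @ wv
    if v_first is None:
        v_used = v
    else:
        # Value residual: blend this layer's values with the first layer's.
        v_used = mix * v + (1.0 - mix) * v_first
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v_used
    return h + out, v  # the usual hidden-state residual is kept as well


def run_stack(x, layers, mix=0.5):
    """Run the layers in order, passing first-layer values to later layers."""
    h, v_first = x, None
    for wq, wk, wv in layers:
        h, v = attention_layer(h, wq, wk, wv, v_first=v_first, mix=mix)
        if v_first is None:
            v_first = v  # remember the first layer's values
    return h


rng = np.random.default_rng(0)
d_model, seq_len, n_layers = 8, 5, 3
layers = [
    tuple(rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    for _ in range(n_layers)
]
x = rng.normal(size=(seq_len, d_model))
y = run_stack(x, layers)
print(y.shape)
```

Setting mix=1.0 in this sketch turns the value residual off and recovers a plain attention stack, which makes it easy to compare the two variants on the same weights.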

By Mechanical Dirk
