LLMs Research Podcast

Your 70-Billion-Parameter Model Might Be 40% Wasted



Three papers from February 1–6, 2026 converge on a question the field has been avoiding since 2016: what if most transformer layers aren't doing compositional reasoning at all, but just averaging noise?

This video traces a decade of evidence, from Veit et al.'s original ensemble observation in ResNets, through ShortGPT's layer-pruning results and October 2025's formal proof, to three new papers that quantify the consequences. Inverse depth scaling shows loss improving only as D^-0.30, worse than 1/n. TinyLoRA unlocks 91% GSM8K accuracy by training just 13 parameters with RL. And the attention sink turns out to be a native Mixture-of-Experts router hiding in plain sight.
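
A minimal back-of-the-envelope sketch (an illustration, not code from the paper, assuming loss scales as a pure power law in depth D) of what the reported -0.30 exponent implies compared with an idealized 1/D improvement:

```python
# Sketch only: contrast the reported depth-scaling exponent (-0.30) with an
# idealized 1/D baseline, assuming loss is a pure power law in depth D.
# Doubling depth under D^-0.30 cuts loss by only ~19%; under 1/D it halves it.
for depth in (16, 32, 64, 128):
    observed = depth ** -0.30   # loss proportional to D^-0.30 (reported exponent)
    idealized = 1.0 / depth     # loss proportional to 1/D (idealized averaging baseline)
    print(f"D={depth:4d}  D^-0.30={observed:.3f}  1/D={idealized:.4f}")
```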

The picture that emerges: modern LLMs are simultaneously too deep (layers averaging rather than composing) and too wide (attention heads collapsing into dormancy). Architecturally large, functionally much smaller.

This is a video adaptation of our LLMs Research newsletter issue covering the same papers.

Papers referenced (in order of appearance):

Residual Networks Behave Like Ensembles of Relatively Shallow Networks (Veit, Wilber, Belongie, 2016) https://arxiv.org/abs/1605.06431

Deep Networks with Stochastic Depth (Huang et al., 2016) https://arxiv.org/abs/1603.09382

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al., 2020) https://arxiv.org/abs/1909.11942

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (Men et al., 2024) https://arxiv.org/abs/2403.03853

Your Transformer is Secretly Linear (Razzhigaev et al., 2024) https://arxiv.org/abs/2405.12250

On Residual Network Depth (Dherin, Munn, 2025) https://arxiv.org/abs/2510.03470

Inverse Depth Scaling From Most Layers Being Similar (Liu, Kangaslahti, Liu, Gore, 2026) https://arxiv.org/abs/2602.05970

Learning to Reason in 13 Parameters / TinyLoRA (Morris, Mireshghallah, Ibrahim, Mahloujifar, 2026) https://arxiv.org/abs/2602.04118

Attention Sink Forges Native MoE in Attention Layers (Fu, Zeng, Wang, Li, 2026) https://arxiv.org/abs/2602.01203

Timestamps:

0:00 Why this should bother you
0:41 Veit 2016: ResNets as ensembles
2:14 Stochastic depth, ALBERT, and the quiet accumulation
3:08 ShortGPT, secretly linear transformers, and the formal proof
4:22 February 2026: this week's answer
4:38 Inverse depth scaling: D^-0.30
5:57 Where does capability actually live?
6:23 TinyLoRA: 13 parameters, 91% accuracy
8:35 Width: attention sinks as native MoE
10:58 What this means for architecture, fine-tuning, and inference
11:49 The decade-long arc

Newsletter: https://llmsresearch.substack.com
GitHub: https://github.com/llmsresearch



This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit llmsresearch.substack.com