May 31, 2026

One Learning Rate Doesn't Fit All: Layerwise Spectral Scheduling for Transformers

26 minutes

Shows that modern transformers are highly heterogeneous across layers and proposes layerwise learning rates based on the shape of each layer's weight spectrum, yielding up to a 1.5× training speedup on LLaMA/GPT-style models.

By Shaoqing Tan
