Mechanical Dreams

NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches



In this episode:
• A Noble Introduction: Professor Norris makes a pun about aristocracy while Linda introduces the paper 'NOBLE' from Canva Research, setting the stage for a discussion on accelerating Transformer pretraining.
• The Linear Collapse Problem: Linda explains why standard LoRA doesn't work for pretraining from scratch, and Norris helps clarify the difference between parameter-efficient fine-tuning and architectural augmentation (a small sketch of the collapse follows this list).
• Anatomy of a Nonlinear Branch: A deep dive into the NOBLE architecture and the 'CosNet' activation function, discussing why a cosine sandwich is better than ReLU for low-rank bottlenecks.
• Crunching the Numbers: The hosts discuss the experimental results, highlighting the 1.47x step speedup and debating whether the parameter overhead is worth the wall-clock time savings.
• The Mixup Mystery: Linda reveals a fascinating caveat regarding Mixup/CutMix augmentation, leading to a theoretical realization about NOBLE's role in learning high-frequency signals versus smooth global trends.
• Inference and Impact: The duo wraps up by discussing the trade-offs, specifically the permanent inference cost, and gives their final verdict on whether NOBLE is the future of pretraining.
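
For listeners who want the "linear collapse" point in concrete form, here is a minimal illustrative sketch. It is not code from the paper: a plain cosine stands in as a hypothetical placeholder for the CosNet activation, and the variable names are our own. The point it demonstrates is general: two stacked low-rank linear maps collapse into one low-rank matrix, while a nonlinearity in the bottleneck prevents that collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # model width and low-rank bottleneck width

A = rng.normal(size=(r, d)) / np.sqrt(d)  # down-projection
B = rng.normal(size=(d, r)) / np.sqrt(r)  # up-projection
x = rng.normal(size=(d,))

# Purely linear low-rank branch: B @ (A @ x) == (B @ A) @ x,
# so the two factors collapse into a single rank-r linear map.
collapsed = B @ A
assert np.allclose(B @ (A @ x), collapsed @ x)
print("rank of collapsed map:", np.linalg.matrix_rank(collapsed))  # <= r

# A nonlinearity between the projections breaks the collapse.
# (Plain cosine here is only a stand-in, not the paper's CosNet.)
def nonlinear_branch(x):
    return B @ np.cos(A @ x)

# This branch is no longer expressible as one matrix, so it can
# contribute features (e.g. high-frequency ones) the linear map cannot.
print(nonlinear_branch(x)[:4])
```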

By Mechanical Dirk