In this episode:
• A Noble Introduction: Professor Norris makes a pun about aristocracy while Linda introduces the paper 'NOBLE' from Canva Research, setting the stage for a discussion on accelerating Transformer pretraining.
• The Linear Collapse Problem: Linda explains why standard LoRA doesn't work for pretraining from scratch, and Norris helps clarify the difference between parameter-efficient fine-tuning and architectural augmentation (a small collapse demo follows this list).
• Anatomy of a Nonlinear Branch: A deep dive into the NOBLE architecture and the 'CosNet' activation function, discussing why a cosine sandwich is better than ReLU for low-rank bottlenecks (a sketch of such a branch also appears after the list).
• Crunching the Numbers: The hosts discuss the experimental results, highlighting the 1.47x speedup in pretraining steps and debating whether the wall-clock savings justify the extra parameter overhead.
• The Mixup Mystery: Linda reveals a fascinating caveat regarding Mixup/CutMix augmentation, leading to a theoretical realization about NOBLE's role in learning high-frequency signals versus smooth global trends.
• Inference and Impact: The duo wraps up by discussing the trade-offs, most notably the permanent inference-time cost, and delivers a final verdict on whether NOBLE is the future of pretraining.
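
To make the "linear collapse" point from the second bullet concrete, here is a minimal PyTorch sketch (not taken from the paper) showing that a purely linear low-rank branch is algebraically equivalent to a single weight matrix, which is why it adds nothing when the base weights are also trained from scratch:

```python
import torch

# Minimal sketch: a LoRA-style linear branch added to a trainable base weight.
# W x + B (A x) equals (W + B A) x, so during pretraining the low-rank branch
# "collapses" into W and contributes no extra expressive power.
d, r = 64, 4
W = torch.randn(d, d, dtype=torch.float64)
A = torch.randn(r, d, dtype=torch.float64) * 0.1
B = torch.randn(d, r, dtype=torch.float64) * 0.1
x = torch.randn(d, dtype=torch.float64)

with_branch = W @ x + B @ (A @ x)   # base path plus low-rank branch
merged      = (W + B @ A) @ x       # single equivalent linear map

print(torch.allclose(with_branch, merged))  # True
```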
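And for the third bullet, a hypothetical sketch of what a nonlinear low-rank branch with a cosine activation might look like; the class name, layer shapes, and the use of a plain torch.cos are illustrative assumptions, not the paper's actual NOBLE/CosNet definition:

```python
import torch
import torch.nn as nn

class NonlinearLowRankBranch(nn.Module):
    """Hypothetical NOBLE-style layer (illustrative only): a base linear map
    plus a low-rank bottleneck with a cosine nonlinearity sandwiched between
    the down- and up-projections."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path plus the nonlinear low-rank path; torch.cos stands in for
        # the 'CosNet' activation discussed in the episode.
        return self.base(x) + self.up(torch.cos(self.down(x)))

layer = NonlinearLowRankBranch(d_model=64, rank=4)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```

Because of the cosine, the down/up projections can never be folded back into the base weight after training, which is exactly the permanent inference cost the hosts weigh in the final segment.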