In this episode:
• Welcome & Introduction: Professor Norris and Linda welcome the listeners. Linda introduces the paper of the week, teasing the unexpected comeback of non-linear RNNs.
• The Expressivity Gap: Linear vs. Non-Linear RNNs: The hosts discuss how linear RNNs like Mamba and Gated DeltaNet dominate thanks to their efficiency, yet fundamentally lack the expressive power of classic non-linear RNNs on complex state-tracking tasks.
• The Real Bottleneck: State Capacity: Linda explains a key insight from the paper: traditional non-linear RNNs failed at language modeling and in-context retrieval not because of their non-linearity, but because they relied on small, vector-valued hidden states.
• Enter M²RNN: Matrix-Valued States: A deep dive into the Matrix-to-Matrix RNN architecture, focusing on how outer product state expansion and an independent forget gate allow it to achieve massive state capacities.
• Hardware Utilization & Systems Engineering: Professor Norris questions the computational cost. Linda explains the ingenious tiling tricks that maximize Tensor Core utilization without padding waste, plus a look at the authors' Tensor Parallelism strategies.
• Empirical Wins & The Power of Hybrids: Reviewing the benchmark results across state tracking, long-context retrieval, and language modeling, highlighting how replacing just a single layer of a hybrid architecture with an M²RNN layer yields large performance gains.
• Conclusion & Wrap-Up: Professor Norris admits he is convinced by the hybrid approach. The hosts summarize the main takeaways and sign off until the next episode.