
Shay breaks down the 2017 paper "Attention Is All You Need" and introduces the transformer: a non-recurrent architecture that uses self-attention to process entire sequences in parallel.
The episode explains positional encoding, how self-attention creates context-aware token representations, the three key advantages over RNNs (parallelization, global receptive field, and precise signal mixing), the quadratic computational trade-off, and teases a follow-up episode that will dive into the math behind attention.
By Sheetal ’Shay’ Dhar
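As a rough companion to the episode summary above, here is a minimal NumPy sketch (not from the episode itself) of the two mechanisms it names: sinusoidal positional encoding and scaled dot-product self-attention. All sizes and variable names are illustrative assumptions; the score matrix of shape (seq_len, seq_len) is what gives rise to the quadratic trade-off mentioned above.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dims use sine, odd dims cosine."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the whole sequence at once.
    Every token attends to every other token, so cost grows quadratically
    with sequence length, but the work is fully parallel across positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # context-aware token vectors

# Toy usage: 6 tokens, model width 16 (illustrative, not from the episode)
seq_len, d_model = 6, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (6, 16)
```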