Machine Learning Street Talk (MLST)

Transformers Need Glasses! - Federico Barbero



Federico Barbero (DeepMind/Oxford) is the lead author of "Transformers Need Glasses!".


Have you ever wondered why LLMs struggle with seemingly simple tasks like counting or copying long strings of text? We break down the theoretical reasons behind these failures, revealing architectural bottlenecks and the challenges of maintaining information fidelity across extended contexts.
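To make the copying failure concrete, here is a minimal NumPy sketch (our illustration, not the paper's exact construction): if a long sequence is read out with near-uniform attention, the pooled representations of "n ones followed by a zero" and "n+1 ones" differ only by O(1/n), which eventually falls below float16 resolution, so the last token becomes literally indistinguishable.

```python
import numpy as np

# Toy model of representational collapse: read out a sequence with
# near-uniform attention (i.e. a mean over token values) in float16.
for n in [10, 1_000, 100_000]:
    seq_a = np.array([1.0] * n + [0.0])   # n ones, then a different last token
    seq_b = np.array([1.0] * (n + 1))     # n + 1 ones
    pooled_a = np.float16(seq_a.mean())   # uniform-attention readout
    pooled_b = np.float16(seq_b.mean())
    print(f"n={n:>7}  {pooled_a:.5f} vs {pooled_b:.5f}  "
          f"indistinguishable={pooled_a == pooled_b}")
```

By n = 100,000 the two readouts round to the same float16 value, even though the sequences differ in their final token.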


Federico explains how these issues are rooted in the transformer's design, drawing parallels to over-squashing in graph neural networks and detailing how the softmax function limits sharp decision-making.
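For a feel of the softmax point, here is a minimal NumPy sketch in the spirit of "Softmax is Not Enough" (referenced below), under the assumption that logits are bounded by some constant B, as they effectively are in a trained network: the sharpest attention distribution the model can produce flattens as the context grows.

```python
import numpy as np

# Softmax dispersion: with logits bounded in [-B, B], the largest
# attention weight any single token can receive decays toward 0
# as the sequence length n grows.
B = 3.0  # assumed bound on logit magnitude

for n in [10, 1_000, 100_000]:
    logits = np.full(n, -B)
    logits[0] = B                      # best case for attending to one token
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    print(f"n={n:>7}  max attention weight = {w.max():.4f}")
```

With B = 3, the maximum weight drops from roughly 0.98 at n = 10 to under 0.01 at n = 100,000: attention cannot stay sharp at scale.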


But it's not all bad news! Discover practical "glasses" that can help transformers see more clearly, from simple input modifications to architectural tweaks.
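As one flavour of the "input modification" idea, a hypothetical helper in the spirit of the fixes discussed (not code from the paper): breaking long uniform runs with separators so that positions stop looking identical to attention.

```python
def add_separators(tokens, every=10, sep=","):
    """Insert `sep` after every `every` tokens so that long uniform runs
    (e.g. '1 1 1 ... 1') carry positional landmarks."""
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        if i % every == 0:
            out.append(sep)
    return " ".join(out)

# 25 repeated tokens become ten-token groups, easier to count and copy
print(add_separators(["1"] * 25))
```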


SPONSOR MESSAGES:

***

CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. Check out their super fast DeepSeek R1 hosting!

https://centml.ai/pricing/


Tufa AI Labs is a brand-new research lab in Zurich started by Benjamin Crouzier, focused on o-series-style reasoning and AGI. They are hiring a Chief Engineer and ML engineers, and host events in Zurich.


Go to https://tufalabs.ai/

***


https://federicobarbero.com/


TRANSCRIPT + RESEARCH:

https://www.dropbox.com/s/h7ys83ztwktqjje/Federico.pdf?dl=0


TOC:

1. Transformer Limitations: Token Detection & Representation

[00:00:00] 1.1 Transformers fail at single token detection

[00:02:45] 1.2 Representation collapse in transformers

[00:03:21] 1.3 Experiment: LLMs fail at copying last tokens

[00:18:00] 1.4 Attention sharpness limitations in transformers


2. Transformer Limitations: Information Flow & Quantization

[00:18:50] 2.1 Unidirectional information mixing

[00:18:50] 2.2 Unidirectional information flow toward the beginning of the sequence in transformers

[00:21:50] 2.3 Diagonal attention heads as expensive no-ops in LLaMA/Gemma

[00:27:14] 2.4 Sequence entropy affects transformer model distinguishability

[00:30:36] 2.5 Quantization limitations lead to information loss & representational collapse

[00:38:34] 2.6 LLMs use subitizing as opposed to counting algorithms


3. Transformers and the Nature of Reasoning

[00:40:30] 3.1 Turing completeness conditions in transformers

[00:43:23] 3.2 Transformers struggle with sequential tasks

[00:45:50] 3.3 Windowed attention as solution to information compression

[00:51:04] 3.4 Chess engines: mechanical computation vs creative reasoning

[01:00:35] 3.5 Epistemic foraging introduced


REFS:

[00:01:05] Transformers Need Glasses!, Barbero et al.

https://proceedings.neurips.cc/paper_files/paper/2024/file/b1d35561c4a4a0e0b6012b2af531e149-Paper-Conference.pdf


[00:05:30] Softmax is Not Enough, Veličković et al.

https://arxiv.org/abs/2410.01104


[00:11:30] Advanced Algorithms, Lecture 15, Chawla

https://pages.cs.wisc.edu/~shuchi/courses/787-F09/scribe-notes/lec15.pdf


[00:15:05] Graph Attention Networks, Veličković

https://arxiv.org/abs/1710.10903


[00:19:15] Extract Training Data, Carlini et al.

https://arxiv.org/pdf/2311.17035


[00:31:30] 1-bit LLMs, Ma et al.

https://arxiv.org/abs/2402.17764


[00:38:35] LLMs Solve Math, Nikankin et al.

https://arxiv.org/html/2410.21272v1


[00:38:45] Subitizing, Railo

https://link.springer.com/10.1007/978-1-4419-1428-6_578


[00:43:25] NN & Chomsky Hierarchy, Delétang et al.

https://arxiv.org/abs/2207.02098


[00:51:05] Measure of Intelligence, Chollet

https://arxiv.org/abs/1911.01547


[00:52:10] AlphaZero, Silver et al.

https://pubmed.ncbi.nlm.nih.gov/30523106/


[00:55:10] Golden Gate Claude, Anthropic

https://www.anthropic.com/news/golden-gate-claude


[00:56:40] Chess Positions, Chase & Simon

https://www.sciencedirect.com/science/article/abs/pii/0010028573900042


[01:00:35] Epistemic Foraging, Friston

https://www.frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2016.00056/full
