May 10, 2026

The Principles of Diffusion Models - with Jesse Lai (Sony AI)

55 minutes

We host Chieh-Hsin (Jesse) Lai, Staff Research Scientist at Sony AI and visiting professor at National Yang Ming Chiao Tung University, Taiwan, for a conversation about diffusion models, the technology behind tools like Stable Diffusion and most of the AI image and video generators you've seen in the last few years. Jesse recently co-authored The Principles of Diffusion Models with Stefano Ermon, and the book is quickly becoming a go-to reference in the field.

We start with what a generative model actually is, and what it means to "generate" an image or a sound. Jesse explains the core idea behind diffusion in plain terms. You start with pure noise, and a neural network gradually cleans it up, step by step, until a realistic image emerges.
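
For readers who want to see the "start from noise, clean it up step by step" idea in code, here is a minimal sketch. It is not taken from the episode or the book. It assumes a toy setting where the data is a 1-D Gaussian, so the score is known in closed form and stands in for the neural network. The names `DATA_MEAN`, `DATA_STD`, `score`, and `sample`, the noise schedule, and the step count are all illustrative choices.

```python
import numpy as np

# Toy assumption: the "data" is a 1-D Gaussian, so the score of the noised
# data is available in closed form and replaces a trained neural network.
DATA_MEAN, DATA_STD = 2.0, 0.5


def score(x, noise_std):
    """Gradient of the log-density of the data after adding Gaussian noise."""
    return -(x - DATA_MEAN) / (DATA_STD**2 + noise_std**2)


def sample(num_samples=10_000, max_noise=50.0, min_noise=1e-3,
           num_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    # Noise levels shrink geometrically from very noisy to almost clean.
    noise_levels = np.geomspace(max_noise, min_noise, num_steps + 1)
    # Start from (almost) pure noise.
    x = rng.normal(0.0, max_noise, size=num_samples)
    for current, nxt in zip(noise_levels[:-1], noise_levels[1:]):
        # One Euler step of the probability-flow ODE
        # dx/dsigma = -sigma * score(x, sigma): a small denoising move.
        x = x + (nxt - current) * (-current * score(x, current))
    return x


if __name__ == "__main__":
    samples = sample()
    print("sample mean:", samples.mean(), "sample std:", samples.std())
```

Under these assumptions the final samples should land close to the toy data distribution, with a mean near 2.0 and a standard deviation near 0.5. A real image model replaces `score` with a trained network and works in a much higher-dimensional space.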

From there, we talk about why diffusion has come to dominate so much of generative AI. Because the model builds an image gradually, you can guide it along the way, nudging the output toward what you actually want, refining details, or combining it with other controls. We also discuss the common critique that diffusion is slow, and how the field has largely addressed it through newer techniques such as consistency models and one-step generation.

We zoom out to the bigger picture, too. Jesse shares his view on world models and whether diffusion is the right foundation for them. We talk about what makes a generative model genuinely good versus just good at gaming benchmarks, and why evaluating creativity and realism is so much harder than scoring a multiple-choice test.

Timeline

00:12 — Intro and welcoming Jesse

00:47 — Why Jesse wrote the book, and who it's for

03:29 — The three families of diffusion models, and why they're really one idea

05:14 — What makes a good generative model

07:39 — How do you even measure if a generated image is good

08:59 — Why diffusion beats autoregressive models for images

10:33 — Is diffusion still slow? How generation got fast

11:12 — A simple intuition for what a "score" is

14:12 — How the different flavors of diffusion connect under the hood

14:42 — Diffusion for text and proteins

17:12 — Consistency models and the push for one-step generation

22:12 — Diffusion for world models: simulating reality in real time

26:12 — Do world models need to understand language

35:12 — Is diffusion the right tool, or just a convenient one

38:12 — What benchmarks actually tell us, and what they miss

46:12 — Closing thoughts and where to find the book

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

By Ravid Shwartz-Ziv & Allen Roush