Woosuk Kwon is CTO of Inferact and creator of the vLLM inference library. Woosuk shares what it takes to build the most popular open-source LLM inference engine from a human-centered perspective.
Outline:
0:00 - Prelude: Introducing Woosuk and Inferact
3:00 - Woosuk’s First PhD Project
6:00 - How the vLLM Project Got Started
9:18 - AI Infra Needs More Than Just Efficiency
14:08 - How AI Infra and Human-centered AI Are Connected
15:01 - How to Prioritize Feature Requests for Popular AI Infra
18:18 - Streaming Requests and Realtime API
24:05 - Multi-turn, Agentic, Proactive LLMs
27:03 - How to Design AI Infra in a Principled Way
29:13 - How to Design an AI Inference Engine for Continual Learning with RL
35:05 - Would LoRA Training Affect RL Infra Design?
37:28 - Why Start an AI Inference Infra Startup?
40:46 - What Effortless Inference with Open-source Models Means for Developers
43:46 - A Vision for On-device AI Inference
46:19 - Can Today’s Coding Agents Create vLLM?
References:
Inferact: https://inferact.ai/
Efficient Memory Management for Large Language Model Serving with PagedAttention: https://arxiv.org/abs/2309.06180
Streaming Requests & Realtime API in vLLM: https://vllm.ai/blog/streaming-realtime
RL’s Razor: Why Online Reinforcement Learning Forgets Less: https://arxiv.org/abs/2509.04259
Podcast Links:
Podcast website: https://augmented-mind.github.io/
Apple Podcasts: https://podcasts.apple.com/us/podcast/augmented-mind-podcast/id1868102170
Spotify: https://open.spotify.com/show/40KculkYTe2tOpqJm6TAYr?si=PU_UncsMT4mXjVNCRwoXog&nd=1&dlsi=6d9bed7a43d64085
RSS: https://anchor.fm/s/10dbf5b7c/podcast/rss
About the Hosts:
The AM Podcast is hosted by Yijia Shao, Shannon Shen, and Michael Ryan, CS PhD students at Stanford University and MIT.