September 10, 2025

Ep 74: Chief Scientist of Together.AI Tri Dao On The End of Nvidia's Dominance, Why Inference Costs Fell & The Next 10X in Speed

58 minutes

Fill out this short listener survey to help us improve the show: https://forms.gle/bbcRiPTRwKoG2tJx8

Tri Dao, Chief Scientist at Together AI and Princeton professor who created FlashAttention and Mamba, discusses how inference optimization has driven costs down 100x since ChatGPT's launch through memory optimization, sparsity advances, and hardware-software co-design. He predicts the AI hardware landscape will shift from Nvidia's current 90% dominance to a more diversified ecosystem within 2-3 years, as specialized chips emerge for distinct workload categories: low-latency agentic systems, high-throughput batch processing, and interactive chatbots. Dao shares his surprise at AI models becoming genuinely useful for expert-level work, making him 1.5x more productive at GPU kernel optimization through tools like Claude Code and o1. The conversation explores whether current transformer architectures can reach expert-level AI performance or whether approaches like mixture of experts and state space models are necessary to achieve AGI at reasonable cost. Looking ahead, Dao sees another 10x cost reduction coming from continued hardware specialization, improved kernels, and architectural advances like ultra-sparse models, while emphasizing that the biggest challenge remains generating expert-level training data for domains lacking extensive internet coverage.

(0:00) Intro

(1:58) Nvidia's Dominance and Competitors

(4:01) Challenges in Chip Design

(6:26) Innovations in AI Hardware

(9:21) The Role of AI in Chip Optimization

(11:38) Future of AI and Hardware Abstractions

(16:46) Inference Optimization Techniques

(33:10) Specialization in AI Inference

(35:18) Deep Work Preferences and Low Latency Workloads

(38:19) Fleet Level Optimization and Batch Inference

(39:34) Evolving AI Workloads and Open Source Tooling

(41:15) Future of AI: Agentic Workloads and Real-Time Video Generation

(44:35) Architectural Innovations and AI Expert Level

(50:10) Robotics and Multi-Resolution Processing

(52:26) Balancing Academia and Industry in AI Research

(57:37) Quickfire

With your co-hosts:

@jacobeffron

- Partner at Redpoint, Former PM Flatiron Health

@patrickachase

- Partner at Redpoint, Former ML Engineer LinkedIn

@ericabrescia

- Former COO GitHub, Founder Bitnami (acq’d by VMware)

@jordan_segall

- Partner at Redpoint

By Redpoint Ventures


