AI + a16z

Benchmarking AI Agents on Full-Stack Coding



In this episode, a16z General Partner Martin Casado sits down with Sujay Jayakar, co-founder and Chief Scientist at Convex, to talk about his team’s latest work benchmarking AI agents on full-stack coding tasks. From designing Fullstack-Bench to the quirks of agent behavior, the two dig into what’s actually hard about autonomous software development, and why robust evals, along with guardrails like type safety, matter more than ever. They also get tactical: which models perform best for real-world app building? How should developers think about trajectory management and variance across runs? And what changes when you treat your toolchain like part of the prompt? Whether you're a hobbyist developer or building the next generation of AI-powered devtools, Sujay’s systems-level insights are not to be missed.

Drawing from Sujay’s work developing Fullstack-Bench, they cover:

  • Why full-stack coding is still a frontier task for autonomous agents
  • How type safety and other “guardrails” can significantly reduce variance and failure
  • What makes a good eval—and why evals might matter more than clever prompts
  • How different models perform on real-world app-building tasks (and what to watch out for)
  • Why your toolchain might be the most underrated part of the prompt
  • And what all of this means for devs—from hobbyists to infra teams building with AI in the loop

Learn More:

Introducing Fullstack-Bench

Follow everyone on X:

Sujay Jayakar

Martin Casado

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

AI + a16z by a16z

4.6

26 ratings

