November 05, 2025

Benchmarking Generalization: How AI Learns Beyond Training Data

36 minutes

In this episode of Inference Time Tactics, Rob and Cooper from Neurometric sit down with Yash Sharma, an AI researcher whose work is reshaping how we understand model generalization. Yash recently completed his PhD at the Max Planck Institute for Intelligent Systems and has held research roles at Google Brain, Meta AI, Amazon, Borealis AI, and IBM Research. His studies on compositional generalization, adversarial robustness, and long-tail benchmarks reveal when and why models succeed—or fail—at reasoning beyond their training data.

If you design inference-time systems, build agents that need to be reliable, or just want to understand what “generalization” actually means in practice, this conversation connects deep theory to actionable insight.

Key Topics

What it really means for AI systems to generalize beyond their training data

Why large language models still fail in novel or unpredictable scenarios

How inference-time compute can both amplify and reveal generalization limits

What these limits mean for building reliable, agentic AI systems

How to benchmark generalization in real-world settings

Yash’s “Let It Wag!” benchmark for testing long-tail and under-represented concepts

Why genuine scientific breakthroughs (like curing cancer) require more than scaling test-time compute

Connect with Yash Sharma:

Yash Sharma

Let It Wag! Benchmark

Paper: Pretraining Frequency Predicts Compositional Generalization of CLIP (NeurIPS 2024 Workshop)

Connect with Neurometric:

Website: https://www.neurometric.ai/

Substack: https://neurometric.substack.com/

X: https://x.com/neurometric/

Bluesky: https://bsky.app/profile/neurometric.bsky.social

Rob May

https://x.com/robmay

https://www.linkedin.com/in/robmay

Calvin Cooper

https://x.com/cooper_nyc_

https://www.linkedin.com/in/coopernyc

By NeuroMetric AI
