


What if the biggest bottleneck in AI agent performance isn’t the model itself—but what it doesn’t know how to do?
In this episode, we explore SkillsBench, the first benchmark that systematically measures how structured procedural knowledge—called Agent Skills—impacts AI agent performance across real-world tasks. The results are striking: curated Skills boost agent success rates by 16 percentage points on average, with some domains like Healthcare seeing gains above 50 points. But here’s the twist—when models try to generate their own Skills, performance actually drops. The takeaway? AI agents desperately need human expertise to unlock their full potential.
Inspired by the work of Xiangyi Li, Wenbo Chen, Yimin Liu, and colleagues, this episode was created using Google’s NotebookLM.
Read the original paper here: https://arxiv.org/pdf/2602.12670
By Anlie Arnaudy, Daniel Herbera and Guillaume Fournier