
The dizzying pace of AI development has sparked fierce debate about automation in the workplace, with headlines swinging between predictions of radical transformation and cautious scepticism. What's been missing is concrete, objective evidence about what AI agents can actually do in real professional settings.
TLDR:
The TheAgentCompany benchmark offers that much-needed reality check: a fully simulated software company, complete with the digital tools professionals use daily (code repositories, file sharing, project management systems, and communication platforms), in which researchers can rigorously evaluate how AI agents perform on authentic workplace tasks.
The results are eye-opening.
Even the best models achieved just 24% full completion across 175 tasks, with particularly poor performance in social interaction and navigating complex software interfaces.
Counterintuitively, AI agents performed better on software engineering tasks than on seemingly simpler administrative or financial work—likely reflecting biases in available training data.
The specific failure patterns reveal fundamental limitations: agents struggle with basic common sense (failing to recognize that a Word document requires word-processing software), social intelligence (missing obvious implied actions after conversations), and web navigation (getting stuck on routine pop-ups that humans dismiss without thinking).
In one telling example, an agent unable to find a specific colleague attempted to rename another user to the desired name—a bizarre workaround that highlights flawed problem-solving approaches.
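To make the headline numbers concrete, here is a minimal sketch of how a checkpoint-style benchmark can separate full from partial task completion. The scoring function and the sample results below are illustrative assumptions for this sketch, not TheAgentCompany's exact grading rules.

```python
# Illustrative checkpoint-based scoring, loosely inspired by how
# TheAgentCompany grades tasks. The weighting scheme and the sample
# results below are hypothetical, not taken from the paper.

def score_task(passed: int, total: int) -> float:
    """Return 1.0 only when every checkpoint passes; otherwise give
    partial credit capped at 0.5, proportional to checkpoints passed."""
    if passed == total:
        return 1.0
    return 0.5 * passed / total

# Hypothetical per-task results: (checkpoints passed, checkpoints total).
results = [(3, 3), (1, 4), (0, 2), (2, 2), (1, 5)]

full_rate = sum(p == t for p, t in results) / len(results)
avg_score = sum(score_task(p, t) for p, t in results) / len(results)

print(f"Full completion rate: {full_rate:.0%}")    # 40% in this toy run
print(f"Mean partial score:   {avg_score:.2f}")
```

Under a scheme like this, an agent can accumulate partial credit on many tasks while fully finishing few of them, which is why a strict full-completion rate such as 24% can sit well below softer measures of progress.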
For anyone trying to separate AI hype from reality, this benchmark provides crucial context. The results show meaningful progress in automating certain professional tasks, but they confirm that human adaptability, contextual understanding, and social navigation remain essential in the modern workplace.
Curious to explore the benchmark yourself? Visit theagentcompany.com or find the code repository on GitHub to join this important conversation about the future of work.
Research: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Support the show
𝗖𝗼𝗻𝘁𝗮𝗰𝘁 my team and me to get business results, not excuses.
☎️ https://calendly.com/kierangilmurray/results-not-excuses
✉️ [email protected]
🌍 www.KieranGilmurray.com
📘 Kieran Gilmurray | LinkedIn
🦉 X / Twitter: https://twitter.com/KieranGilmurray
📽 YouTube: https://www.youtube.com/@KieranGilmurray