May 04, 2025

Computation and Language - TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

5 minutes

Alright Learning Crew, Ernis here, ready to dive into some seriously cool research! Today, we're talking about AI agents in the workplace. Now, we all use computers every day, right? Think about your work – how much of it is digital, done online? Well, researchers are wondering just how good AI is getting at actually doing some of that work for us.

We've seen these amazing leaps in AI, especially with Large Language Models (LLMs). These aren't just chatbots anymore; they're AI agents that can interact with their environment, like browsing the web or even writing code. So, the big question is: can these AI agents actually perform real-world professional tasks?

This is a HUGE deal for companies thinking about using AI, and also for policymakers trying to figure out what AI means for jobs. Are robots going to take over? Or can they just help us be more efficient?

That's where this paper comes in. These researchers created something called TheAgentCompany. Think of it as a digital playground, a simulated small software company. It's got everything a real company has: internal websites, data, and even tasks that employees would normally do.

They built this environment specifically to test AI agents. The agents have to browse the web, write code, run programs, and even communicate with “coworkers” (other AI or simulated humans). It's like The Sims, but for AI and work!

"We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company."

So, what did they find? Well, they tested a bunch of different AI agents, some powered by big companies' APIs (like OpenAI), and others using open-source models. The results are… interesting. The best agent could complete about 24% of the tasks completely on its own.

That might not sound like much, but think about it: almost a quarter of the tasks could be automated! That's a good starting point. It's like having an intern who can reliably handle some of the easier, more routine jobs without needing constant supervision.

But here's the catch: the more complex, long-term tasks? Still beyond the reach of current AI. Think of it like this: AI can probably write a simple email, but it can't yet manage an entire marketing campaign from start to finish.

So, what does this all mean? This research paints a pretty nuanced picture. AI is getting good at automating simpler tasks in a workplace setting, but we're still a ways off from fully autonomous digital workers. There's still a lot of human in the loop!

Here are a couple of questions that popped into my head:

If AI can handle 24% of tasks autonomously now, how quickly will that number increase in the next few years?

What are the ethical implications of using AI agents in the workplace, especially when it comes to job displacement and data privacy?

This is definitely a conversation we need to keep having, Learning Crew. What do you think? Let me know your thoughts!

Credit to Paper authors: Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig

By ernestasposkus
