PaperLedge

Computation and Language - IberBench LLM Evaluation on Iberian Languages



Hey everyone, Ernis here, and welcome back to PaperLedge! Today, we're diving into a fascinating paper that tackles a HUGE problem in the world of AI: how do we actually know if these fancy Large Language Models – LLMs for short – are any good, especially when it comes to languages other than English?

Think of it like this: Imagine you're teaching a parrot to speak. You can easily test its English vocabulary, right? But what if you wanted it to speak Spanish, Portuguese, or even Basque? Suddenly, finding reliable ways to test its understanding becomes a lot harder. That's the challenge these researchers are addressing.

Most of the tests and leaderboards we use to rank LLMs are heavily focused on English. It's like judging the parrot's overall intelligence solely on its English skills. These tests also tend to focus on basic language skills, rather than the kinds of tasks businesses and organizations actually need these models to perform. And, once a test is created, it usually stays the same for a long time.

That's where IberBench comes in. These researchers created a brand-new benchmark – essentially a comprehensive test – specifically designed to evaluate LLMs on languages spoken across the Iberian Peninsula (think Spain and Portugal) and Ibero-America.

Now, what makes IberBench so special? Well, a few things:

  • It's diverse: IberBench includes a whopping 101 datasets covering 22 different types of tasks. We're talking everything from figuring out the sentiment of a text (is it positive or negative?) to detecting toxic language online, to even summarizing long articles.
  • It's practical: It focuses on tasks that are actually useful in the real world, not just academic exercises.
  • It's dynamic: Unlike those static tests I mentioned earlier, IberBench is designed to be constantly updated and improved by the community. Think of it as a living, breathing evaluation system.
The researchers then put 23 different LLMs, ranging in size from tiny to pretty darn big, through the IberBench wringer. And the results? Some interesting insights:

"LLMs perform worse on industry-relevant tasks than in fundamental ones."

This means that even if an LLM is great at basic language skills, it might struggle with tasks that are actually useful for businesses. It also turns out that some languages, like Galician and Basque, consistently see lower performance. And, shockingly, some tasks were basically a coin flip for the LLMs – they performed no better than random chance! For other tasks, LLMs did okay, but they still weren't as good as systems specifically designed for those tasks.
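To make that "coin flip" image concrete: on a balanced two-class task (say, positive vs. negative sentiment), a model guessing at random lands near 50% accuracy, so scoring around 50% means it has learned essentially nothing about the task. Here's a tiny simulation of that baseline; a minimal sketch with illustrative numbers, not data from the paper:

```python
import random

random.seed(0)

# A balanced binary task: 10,000 gold labels, and a "model" that just guesses.
gold = [random.choice(["POS", "NEG"]) for _ in range(10_000)]
pred = [random.choice(["POS", "NEG"]) for _ in range(10_000)]

# Accuracy of pure guessing hovers around 0.50 on a balanced task.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"Random-guess accuracy: {accuracy:.3f}")
```

An LLM scoring in that neighborhood on a real benchmark task is, for practical purposes, flipping a coin.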

So, why does this matter? Well, for a few reasons:

  • For developers: IberBench provides a valuable tool for building and improving LLMs for these languages. It highlights where models are strong and where they need work.
  • For businesses: If you're thinking about using LLMs to automate tasks in Spanish, Portuguese, or other Iberian/Ibero-American languages, IberBench can help you choose the right model for your needs.
  • For everyone: By creating a more diverse and representative benchmark, IberBench helps to ensure that AI benefits everyone, not just English speakers.

And the best part? The entire evaluation pipeline, from the datasets to the leaderboard, is open-source. That means anyone can use it, contribute to it, and help make it even better!
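If you're curious what one of these evaluations looks like under the hood, here's a minimal sketch of scoring a model on a Spanish sentiment task. To be clear, this is not the actual IberBench pipeline; the model name and the two examples are placeholders I picked for illustration. But the basic loop (predict a label, compare to gold, report accuracy) is the same shape:

```python
# A minimal sketch of a sentiment evaluation, not the actual IberBench code.
# The model name and the examples below are illustrative placeholders.
from transformers import pipeline

# Load a Spanish-capable sentiment classifier from the Hugging Face Hub.
clf = pipeline("text-classification",
               model="pysentimiento/robertuito-sentiment-analysis")

examples = [
    ("Me encantó la película, la recomiendo.", "POS"),  # "Loved the movie, recommend it."
    ("El servicio fue pésimo y muy lento.", "NEG"),     # "The service was awful and slow."
]

correct = 0
for text, gold in examples:
    pred = clf(text)[0]["label"]  # this model family returns "POS", "NEG", or "NEU"
    correct += pred == gold

print(f"Accuracy: {correct / len(examples):.2f}")
```

Swap in different models and datasets for other tasks, like toxicity detection or summarization, and you have the basic loop a leaderboard like this automates across all 101 datasets.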

So, what do you think, learning crew? Here are a couple of questions that popped into my head:

  • Given these results, what are the ethical implications of deploying LLMs in languages where their performance is significantly lower?
  • How can we encourage more community involvement in developing and maintaining benchmarks like IberBench?

I'm excited to hear your thoughts on this one. Until next time, keep learning!



Credit to Paper authors: José Ángel González, Ian Borrego Obrador, Álvaro Romo Herrero, Areg Mikael Sarvazyan, Mara Chinea-Ríos, Angelo Basile, Marc Franco-Salvador

PaperLedge, by ernestasposkus