PaperLedge

Machine Learning - MINERVA: Evaluating Complex Video Reasoning



Hey PaperLedge crew, Ernis here, ready to dive into something super cool – a new way to test how well AI really understands videos! Think of it like this: you can teach a computer to recognize a cat in a photo, right? But what if you want it to understand a cat jumping on a table, knocking over a vase, and then looking guilty? That’s where things get tricky.

See, most of the tests we use for video understanding are pretty basic. They just ask a question about the outcome – like, “Did the vase break?” – without caring how the AI got the answer. It’s like giving a student a multiple-choice test without asking them to show their work. They might get the right answer by guessing or just recognizing a pattern in the questions, not because they actually understand the video.

That's where this paper comes in. These researchers were like, “Hold on, we need a better way to check if AI is actually reasoning about videos!” So, they created a new dataset called MINERVA. It’s like a super-detailed video quiz designed to really push AI's understanding.

What makes MINERVA so special? Well, a few things:

  • Multimodal: It uses both video and text. The AI needs to watch the video and understand the question to answer correctly.
  • Diverse: The videos are from all sorts of places – think sports, cooking shows, cartoons… a real mixed bag!
  • Complex: The questions aren’t simple yes/no stuff. They often require multiple steps of reasoning. It's not just "Did the ball go in the net?" but more like "What happened before the ball went in the net that made it possible?"
  • Reasoning Traces: This is the killer feature. For each question, there's a detailed, hand-crafted explanation of how a human would arrive at the correct answer. It's like having the answer key and the step-by-step solution!

The researchers put some of the most advanced AI models through the MINERVA test, and guess what? They struggled! This showed that even the best AIs are still missing something when it comes to truly understanding videos.

But the paper doesn’t just point out the problem. The researchers also dug deep into why these AIs were failing. They found that the biggest issues were:

  • Temporal Localization: Basically, figuring out when things happen in the video. It’s like the AI is watching the whole movie at once instead of following the plot in order.
  • Visual Perception Errors: Misinterpreting what they’re seeing in the video. Maybe mistaking a red ball for an orange one, or not noticing a subtle change in someone's expression.

Interestingly, the AIs were less likely to make errors in logic or in putting the pieces together once they had the right information. This suggests that the main challenge is getting the AI to see and track what’s happening in the video accurately.

So, why does all of this matter?

  • For AI Developers: MINERVA provides a valuable benchmark for improving video understanding models. It highlights specific areas where AI needs to improve.
  • For Researchers: The dataset and analysis offer insights into the challenges of multimodal reasoning and the limitations of current AI systems.
  • For Everyone Else: As AI becomes more integrated into our lives – from self-driving cars to video surveillance – it’s crucial that it can accurately understand what’s happening in the world around it. This research helps us move closer to that goal.

As the authors put it: “Our dataset provides a challenge for frontier open-source and proprietary models.”

The researchers are even sharing their dataset online, so anyone can use it to test and improve their AI models. How cool is that?! You can find it at https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.

Okay, learning crew, time for some food for thought. Here are a couple of things that popped into my head:

  • Given that temporal reasoning is such a bottleneck, could we train AI specifically on understanding timelines and event sequences before exposing it to complex videos?
  • If we can teach AI to explain its reasoning process (like MINERVA does), could we use that to identify and correct its mistakes more easily?

What do you all think? Let me know your thoughts in the comments! Until next time, keep exploring the PaperLedge!



Credit to Paper authors: Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand

PaperLedge, by ernestasposkus