AI Latest Research & Developments - With Digitalent & Mike Nedelko

Latest Artificial Intelligence R&D Session - With Digitalent & Mike Nedelko - Episode 008

Session Topics:

The Llama 4 Controversy and Evaluation Mechanism Failure
Llama 4’s initial high Elo score on LM Arena was driven by optimisations for human preferences, such as emoji use and an overly positive tone. When these were removed, performance dropped significantly. This exposed weaknesses in existing evaluation mechanisms and raised concerns about benchmark reliability.

Two Levels of AI Evaluation
There are two main types of AI evaluation: model-level benchmarking for foundational models (e.g., Gemini, Claude), and use-case-specific evaluations for deployed AI systems—especially Retrieval Augmented Generation (RAG) systems.

Benchmarking Foundational Models
Benchmarks such as MMLU (world knowledge), MMMU (multimodal understanding), GPQA (expert-level reasoning), ARC-AGI (abstract reasoning tasks), and newer ones like CodeElo and SWE-bench (software engineering tasks) are commonly used to assess foundational model performance.
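
As a rough illustration of how a multiple-choice benchmark like MMLU is scored, the sketch below grades a model on four-option questions and reports accuracy. It is a minimal sketch, not any benchmark harness's actual code; `ask_model` is a hypothetical stand-in for whatever model API is used.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# `ask_model` is a hypothetical stand-in for a model API call.
from typing import Callable

def score_multiple_choice(
    questions: list[dict],           # each: {"question": str, "choices": [4 strings], "answer": "A".."D"}
    ask_model: Callable[[str], str]  # returns the model's raw text reply
) -> float:
    """Return accuracy over a list of four-option questions."""
    correct = 0
    for item in questions:
        options = "\n".join(
            f"{letter}. {choice}"
            for letter, choice in zip("ABCD", item["choices"])
        )
        prompt = (
            f"{item['question']}\n{options}\n"
            "Answer with a single letter (A, B, C, or D)."
        )
        reply = ask_model(prompt).strip().upper()
        predicted = reply[:1]  # take the first character as the chosen letter
        if predicted == item["answer"]:
            correct += 1
    return correct / len(questions)
```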

Evaluating Conversational and Agentic LLMs
The Multi-Challenge benchmark by Scale AI evaluates multi-turn conversational capabilities, while τ-bench (tau-bench) assesses how well agentic LLMs perform tasks such as interacting with and modifying databases.
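
The sketch below illustrates the general idea behind this style of agentic evaluation, under stated assumptions: the agent works against a sandboxed database via tool calls, and the grade depends only on whether the final database state matches the expected one. `run_agent` and the toy retail task are hypothetical, not part of any benchmark's API.

```python
# Toy sketch of database-state grading for an agentic task: run the agent in a
# sandbox, then pass/fail on whether the final state equals the expected state.
# `run_agent` is a hypothetical stand-in for the agent loop being evaluated.
import copy
from typing import Callable

def evaluate_agent_task(
    initial_db: dict,
    instruction: str,
    expected_db: dict,
    run_agent: Callable[[dict, str], None],  # mutates the db it is given
) -> bool:
    sandbox = copy.deepcopy(initial_db)   # isolate each run from the fixture
    run_agent(sandbox, instruction)       # agent issues its tool calls against the sandbox
    return sandbox == expected_db         # grade only the resulting state

# Hypothetical example task: cancel order 42 in a toy retail database.
task = {
    "initial_db": {"orders": {42: {"status": "open"}}},
    "instruction": "Cancel order 42.",
    "expected_db": {"orders": {42: {"status": "cancelled"}}},
}
```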

Use Case Specific Evaluation and RAG Systems
Use-case-specific evaluation is critical for RAG systems, which retrieve organisational data to provide context for generation. One example described a car-booking agent returning a cheesecake recipe, underscoring the risks of unexpected model behaviour.

Ragas Framework for Evaluating RAG Systems
Ragas and DeepEval offer evaluation metrics such as context precision, response relevance, and faithfulness. These frameworks can compare model outputs against ground truth to assess both retrieval and generation components.
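
A minimal sketch of a Ragas-style evaluation is shown below. It assumes the classic documented Ragas interface (an `evaluate` call with metric objects and question/answer/contexts/ground_truth columns); exact import paths and column names vary across Ragas versions, and an LLM backend must be configured for the metrics to run.

```python
# Sketch of a Ragas-style RAG evaluation; imports and column names follow the
# classic documented interface and may differ in newer Ragas versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

samples = {
    "question": ["What is the cancellation fee for a booking?"],
    "answer": ["The cancellation fee is 10% of the booking value."],
    "contexts": [["Cancellations incur a fee of 10% of the total booking value."]],
    "ground_truth": ["10% of the booking value."],
}

dataset = Dataset.from_dict(samples)

# Scores the generation side (faithfulness, relevance) and the retrieval side (context precision).
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```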

The Leaderboard Illusion in Model Evaluation
Leaderboards like LM Arena may present a distorted picture, as large organisations submit multiple hidden models to optimise final rankings—misleading users about true model performance.

Using LLMs to Evaluate Other LLMs: Advantages and Risks
LLMs can be used to evaluate other LLMs for scalability, but this introduces risks such as bias and false positives. Fourteen common design flaws have been identified in LLM-on-LLM evaluation systems.
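
As a sketch of the basic LLM-as-judge pattern, the function below asks a judge model to grade a candidate answer against a reference on a 1-to-5 scale. `call_judge` is a hypothetical stand-in for the judge model's API; real setups add rubrics, position randomisation, and bias audits to mitigate the risks noted above.

```python
# Minimal LLM-as-judge sketch: one model grades another model's answer.
# `call_judge` is a hypothetical stand-in for the judge model's API.
import re
from typing import Callable, Optional

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Rate the candidate from 1 (wrong) to 5 (fully correct and faithful).
Reply with only the number."""

def judge_answer(
    question: str,
    reference: str,
    candidate: str,
    call_judge: Callable[[str], str],
) -> Optional[int]:
    reply = call_judge(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None  # None = unparseable verdict
```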

Circularity and LLM Narcissism in Evaluation
Circularity arises when evaluator feedback influences the model being tested. LLM narcissism describes a model favouring outputs similar to its own, distorting evaluation outcomes.
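
One common way to probe self-preference is to have the judge choose between its own earlier answer and another model's answer to the same question, with positions shuffled, and measure how often it picks its own. The sketch below assumes a hypothetical `call_judge` helper and pre-collected answer pairs.

```python
# Sketch of a self-preference ("LLM narcissism") probe: shuffle the order of the
# judge's own answer and a rival answer, then count how often the judge picks its own.
# `call_judge` is a hypothetical stand-in for the judge model's API.
import random
from typing import Callable

def self_preference_rate(
    pairs: list[tuple[str, str, str]],   # (question, judge_model_answer, other_model_answer)
    call_judge: Callable[[str], str],
) -> float:
    own_wins = 0
    for question, own, other in pairs:
        first, second = (own, other) if random.random() < 0.5 else (other, own)
        prompt = (
            f"Question: {question}\n"
            f"Answer 1: {first}\nAnswer 2: {second}\n"
            "Which answer is better? Reply with 1 or 2."
        )
        pick = call_judge(prompt).strip()
        chosen = first if pick.startswith("1") else second
        if chosen == own:
            own_wins += 1
    return own_wins / len(pairs)  # rates well above 0.5 suggest self-preference bias
```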

Label Correlation and Test Set Leaks
Label correlation occurs when human and model evaluators agree on flawed outputs. Test set leaks happen when models have seen benchmark data during training, compromising result accuracy.
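
Test set leaks are often screened for with simple n-gram overlap between benchmark items and the training corpus. The sketch below shows that idea only; real contamination studies add text normalisation and fuzzy matching, and the corpus and threshold here are illustrative.

```python
# Rough sketch of a test-set contamination check: flag benchmark items whose
# word n-grams also appear in the training corpus.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(
    benchmark_items: list[str],
    training_docs: list[str],
    n: int = 8,
) -> list[int]:
    """Return indices of benchmark items sharing any n-gram with the training docs."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [
        i for i, item in enumerate(benchmark_items)
        if ngrams(item, n) & train_grams  # any shared n-gram => suspicious
    ]
```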

The Need for Use Case Specific Model Evaluation
General benchmarks alone are increasingly inadequate. Tailored, context-driven evaluations are essential to determine real-world suitability and performance of AI models.


By Dillan Leslie-Rowe