Coding Chats episode 76 - John talks to Laura Dietz - a computer science professor whose work focuses on whether AI evaluation metrics actually tell the truth. She's known for her critical take on "LLM as a judge" — not because she thinks it's useless, but because she wants numbers that mean something rather than numbers that just make a system look good.
The conversation tackles some uncomfortable realities for software engineers: using one LLM to write code and another to review it is a circular trap, prompt engineering shouldn't be a computer scientist's day job, and every time you reject your AI coding assistant's output, you're quietly generating the training data that shapes its successor.
Chapters
00:00 Introduction to Laura Dietz and Her Journey
03:12 Exploring LLMs as Judges
06:16 Challenges in Evaluating Search Systems
08:49 The Evolution of User Queries and Expectations
11:46 The Role of LLMs in Information Retrieval
14:44 Defining Quality in Search Results
17:27 The Complexity of User Intent
19:54 Human-AI Collaboration in Code Review
22:53 The Future of LLMs in Software Development
25:23 Balancing Human and AI Roles
28:20 Innovative Approaches to AI Evaluation
34:10 The Art of Assembling Ideas
36:39 Balancing Cost and Quality in LLMs
39:09 Evaluating LLM Performance
43:50 The Future of LLMs and Training Data
49:19 Exploring New Architectures in AI
55:16 Understanding In-Context Learning
01:00:45 The Role of AI in Creative Expression
01:06:59 Exploring Related Content
Laura's Links:
https://www.cs.unh.edu/~dietz/
https://www.linkedin.com/in/laura-dietz-47036516/
John's Links:
John's LinkedIn: https://www.linkedin.com/in/johncrickett/
John's YouTube: https://www.youtube.com/@johncrickett
John's Twitter: https://x.com/johncrickett
John's Bluesky: https://bsky.app/profile/johncrickett.bsky.social
Check out John's software engineering newsletters:
Coding Challenges: https://codingchallenges.substack.com/ shares real-world project ideas you can use to level up your coding skills.
Developing Skills: https://read.developingskills.fyi/ covers everything from system design to soft skills, helping readers progress from junior to staff+ or move onto a management track.
Takeaways
Using an LLM to both generate and evaluate outputs is circular — like a student grading their own homework.
If your evaluation metric can go up without your system actually improving, it's not a real metric.
A better human-in-the-loop isn't one that rubber-stamps AI suggestions — it's one that's guided to look in the right place.
LLMs don't get bored, which makes them genuinely useful for code review — but that's not the same as making them accurate.
"Faith-based engineering" — trusting AI output without validation — is a real and growing problem in software teams.
Prompt engineering is a workaround, not a discipline; real engineers should be building systems, not crafting incantations.
Every rejection you give your AI coding assistant is a training signal — your frustration today is someone else's better tool tomorrow.
The transformer attention mechanism is a weighted sum, and a sum isn't always the right operation — some problems need an AND, not an OR (see the sketch after this list).
AI tools are lowering the barrier to coding for people who were previously too intimidated to try, and that's worth celebrating.
The same network effect that makes a platform valuable also makes monopoly in AI training data genuinely dangerous.
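To make the "weighted sum" point concrete, here is a minimal sketch of single-head scaled dot-product attention in plain NumPy. The function names, shapes, and example data are illustrative assumptions, not something from the episode; the point is simply that each output row is a softmax-weighted sum of value vectors.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Single-head scaled dot-product attention.
    # Q, K, V: arrays of shape (seq_len, d).
    # Each output row is a softmax-weighted sum (a convex combination)
    # of the rows of V.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row is non-negative and sums to 1
    return weights @ V                  # weighted sum of value vectors

# Tiny usage example with random data (purely illustrative).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)  # (4, 8)

Because the weights are non-negative and sum to 1, a single attention step can only blend value vectors — loosely an OR-like aggregation — rather than enforce that several conditions hold at once, which is the AND-versus-OR distinction the takeaway above gestures at.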