In the second episode of Gradient Descent, Vishnu Vettrivel (CTO of Wisecube) and Alex Thomas (Principal Data Scientist) explore the innovative yet controversial idea of using LLMs to judge and evaluate other AI systems. They discuss the hidden human role in AI training, the limitations of traditional benchmarks, the strengths and weaknesses of automated evaluation, and best practices for building reliable AI judgment systems.
Timestamps:
00:00 – Introduction & Context
01:00 – The Role of Humans in AI
03:58 – Why Is Evaluating LLMs So Difficult?
09:00 – Pros and Cons of LLM-as-a-Judge
14:30 – How to Make LLM-as-a-Judge More Reliable?
19:30 – Trust and Reliability Issues
25:00 – The Future of LLM-as-a-Judge
30:00 – Final Thoughts and Takeaways
Listen on:
• YouTube: https://youtube.com/@WisecubeAI/podcasts
• Apple Podcasts: https://apple.co/4kPMxZf
• Spotify: https://open.spotify.com/show/1nG58pwg2Dv6oAhCTzab55
• Amazon Music: https://bit.ly/4izpdO2
Our solutions:
• https://askpythia.ai/ - LLM Hallucination Detection Tool
• https://www.wisecube.ai - Wisecube AI platform for large-scale biomedical knowledge analysis
Follow us:
• Pythia Website: www.askpythia.ai
• Wisecube Website: www.wisecube.ai
• LinkedIn: www.linkedin.com/company/wisecube
• Facebook: www.facebook.com/wisecubeai
• Reddit: www.reddit.com/r/pythia/
Mentioned Materials:
- Best Practices for LLM-as-a-Judge: https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods: https://arxiv.org/pdf/2412.05579v2
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena: https://arxiv.org/abs/2306.05685
- Guide to LLM-as-a-Judge: https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- Preference Leakage: A Contamination Problem in LLM-as-a-Judge: https://arxiv.org/pdf/2502.01534
- Large Language Models Are Not Fair Evaluators: https://arxiv.org/pdf/2305.17926
- Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment: https://arxiv.org/pdf/2402.14016v2
- Optimization-based Prompt Injection Attack to LLM-as-a-Judge: https://arxiv.org/pdf/2403.17710v4
- AWS Bedrock: Model Evaluation: https://aws.amazon.com/blogs/machine-learning/llm-as-a-judge-on-amazon-bedrock-model-evaluation/
- Hugging Face: LLM Judge Cookbook: https://huggingface.co/learn/cookbook/en/llm_judge