How can you measure the quality of a large language model? What tools can measure bias, toxicity, and truthfulness levels in a model using Python? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, returns to discuss techniques and tools for evaluating LLMs with Python.
Jodie provides some background on large language models and how they can absorb vast amounts of information about the relationship between words using a type of neural network called a transformer. We discuss training datasets and the potential quality issues with crawling uncurated sources.
We dig into ways to measure levels of bias, toxicity, and hallucinations using Python. Jodie shares three benchmarking datasets and links to resources to get you started. We also discuss ways to augment models using agents or plugins, which can access search engine results or other authoritative sources.
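For a taste of what this looks like in practice, here's a minimal sketch along the lines of the toolchain discussed in the episode, using Hugging Face's evaluate library and its toxicity and regard measurements (both linked below). The choice of GPT-2 and the two example prompts are illustrative assumptions, not code from the show.

```python
# A rough sketch: generate short completions with a small open model, then
# score them with the "toxicity" and "regard" measurements from `evaluate`.
# The prompts and the choice of GPT-2 are illustrative assumptions only.
from transformers import pipeline
import evaluate

generator = pipeline("text-generation", model="gpt2")
prompts = ["The nurse said that", "The engineer said that"]
completions = [
    generator(p, max_new_tokens=20, num_return_sequences=1)[0]["generated_text"]
    for p in prompts
]

# Toxicity scores each string with a hate-speech classifier under the hood;
# regard estimates the polarity of the language toward the people mentioned.
toxicity = evaluate.load("toxicity", module_type="measurement")
regard = evaluate.load("regard", module_type="measurement")

print(toxicity.compute(predictions=completions))
print(regard.compute(data=completions))
```

The same pattern extends to the benchmarking datasets mentioned in the episode: load prompts from truthful_qa, wino_bias, or bold with the datasets library, generate completions, and run them through the relevant measurement.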
This week’s episode is brought to you by Intel.
Course Spotlight: Learn Text Classification With Python and Keras
In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.
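As a taste of the course's starting point, here's a minimal sketch of a bag-of-words baseline: scikit-learn's CountVectorizer feeding a single sigmoid neuron in Keras, which amounts to logistic regression. The four-sentence dataset is a made-up placeholder, not the course's data.

```python
# A minimal bag-of-words baseline: CountVectorizer features feeding a single
# sigmoid neuron in Keras (effectively logistic regression). The four example
# sentences are a made-up placeholder, not the data used in the course.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow import keras

sentences = ["I loved this film", "Great acting and story", "Terrible plot", "I hated it"]
labels = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences).toarray()

model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=50, verbose=0)

print(model.predict(vectorizer.transform(["loved the story"]).toarray()))
```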
00:00:00 – Introduction
00:02:19 – Testing characteristics of LLMs with Python
00:04:18 – Background on LLMs
00:08:35 – Training of models
00:14:23 – Uncurated sources of training
00:16:12 – Safeguards and prompt engineering
00:21:19 – TruthfulQA and creating a more strict prompt
00:23:20 – Information that is out of date
00:26:07 – WinoBias for evaluating gender stereotypes
00:28:30 – BOLD dataset for evaluating bias
00:30:28 – Sponsor: Intel
00:31:18 – Using Hugging Face to start testing with Python
00:35:25 – Using the transformers package
00:37:34 – Using langchain for proprietary models
00:43:04 – Putting the tools together and evaluating
00:47:19 – Video Course Spotlight
00:48:29 – Assessing toxicity
00:50:21 – Measuring bias
00:54:40 – Checking the hallucination rate
00:56:22 – LLM leaderboards
00:58:17 – What helped ChatGPT leap forward?
01:06:01 – Improvements of what is being crawled
01:07:32 – Revisiting agents and RAG
01:11:03 – ChatGPT plugins and Wolfram-Alpha
01:13:06 – How can people follow your work online?
01:14:33 – Thanks and goodbye

A Beginner’s Guide to Language Models - Built In
ChatGPT - Explained! - YouTube
truthful_qa - Datasets at Hugging Face
wino_bias - Datasets at Hugging Face
bold - Datasets at Hugging Face

Tutorials and Documentation for Python Packages:
Evaluating Language Model Bias with 🤗 Evaluate
Hugging Face - HF_bias_evaluation - Google Colab
General Usage - Load a Dataset - Hugging Face
What is Text Generation? - Hugging Face
🤗 Evaluate - Library Evaluating ML Models
Python Quickstart - 🦜️🔗 Langchain
Toxicity - a Hugging Face Space by evaluate-measurement
Regard - a Hugging Face Space by evaluate-measurement
Open LLM Leaderboard - a Hugging Face Space
Common Crawl - Open Repository of Web Crawl Data
The Pile
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora
Transformers Agents - Hugging Face
Agents - 🦜️🔗 Langchain
ChatGPT Gets Its “Wolfram Superpowers”! - Stephen Wolfram
Inside the AI Factory: The Humans that Make Tech Seem Human - The Verge
Jodie Burchell - The JetBrains Blog
Jodie Burchell’s Blog - Standard error
Jodie Burchell 🇦🇺🇩🇪 (@t_redactyl) - Twitter
Jodie Burchell 🇦🇺🇩🇪 (@[email protected]) - Fosstodon
JetBrains: Essential tools for software developers and teams

Level up your Python skills with our expert-led courses:
Data Cleaning With pandas and NumPy
Creating Web Maps From Your Data With Python Folium
Learn Text Classification With Python and Keras

Support the podcast & join our community of Pythonistas