The Real Python Podcast

Preparing Data to Measure True Machine Learning Model Performance


How do you prepare a dataset for machine learning (ML)? How do you go beyond cleaning the data and move toward measuring how the model performs? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, returns to talk about strategies for better ML model performance.

Jodie starts by defining some terms for the conversation. We talk about targets, features, and supervised learning.
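To make that vocabulary concrete, here is a minimal sketch using pandas and scikit-learn. The toy dataset and its column names (`hours_studied`, `hours_slept`, `passed`) are invented for illustration and do not come from the episode. The feature columns are the inputs, the target column is the label we want to predict, and having known labels is what makes this a supervised learning setup.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: two feature columns and one target column.
df = pd.DataFrame(
    {
        "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
        "hours_slept": [8, 7, 6, 8, 5, 7, 6, 8],
        "passed": [0, 0, 0, 1, 0, 1, 1, 1],
    }
)

X = df[["hours_studied", "hours_slept"]]  # features: the model's inputs
y = df["passed"]  # target: the known label the model learns to predict

# Hold out a quarter of the rows so performance is measured on unseen data.
# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(f"{len(X_train)} training rows, {len(X_test)} test rows")
```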

We discuss three common ways that data can alter model performance and which Python tools can help spot and avoid them. Jodie shares personal experiences of working through these pitfalls. We also share a healthy collection of resources to explore and learn more.
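As a rough illustration of two of the checks covered in the episode (looking for duplicate rows, and applying transformations only after the data is split), here is a small sketch. The synthetic data, the column names, and the choice of `StandardScaler` with `LogisticRegression` are assumptions made for the example, not the approach described by Jodie. Wrapping the scaler in a pipeline means that, inside each cross-validation fold, it is fitted on the training portion only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["target"] = y

# Simulate accidental duplicates, then detect and drop them before splitting.
# Duplicated rows that land in both train and test sets leak information.
df = pd.concat([df, df.iloc[:10]], ignore_index=True)
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

features = df.drop(columns="target")
target = df["target"]

# The pipeline refits the scaler on each fold's training portion only,
# so statistics from the held-out portion never reach the model.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, features, target, cv=5)

print(f"Dropped {n_duplicates} duplicate rows")
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Calling `fit_transform()` on the full dataset before splitting would instead let the scaler see the test rows, which is the kind of leakage this pattern is meant to avoid.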

Course Spotlight: Combining Data in pandas With concat() and merge()

In this video course, you’ll learn two techniques for combining data in pandas: merge() and concat(). Combining Series and DataFrame objects in pandas is a powerful way to gain new insights into your data.

Topics:

  • 00:00:00 – Introduction
  • 00:01:46 – Recent conference talks
  • 00:03:24 – How to prepare your data for model performance
  • 00:04:24 – Vocabulary: target, features, and supervised learning
  • 00:06:28 – The curse of dimensionality
  • 00:08:57 – Overfitting
  • 00:11:08 – Underfitting
  • 00:12:11 – Splitting the dataset
  • 00:13:39 – K-fold cross validation
  • 00:18:30 – Data leakage
  • 00:21:36 – Checking for duplicates
  • 00:26:23 – Applying transformations only after splitting data
  • 00:31:16 – Imbalanced data
  • 00:36:36 – Using ML to balance data
  • 00:41:05 – Informing your model of the imbalance
  • 00:42:56 – Video Course Spotlight
  • 00:44:20 – Accuracy used as a measure
  • 00:49:05 – scikit-learn function classification_report
  • 00:50:43 – JetBrains blog post and conference talk
  • 00:52:18 – How can people follow your work online?
  • 00:54:39 – Upcoming webinars
  • 00:56:20 – Thanks and goodbye
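
The imbalanced data and accuracy topics above can be sketched roughly as follows. This is an illustration under assumptions, not code from the episode: the dataset is synthetic, with about 95% of rows in one class, and `class_weight="balanced"` on `LogisticRegression` is just one way to inform a model of the imbalance. The baseline shows why accuracy alone can mislead, and scikit-learn's `classification_report` adds per-class precision and recall.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% of rows belong to one class.
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = np.bincount(y_test).max() / len(y_test)

# class_weight="balanced" reweights errors inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Model accuracy: {accuracy:.3f}")
print(report)
```

A high baseline accuracy here says nothing about how well the minority class is detected, which is why the per-class rows of the report are worth reading.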

Show Links:

    • How to Prepare Your Dataset for Machine Learning and Analysis - The JetBrains Datalore Blog
    • Curse of dimensionality - Wikipedia
    • Overfitting vs. Underfitting: A Complete Example - Will Koehrsen
    • A Gentle Introduction to k-fold Cross-Validation - MachineLearningMastery.com
    • sklearn.model_selection.train_test_split — scikit-learn documentation
    • Cross-validation: evaluating estimator performance — scikit-learn documentation
    • sklearn.model_selection.cross_val_score — scikit-learn documentation
    • Data Leakage And Its Effect On The Performance of An ML Model
    • pandas.DataFrame.duplicated — pandas documentation
    • pandas GroupBy: Your Guide to Grouping Data in Python – Real Python
    • pandas.DataFrame.groupby — pandas documentation
    • Difference between fit(), transform() and fit_transform() method in Scikit-learn - Aishwarya Chand: Nerd For Tech
    • Imbalanced Data in Machine Learning - Google Developers
    • Under-sampling — imbalanced-learn.org
    • Over-sampling — imbalanced-learn.org
    • Learn - Getting Started with Gretel.ai
    • Classification on imbalanced data: Class weights - TensorFlow Core
    • Tour of Evaluation Metrics for Imbalanced Classification - MachineLearningMastery.com
    • CloudBrew - A two-day conference by AZUG, the Belgium Microsoft Azure User Group
    • Jodie Burchell’s Blog - Standard error
    • Jodie Burchell 🇦🇺🇩🇪 (@t_redactyl) - Twitter
    • Jodie Burchell 🇦🇺🇩🇪 - Fosstodon
    • JetBrains: Essential tools for software developers and teams

Level up your Python skills with our expert-led courses:

      • Data Cleaning With pandas and NumPy
      • Sneaky REST APIs With Django Ninja
      • Combining Data in pandas With concat() and merge()

Support the podcast & join our community of Pythonistas
