O'Reilly Data Show Podcast

Labeling, transforming, and structuring training data sets for machine learning



In this episode of the Data Show, I speak with Alex Ratner, project lead for Stanford’s Snorkel open source project. Ratner also recently secured a faculty position at the University of Washington and is currently working on a company that supports and extends the Snorkel project. Snorkel is a framework for building and managing training data. Based on our survey from earlier this year, labeled data remains a key bottleneck for organizations building machine learning applications and services.

Ratner was a guest on the podcast a little over two years ago, when Snorkel was a relatively new project. Since then, Snorkel has added more features, expanded into computer vision use cases, and now boasts many users, including Google, Intel, IBM, and other organizations. Along with his thesis advisor, Professor Chris Ré of Stanford, Ratner and his collaborators have long championed the importance of building tools aimed squarely at helping teams build and manage training data. With today’s release of Snorkel version 0.9, we are a step closer to having a framework that enables the programmatic creation of training data sets.

Snorkel pipeline for data labeling. Source: Alex Ratner, used with permission.

We had a great conversation spanning many topics, including:

  • Why he and his collaborators decided to focus on “data programming” and tools for building and managing training data.
  • A tour through Snorkel, including its target users and key components.
  • What’s in the newly released version (v0.9) of Snorkel.
  • Common use cases for Snorkel, whose user base has grown considerably since we last spoke.
  • Data lineage, AutoML, and end-to-end automation of machine learning pipelines.
  • HoloClean and other projects focused on data quality and data programming.
  • The need for tools that can ease the transition from raw data to derived data (e.g., entities), insights, and even knowledge.

Related resources:

  • “Product management in the machine learning era”: A tutorial at the Artificial Intelligence Conference in San Jose, September 9-12, 2019.
  • Chris Ré: “Software 2.0 and Snorkel”
  • Alex Ratner: “Creating large training data sets quickly”
  • Ihab Ilyas and Ben Lorica on “The quest for high-quality data”
  • Roger Chen: “Acquiring and sharing high-quality data”
  • Jeff Jonas on “Real-time entity resolution made accessible”
  • “Data collection and data markets in the age of privacy and machine learning”
O'Reilly Data Show Podcast, by O'Reilly Media
