O'Reilly Data Show Podcast

Why companies are in need of data lineage solutions



In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up.

There are several San Francisco Bay Area companies that have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It’s born out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I’m describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew what journey the data took, and exactly what went into producing it before it landed in your data warehouse or any other storage appliance you use, that would be really useful.

… Think about data lineage as helping with data quality issues, such as understanding whether something is corrupted. On the security side, think of GDPR … which was one of the hot topics I heard about at the Strata Data Conference in London in 2018.

Why companies are suddenly building data lineage solutions

A data lineage system becomes necessary as time progresses. It makes maintenance easier. You need it for audit trails, for security and compliance. But you also need to think of the benefit of managing the data sets you’re working with. If you’re working with 10 databases, you need to know what’s going on in them. If I have to give you a vision of a data lineage system, think of it as a final view of some data set that shows you a graph of what it’s linked to. Then it gives you some metadata information so you can drill down. Let’s say you have corrupted data, or let’s say you want to debug something. All these cases tie into the actual use cases for which we want to build it.
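
To make the "graph plus metadata" picture concrete, here is a minimal sketch in Python. It is not the system Stitch Fix built; the class name `LineageGraph`, the data set names, the job names, and the `owner` field are all hypothetical, chosen only to illustrate how a lineage view might let you drill down from one data set to everything it was derived from.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage store: data sets are nodes, producing jobs label the edges."""

    def __init__(self):
        # target data set -> set of (source data set, job name) pairs
        self.parents = defaultdict(set)
        # data set -> free-form metadata (owner, schema, and so on)
        self.metadata = {}

    def add_dataset(self, name, **metadata):
        self.metadata.setdefault(name, {}).update(metadata)

    def add_edge(self, source, target, job):
        """Record that `job` read `source` and wrote `target`."""
        self.add_dataset(source)
        self.add_dataset(target)
        self.parents[target].add((source, job))

    def upstream(self, name):
        """Return every data set `name` depends on, nearest ancestors first."""
        found = []
        visited = {name}
        queue = deque([name])
        while queue:
            current = queue.popleft()
            # sorted() keeps the traversal order deterministic
            for source, _job in sorted(self.parents.get(current, ())):
                if source not in visited:
                    visited.add(source)
                    found.append(source)
                    queue.append(source)
        return found


# Hypothetical example data, for illustration only.
g = LineageGraph()
g.add_dataset("orders_raw", owner="ingest-team")
g.add_edge("orders_raw", "orders_clean", job="clean_orders")
g.add_edge("orders_clean", "recommendations", job="train_recs")
g.add_edge("clicks_raw", "recommendations", job="train_recs")

# If "recommendations" looks corrupted, these are the data sets to inspect.
print(g.upstream("recommendations"))
```

Walking the graph upstream from a suspect data set is the debugging case described above: it lists the inputs to check, and the metadata on each node says who owns them.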

Related resources:

  • “Deep automation in machine learning”
  • Vitaly Gordon on “Building tools for enterprise data science”
  • “Managing risk in machine learning”
  • Haoyuan Li explains why “In the age of AI, fundamental value resides in data”
  • “What machine learning means for software development”
  • Joe Hellerstein on how “Metadata services can lead to performance and organizational improvements”
O'Reilly Data Show Podcast, by O'Reilly Media
