Data Engineering Podcast

Strategies For Proactive Data Quality Management


Summary

Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
  • We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
  • Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what you are building at Datafold and the story behind it?
  • What are the biggest factors that you see contributing to data quality issues?
    • How are teams identifying and addressing those failures?
  • How does the data platform architecture impact the potential for introducing quality problems?
  • What are some of the potential risks or consequences of introducing errors in data processing?
  • How can organizations shift to being proactive in their data quality management?
    • How much of a role does tooling play in addressing the introduction and remediation of data quality problems?
  • Can you describe how Datafold is designed and architected to allow for proactive management of data quality?
    • What are some of the original goals and assumptions about how to empower teams to improve data quality that have been challenged or changed as you have worked through building Datafold?
  • What is the workflow for an individual or team who is using Datafold as part of their data pipeline and platform development?
  • What are the organizational patterns that you have found to be most conducive to proactive data quality management?
    • Who is responsible for identifying and addressing quality issues?
  • What are the most interesting, innovative, or unexpected ways that you have seen Datafold used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
  • When is Datafold the wrong choice?
  • What do you have planned for the future of Datafold?

Contact Info
  • LinkedIn
  • @glebmm on Twitter

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
  • Datafold
  • Autodesk
  • Airflow
    • Podcast.__init__ Episode
  • Spark
  • Looker
    • Podcast Episode
  • Amundsen
    • Podcast Episode
  • dbt
    • Podcast Episode
  • Dagster
    • Podcast Episode
    • Podcast.__init__ Episode
  • Change Data Capture
    • Podcast Episodes
  • Delta Lake
    • Podcast Episode
  • Trino
    • Podcast Episode
  • Presto
  • Parquet
    • Podcast Episode
  • Data Quality Meetup
  • The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Special Guest: Gleb Mezhanskiy.
