Data Engineering Podcast

Bringing Automation To Data Labeling For Machine Learning With Watchful



Summary

Data engineers have typically left the process of data labeling to data scientists or other roles because of its nature as a manual and process-heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform that makes labeling a repeatable and scalable process by codifying domain expertise. In this episode, founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation, and how that allows data engineers to be involved in the process.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
  • Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
  • The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
  • Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% of them reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
  • Your host is Tobias Macey and today I’m interviewing Shayan Mohanty about Watchful, a data-centric platform for labeling your machine learning inputs

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Watchful is and the story behind it?
  • What are your core goals at Watchful?
    • What problem are you solving and who are the people most impacted by that problem?
  • What is the role of the data engineer in the process of getting data labeled for machine learning projects?
  • Data labeling is a large and competitive market. How do you characterize the different approaches offered by the various platforms and services?
  • What are the main points of friction involved in getting data labeled?
    • How do the types of data and its applications factor into how those challenges manifest?
  • What does Watchful provide that allows it to address those obstacles?
  • Can you describe how Watchful is implemented?
    • What are some of the initial ideas/assumptions that you have had to re-evaluate?
    • What are some of the ways that you have had to adjust the design of your user experience flows since you first started?
  • What is the workflow for teams who are adopting Watchful?
    • What are the types of collaboration that need to happen in the data labeling process?
    • What are some of the elements of shared vocabulary that different stakeholders in the process need to establish to be successful?
  • What are the most interesting, innovative, or unexpected ways that you have seen Watchful used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Watchful?
  • When is Watchful the wrong choice?
  • What do you have planned for the future of Watchful?

Contact Info
  • LinkedIn
  • @shayanjm on Twitter

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links
  • Watchful
  • Entity Resolution
  • Supervised Machine Learning
  • BERT
  • CLIP
  • LabelBox
  • Label Studio
  • Snorkel AI
    • Machine Learning Podcast Episode
  • RegEx == Regular Expression
  • REPL == Read Evaluate Print Loop
  • IDE == Integrated Development Environment
  • Turing Completeness
  • Clojure
  • Rust
  • Named Entity Recognition
  • The Halting Problem
  • NP Hard
  • Lidar
  • Shayan: Arguments Against Hand Labeling
  • The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
