The Python Podcast.__init__

Version Control For Your Machine Learning Projects



Summary

Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, and the output of the models themselves. To address that need, Dmitry Petrov built the Data Version Control project, known as DVC. In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and eases concerns around reproducing and rebuilding models at different stages of the project's lifecycle. If you work as part of a team that is building machine learning models or other data-intensive analysis, then make sure to give this a listen and then start using DVC today.

Announcements
  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects

Interview
  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what DVC is and how it got started?
  • How do the needs of machine learning projects differ from other software applications in terms of version control?
  • Can you walk through the workflow of a project that uses DVC?
    • What are some of the main ways that it differs from your experience building machine learning projects without DVC?
  • In addition to the data that is used for training, the code that generates the model, and the end result there are other aspects such as the feature definitions and hyperparameters that are used. Can you discuss how those factor into the final model and any facilities in DVC to track the values used?
  • In addition to version control for software applications, there are a number of other pieces of tooling that are useful for building and maintaining healthy projects such as linting and unit tests. What are some of the adjacent concerns that should be considered when building machine learning projects?
  • What types of metrics do you track in DVC and how are they collected?
    • Are there specific problem domains or model types that require tracking different metric formats?
  • In the documentation it mentions that the data files live outside of git and can be managed in external storage systems. I’m wondering if there are any plans to integrate with systems such as Quilt or Pachyderm that provide versioning of data natively and what would be involved in adding that support?
  • What was your motivation for implementing this system in Python?
    • If you were to start over today what would you do differently?
  • Being a venture backed startup that is producing open source products, what is the value equation that makes it worthwhile for your investors?
  • What have been some of the most interesting, challenging, or unexpected aspects of building DVC?
  • What do you have planned for the future of DVC?

Keep In Touch
  • dmpetrov on GitHub
  • Blog
  • @fullstackml on Twitter
  • LinkedIn

Picks
  • Tobias
    • Otter.ai
  • Dmitry
    • Go outside and get some fresh air

Links
  • DVC
  • Iterative.ai
  • Linear Regression
  • Logistic Regression
  • C++
  • Perl
  • Git
  • Version Control System
  • Uber Michelangelo
  • Domino Data Lab
  • Git LFS
  • AUC == Area Under Curve metric for evaluating machine learning model performance
  • Wes McKinney Interview
  • PyTorch
    • Podcast Interview
  • TensorFlow
  • TensorBoard
  • MLflow
  • Quilt Data
    • Data Engineering Podcast Episode
  • Pachyderm
    • Data Engineering Podcast Episode
  • Apache Airflow
    • Podcast Interview

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

The Python Podcast.__init__ by Tobias Macey
