The Real Python Podcast

Scaling Data Science and Machine Learning Infrastructure Like Netflix

05.21.2021 - By Real Python

Would you like to move your data science project from a laptop to the cloud? Would you also like to have snapshots of your project saved along the way, so that you can go back in time or share the state of your project with another team member? This week on the show, we have Savin Goyal from Netflix. Savin is the technical lead for machine learning infrastructure at Netflix. He joins us to talk about Metaflow, an open-source tool to simplify building, managing, and scaling data science projects.

Metaflow addresses the needs of the numerous data scientists who work at Netflix. Machine learning is a key strength for the streaming service. They tried several existing tools to scale their own internal infrastructure and, after this experimentation, developed Metaflow.

We talk about the history of the project and how someone could get started with the open-source version. Savin also contrasts the cost of infrastructure with the cost of data scientists' time.

Course Spotlight: Simplify Python GUI Development With PySimpleGUI

In this step-by-step course, you’ll learn how to create a cross-platform graphical user interface (GUI) using Python and PySimpleGUI. A graphical user interface is an application that has buttons, windows, and lots of other elements that the user can use to interact with your application.

Topics:

00:00:00 – Introduction

00:01:53 – What is Metaflow?

00:04:15 – Savin’s background in data science and infrastructure

00:06:06 – Democratization of infrastructure and iteration of tools

00:10:34 – What information is saved about the infrastructure requirements for a project?

00:17:17 – How are the requirements annotated?

00:18:39 – Sponsor: DigitalOcean’s App Platform

00:19:15 – How do project snapshots work?

00:29:33 – Cost of infrastructure vs data scientists

00:32:28 – Working with data at Netflix scale

00:37:55 – Video Course Spotlight

00:39:06 – Getting an organization to use new tools and then making them open-source

00:49:51 – Documentation of Metaflow and getting started on solving infrastructure problems

00:53:57 – What made you interested in working on infrastructure tools?

00:55:13 – What is something you are excited about in the world of Python?

00:56:18 – What do you want to learn next?

00:58:14 – Thanks and goodbye

Show Links:

Metaflow: A framework for real-life data science

Metaflow: Tutorials

More Data Science, Less Engineering with Netflix’s Metaflow By Savin Goyal - YouTube

R: The R Project for Statistical Computing

Tidyverse: R packages for data science

Anything you can do, I can do (kinda). Tidyverse pipes in Pandas

reticulate: R Interface to Python

Apache Airflow: Programmatically author, schedule and monitor workflows

Directed acyclic graph (DAG) - Wikipedia article

Serializing Objects With the Python pickle Module - Real Python Course

