O'Reilly Data Show Podcast

How to train and deploy deep learning at scale



In this episode of the Data Show, I spoke with Ameet Talwalkar, assistant professor of machine learning at CMU and co-founder of Determined AI. He was an early and key contributor to Spark MLlib and a member of AMPLab. Most recently, he helped conceive and organize the first edition of SysML, a new academic conference at the intersection of systems and machine learning (ML).
We discussed using and deploying deep learning at scale. This is an empirical era for machine learning, and, as I noted in an earlier article, as successful as deep learning has been, our level of understanding of why it works so well is still lacking. In practice, machine learning engineers need to explore and experiment using different architectures and hyperparameters before they settle on a model that works for their specific use case. Training a single model usually involves big (labeled) data and big models; as such, exploring the space of possible model architectures and parameters can take days, weeks, or even months. Talwalkar has spent the last few years grappling with this problem as an academic researcher and as an entrepreneur. In this episode, he describes some of his related work on hyperparameter tuning, systems, and more.
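Talwalkar's work on hyperparameter tuning includes Hyperband, a bandit-based method that speeds up search by giving more training budget to promising configurations via successive halving. As a rough illustration of that idea (a sketch, not his implementation), the Python snippet below runs successive halving over random learning rates; `train_and_eval` is a hypothetical stand-in for an actual training run:

```python
import random

def train_and_eval(config, budget):
    # Hypothetical stand-in: train a model with `config` for `budget`
    # epochs and return a validation loss (lower is better).
    return (config["lr"] - 0.01) ** 2 + random.gauss(0, 0.001) / budget

def successive_halving(n_configs=27, min_budget=1, eta=3):
    """Keep the best 1/eta of configurations at each rung,
    multiplying the training budget for the survivors by eta."""
    configs = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        scores = [(train_and_eval(c, budget), c) for c in configs]
        scores.sort(key=lambda s: s[0])                  # best losses first
        configs = [c for _, c in scores[: max(1, len(configs) // eta)]]
        budget *= eta                                    # survivors train longer
    return configs[0]

print(successive_halving())
```

Each rung trains `eta` times fewer configurations for `eta` times longer, so most of the compute goes to the configurations that survive the early cuts.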
Here are some highlights from our conversation:
Deep learning
I would say that you hear a lot about the modeling problems associated with deep learning. How do I frame my problem as a machine learning problem? How do I pick my architecture? How do I debug things when things go wrong? … What we’ve seen in practice is that, maybe somewhat surprisingly, the biggest challenges ML engineers face are actually due to the lack of tools and software for deep learning. These are hybrid systems/ML problems, very similar to the sorts of research that came out of the AMPLab.
… Things like TensorFlow and Keras, and a lot of the other platforms you mentioned, are a great step forward. They’re really good at abstracting away the low-level details of a particular learning architecture. In five lines, you can describe your architecture, and then you can also specify what algorithms you want to use for training.
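To make the “five lines” point concrete, here is roughly what such a definition looks like in Keras; the layer sizes and input shape are illustrative assumptions, not a model from the episode:

```python
from tensorflow import keras

# Describe the architecture in a few lines...
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])
# ...then specify the training algorithm (optimizer, loss, metrics).
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```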
There are a lot of other systems challenges associated with actually going end to end, from data to a deployed model, and existing software solutions don’t really tackle a big set of them. For example, regardless of the software you’re using, it takes days to weeks to train a deep learning model. There are real open challenges in how best to use parallel and distributed computing, both to train a particular model and to tune the hyperparameters of different models.
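As one example of the parallelism tooling this points to, TensorFlow’s `tf.distribute.MirroredStrategy` mirrors a model across the GPUs of a single machine and averages gradients with an all-reduce. A minimal sketch, assuming a multi-GPU box and the same toy model as above:

```python
import tensorflow as tf

# Data-parallel training across the GPUs visible on this machine.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created here are mirrored onto every replica.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...) then splits each batch across replicas and
# averages gradients with an all-reduce before applying updates.
```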
We also found that the vast majority of the organizations we’ve spoken to in the last year or so who are using deep learning for what I’d call mission-critical problems are actually doing it with on-premise hardware. Managing this hardware is a huge challenge, and it’s something the machine learning engineers at these companies have to figure out for themselves. It’s a mismatch between their interests and their skills, but it’s something they have to take care of.
Understanding distributed training
To give a little more background, the idea behind this work started about four years ago, when there was no deep learning in Spark MLlib. We were trying to figure out how to perform distributed training of deep learning in Spark. Before getting our hands dirty and trying to implement anything, we wanted to do some back-of-the-envelope calculations to see what speed-ups we could hope to get.
… The two main ingredients here are just computation and communication. … We wanted to understand this landscape of distributed training, and, using Paleo, we’ve been able to get a good sense of it without actually running experiments. …
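Paleo’s analytical model is far more detailed, but the flavor of such a back-of-the-envelope estimate can be sketched as follows; every number here (FLOPs per step, device throughput, parameter count, bandwidth) is a made-up assumption, and communication is modeled as a simple ring all-reduce:

```python
def estimated_step_time(workers, flops_per_step=1e12, device_flops=1e13,
                        params=25e6, bytes_per_param=4, bandwidth=1e10):
    """Toy scaling model in the spirit of Paleo: per-step time is
    computation (split evenly across workers) plus the communication
    cost of a ring all-reduce over the gradients. All numbers are
    illustrative assumptions, not measurements."""
    compute = flops_per_step / (workers * device_flops)
    # A ring all-reduce moves ~2*(N-1)/N of the gradient bytes per worker.
    comm = 0.0 if workers == 1 else \
        2 * (workers - 1) / workers * params * bytes_per_param / bandwidth
    return compute + comm

base = estimated_step_time(1)
for n in (1, 2, 4, 8, 16):
    t = estimated_step_time(n)
    print(f"{n:2d} workers: {t*1e3:6.1f} ms/step, speed-up {base/t:4.1f}x")
```

Even this toy model shows why speed-ups saturate: the compute term shrinks as workers are added, while the communication term approaches a constant.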