O'Reilly Data Show Podcast

Building accessible tools for large-scale computation and machine learning



In this episode of the Data Show, I spoke with Eric Jonas, a postdoc in the new Berkeley Center for Computational Imaging. Jonas is also affiliated with UC Berkeley’s RISE Lab. It was at a RISE Lab event that he first announced Pywren, a framework that lets data enthusiasts proficient with Python run existing code at massive scale on Amazon Web Services. Jonas and his collaborators are working on a related project, NumPyWren, a system for linear algebra built on a serverless architecture. Their hope is that by lowering the barrier to large-scale (scientific) computation, we will see many more experiments and research projects from communities that have been unable to easily marshal massive compute resources. We talked about Bayesian machine learning, scientific computation, reinforcement learning, and his stint as an entrepreneur in the enterprise software space.
Here are some highlights from our conversation:
Pywren
The real enabling technology for us was when Amazon announced the availability of AWS Lambda, their microservices framework, in 2014. Following this prompting, I went home one weekend and thought, ‘I wonder how hard it is to take an arbitrary Python function and marshal it across the wire, get it running in Lambda; I wonder how many I can get at once?’ Thus, Pywren was born.
… Right now, we’re primarily focused on the entire scientific Python stack, so SciPy, NumPy, Pandas, Matplotlib, the whole ecosystem there. … One of the challenges with all of these frameworks and running these things on Lambda is that, right now, Lambda is a fairly constrained resource environment. Amazon will quite happily give you 3,000 cores in the next two seconds, but each one has a maximum runtime and a small amount of memory and a small amount of local disk. Part of the current active research thrust for Pywren is figuring out how to do more general-purpose computation within those resource limits. But right now, we mostly support everything you would encounter in your normal Python workflow—including Jupyter, NumPy, and scikit-learn.
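To make the idea of marshalling a Python function across the wire more concrete, here is a minimal local sketch. It is not Pywren's API or implementation: the helper names (`serialize_function`, `run_remote`, `fan_out`) are hypothetical, a thread pool stands in for Lambda workers, and the standard-library `marshal` module is used only because it keeps the example self-contained. It handles just simple functions with no closures, default arguments, or dependencies on module-level names.

```python
import builtins
import marshal
import types
from concurrent.futures import ThreadPoolExecutor


def serialize_function(func):
    """Turn a simple function into bytes by dumping its code object."""
    # Assumption: func has no closure, defaults, or module-level dependencies.
    return marshal.dumps(func.__code__)


def run_remote(payload, argument):
    """Simulated worker: rebuild the function from bytes and call it once."""
    code = marshal.loads(payload)
    # The rebuilt function only sees Python's built-ins, nothing else.
    func = types.FunctionType(code, {"__builtins__": builtins})
    return func(argument)


def fan_out(func, inputs, max_workers=8):
    """Send the same serialized function to many simulated workers."""
    payload = serialize_function(func)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each call receives only bytes plus one input, as a remote worker would.
        return list(pool.map(lambda item: run_remote(payload, item), inputs))


def square(x):
    return x * x


results = fan_out(square, range(5))
print(results)
```

In a real serverless setting each worker would also be subject to the runtime, memory, and disk limits described above, so inputs would need to be split into pieces small enough to fit within them.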
NumPyWren
Chris Ré has this nice quote: ‘Why is it easier to train a bidirectional LSTM with attention than it is to just compute the SVD of a giant matrix?’ One of these things is actually fantastically more complicated than the other, but right now, our linear algebra tools are just such an impediment to doing that sort of large-scale computation. We hope NumPyWren will enable this class of work for the machine learning community.
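One reason linear algebra can fit a model of many small, resource-limited workers is that large matrix operations can be broken into independent operations on small blocks. The sketch below illustrates that general idea with a tiled matrix multiply in plain Python. It is an assumption-laden illustration, not a description of how NumPyWren is implemented, and all function names are hypothetical.

```python
def split_into_tiles(matrix, tile):
    """Map (block_row, block_col) to a sub-matrix of at most tile x tile."""
    rows, cols = len(matrix), len(matrix[0])
    return {
        (i // tile, j // tile): [row[j:j + tile] for row in matrix[i:i + tile]]
        for i in range(0, rows, tile)
        for j in range(0, cols, tile)
    }


def multiply_tiles(a, b):
    """One independent task: the product of two small tiles."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_tiles(a, b):
    """Element-wise sum of two tiles of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tiled_matmul(a, b, tile):
    """Multiply a by b one tile at a time and reassemble the full result."""
    a_tiles, b_tiles = split_into_tiles(a, tile), split_into_tiles(b, tile)
    n_i = max(i for i, _ in a_tiles) + 1
    n_k = max(k for _, k in a_tiles) + 1
    n_j = max(j for _, j in b_tiles) + 1
    result = []
    for i in range(n_i):
        row_tiles = []
        for j in range(n_j):
            acc = None
            for k in range(n_k):
                # Each tile product could run on a separate worker.
                partial = multiply_tiles(a_tiles[(i, k)], b_tiles[(k, j)])
                acc = partial if acc is None else add_tiles(acc, partial)
            row_tiles.append(acc)
        # Stitch the tiles of this block row back into full rows.
        for parts in zip(*row_tiles):
            result.append([value for part in parts for value in part])
    return result


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
product = tiled_matmul(A, B, tile=2)
print(product)
```

Here every call to `multiply_tiles` touches only two small blocks, which is the property that makes this kind of decomposition a plausible fit for workers with little memory and short runtimes.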
The growing importance of reinforcement learning
Ben Recht makes the argument that the most interesting problems in machine learning right now involve taking action based upon your intelligence. I think he’s right about this—taking action based upon past data and doing it in a way that is safe and robust and reliable and all of these sorts of things. That is very much the domain that has traditionally been occupied by fields like control theory and reinforcement learning.
Reinforcement learning and Ray
Ray is an excellent platform for building large-scale distributed systems, and it’s much more Python-native than Spark was. Ray also has much more of a focus on real-time performance. A lot of the things that people are interested in with Ray revolve around doing things like large-scale reinforcement learning—and it just so happens that deep reinforcement learning is something that everyone’s really excited about.
Related resources:
“Optimization, compressed sensing, and large-scale machine learning pipelines”: the O’Reilly Data Show Podcast featuring Ben Recht.
“Notes from the first Ray meetup”
“Practical applications of reinforcement learning in industry”
“Building tools for the AI applications of tomorrow”
“Toward the Jet Age of machine learning”
O'Reilly Data Show Podcast, by O'Reilly Media