August 30, 2018

Building accessible tools for large-scale computation and machine learning

53 minutes

In this episode of the Data Show, I spoke with Eric Jonas, a postdoc in the new Berkeley Center for Computational Imaging. Jonas is also affiliated with UC Berkeley’s RISE Lab. It was at a RISE Lab event that he first announced Pywren, a framework that lets data enthusiasts proficient with Python run existing code at massive scale on Amazon Web Services. Jonas and his collaborators are working on a related project, NumPyWren, a system for linear algebra built on a serverless architecture. Their hope is that lowering the barrier to large-scale (scientific) computation will lead to many more experiments and research projects from communities that have been unable to easily marshal massive compute resources. We talked about Bayesian machine learning, scientific computation, reinforcement learning, and his stint as an entrepreneur in the enterprise software space.

Here are some highlights from our conversation:

Pywren

The real enabling technology for us was when Amazon announced the availability of AWS Lambda, their microservices framework, in 2014. Following this prompting, I went home one weekend and thought, ‘I wonder how hard it is to take an arbitrary Python function and marshal it across the wire, get it running in Lambda; I wonder how many I can get at once?’ Thus, Pywren was born.
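To make the idea of marshaling "an arbitrary Python function across the wire" concrete, here is a minimal local sketch. It is not Pywren's implementation or API: the helper names (`serialize_call`, `run_remote`, `map_serverless_sketch`) are hypothetical, a thread pool stands in for Lambda workers, and the approach only handles simple functions with no closures or references to module-level globals.

```python
import builtins
import marshal
import pickle
import types
from concurrent.futures import ThreadPoolExecutor


def serialize_call(func, arg):
    """Pack a function's bytecode and one argument into a bytes payload."""
    payload = {
        "code": marshal.dumps(func.__code__),  # raw bytecode of the function
        "name": func.__name__,
        "arg": pickle.dumps(arg),
    }
    return pickle.dumps(payload)


def run_remote(blob):
    """Hypothetical worker: rebuild the function from bytes and call it."""
    payload = pickle.loads(blob)
    code = marshal.loads(payload["code"])
    # Rebuild with a fresh globals dict; only builtins are available inside.
    func = types.FunctionType(code, {"__builtins__": builtins}, payload["name"])
    result = func(pickle.loads(payload["arg"]))
    return pickle.dumps(result)


def map_serverless_sketch(func, items, max_workers=4):
    """Fan one serialized call per item out to a pool of stand-in workers."""
    blobs = [serialize_call(func, item) for item in items]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [pickle.loads(out) for out in pool.map(run_remote, blobs)]


def add_seven(x):
    return x + 7


results = map_serverless_sketch(add_seven, [1, 2, 3])
print(results)
```

The point of the sketch is the shape of the workflow: the function and its input travel as bytes, each call runs independently, and the results come back as bytes. A real system would also have to ship dependencies and handle failures, which this example leaves out.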

… Right now, we’re primarily focused on the entire scientific Python stack, so SciPy, NumPy, Pandas, Matplotlib, the whole ecosystem there. … One of the challenges with all of these frameworks and running these things on Lambda is that, right now, Lambda is a fairly constrained resource environment. Amazon will quite happily give you 3,000 cores in the next two seconds, but each one has a maximum runtime and a small amount of memory and a small amount of local disk. Part of the current active research thrust for Pywren is figuring out how to do more general-purpose computation within those resource limits. But right now, we mostly support everything you would encounter in your normal Python workflow—including Jupyter, NumPy, and scikit-learn.
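One common way to work within per-task limits like these is to split a job into bounded chunks, process each chunk independently, and combine the partial results. The sketch below illustrates that pattern locally; the function names and the chunk size are illustrative assumptions, a thread pool stands in for many small workers, and it is not drawn from Pywren itself.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def partial_sum_of_squares(chunk):
    """Work done by one small task: it only ever sees a bounded chunk."""
    return sum(x * x for x in chunk)


def sum_of_squares(items, chunk_size=1000, max_workers=8):
    """Fan chunks out to stand-in workers and combine the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(partial_sum_of_squares, chunked(items, chunk_size))
        return sum(partials)


total = sum_of_squares(list(range(10_000)), chunk_size=1000)
print(total)
```

Because each task touches only one chunk, its memory use and runtime are bounded by the chunk size rather than by the size of the whole job.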

NumPyWren

Chris Ré has this nice quote: ‘Why is it easier to train a bidirectional LSTM with attention than it is to just compute the SVD of a giant matrix?’ One of these things is actually fantastically more complicated than the other, but right now, our linear algebra tools are just such an impediment to doing that sort of large-scale computation. We hope NumPyWren will enable this class of work for the machine learning community.

The growing importance of reinforcement learning

Ben Recht makes the argument that the most interesting problems in machine learning right now involve taking action based upon your intelligence. I think he’s right about this—taking action based upon past data and doing it in a way that is safe and robust and reliable and all of these sorts of things. That is very much the domain that has traditionally been occupied by fields like control theory and reinforcement learning.

Reinforcement learning and Ray

Ray is an excellent platform for building large-scale distributed systems, and it’s much more Python-native than Spark was. Ray also has much more of a focus on real-time performance. A lot of the things that people are interested in with Ray revolve around doing things like large-scale reinforcement learning—and it just so happens that deep reinforcement learning is something that everyone’s really excited about.

Related resources:

“Optimization, compressed sensing, and large-scale machine learning pipelines”: the O’Reilly Data Show Podcast featuring Ben Recht.

“Notes from the first Ray meetup”

“Practical applications of reinforcement learning in industry”

“Building tools for the AI applications of tomorrow”

“Toward the Jet Age of machine learning”
