With the release of Spark 2.0, streaming becomes much more accessible to users. By adopting a continuous processing model (on an infinite table), the developers of Spark have enabled users of its SQL or DataFrame APIs to extend their analytic capabilities to unbounded streams.

Within the Spark community, Databricks engineer Michael Armbrust is well known for having led the long-term project to move Spark’s interactive analytics engine from Shark to Spark SQL. (Full disclosure: I’m an advisor to Databricks.) Most recently, he has turned his efforts to helping introduce a much simpler stream processing model to Spark Streaming (“structured streaming”).
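One way to picture the "infinite table" idea is as a query whose result is kept up to date as new rows are appended. The sketch below is plain Python, not Spark's API; the function names and the sample events are hypothetical and exist only for illustration. It shows the same group-by count written once as a batch query and once as an incremental query over micro-batches, with the streaming result matching the batch result over all rows seen so far.

```python
from collections import Counter
from typing import Iterable, Iterator, Sequence


def batch_count(rows: Iterable[str]) -> Counter:
    """Batch query: group-by count over a complete, bounded table."""
    return Counter(rows)


def streaming_count(micro_batches: Iterable[Sequence[str]]) -> Iterator[Counter]:
    """The same query over a table that keeps growing, one micro-batch at a time."""
    running = Counter()
    for batch in micro_batches:
        running.update(batch)    # fold only the newly arrived rows into the result
        yield Counter(running)   # snapshot of the result table so far


# Hypothetical event stream, split into three micro-batches.
batches = [["click", "view"], ["view"], ["click", "click"]]
snapshots = list(streaming_count(batches))
all_rows = [row for batch in batches for row in batch]

# After the last micro-batch, the streaming answer equals the batch answer.
print(snapshots[-1] == batch_count(all_rows))
```

The point of the sketch is the equivalence: the user writes one query, and the engine is responsible for updating the answer incrementally instead of recomputing it from scratch.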

Structured streaming comes to Apache Spark 2.0

The O'Reilly Data Show Podcast explores the opportunities and techniques driving big data, data science, and AI.

In this episode of the Data Show, I speak with Peter Bailis, founder and CEO of Sisu, a startup that is using machine learning to improve operational analytics. Bailis is also an assistant professor of computer science at Stanford University, where he conducts research into data-intensive systems and where he is co-founder of the DAWN Lab.
We had a great conversation spanning many topics, including:

His personal blog, which contains some of the best explainers on emerging topics in data management and distributed systems.
The role of machine learning in operational analytics and business intelligence.
Machine learning benchmarks—specifically two recent ML initiatives that he’s been involved with: DAWNBench and MLPerf.
Trends in data management and in tools for machine learning development, governance, and operations.

Related resources:

“Setting benchmarks in machine learning”: Dave Patterson, Peter Bailis, and other industry leaders discuss how MLPerf will define an entire suite of benchmarks to measure performance of software, hardware, and cloud systems.
“The quest for high-quality data”
“RISELab’s AutoPandas hints at automation tech that will change the nature of software development”
Jeff Jonas on “Real-time entity resolution made accessible”
“What are model governance and model operations?”
“We need to build machine learning tools to augment machine learning engineers”

Machine learning for operational analytics and business intelligence

In this episode of the Data Show, I speak with Arun Kejariwal of Facebook and Ira Cohen of Anodot (full disclosure: I’m an advisor to Anodot). This conversation stemmed from a recent online panel discussion we did, where we discussed time series data, and, specifically, anomaly detection and forecasting. Both Kejariwal (at Machine Zone, Twitter, and Facebook) and Cohen (at HP and Anodot) have extensive experience building analytic and machine learning solutions at large scale, and both have worked extensively with time-series data. The growing interest in AI and machine learning has not been confined to computer vision, speech technologies, or text. In the enterprise, there is strong interest in using similar automation tools for temporal data and time series.
We had a great conversation spanning many topics, including:

Why businesses should care about anomaly detection and forecasting; specifically, we delve into examples outside of IT Operations & Monitoring.
(Specialized) techniques and tools for automating some of the relevant tasks, including signal processing, statistical methods, and machine learning.
The key features of an anomaly detection or forecasting system.
What lies ahead for large-scale systems for time series analysis.
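As a concrete illustration of the statistical methods mentioned above, here is a minimal sketch of one common baseline: flagging points that sit far from the mean of a trailing window. It is not a technique attributed to either guest; the function name, window size, threshold, and sample series are all assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def rolling_zscore_anomalies(
    values: Sequence[float], window: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A flat history gives no scale to compare against; skipped in this sketch.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical metric with a single spike at index 7.
series = [10.0, 11.0, 10.0, 12.0, 11.0, 10.0, 11.0, 50.0, 11.0, 10.0]
flagged = rolling_zscore_anomalies(series)
print(flagged)
```

Note that the spike inflates the mean and standard deviation of the windows that follow it, which is one reason production systems tend to use more robust statistics, seasonality models, or learned models rather than this baseline.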

Related resources:

“Product management in the machine learning era” – a new tutorial at the Artificial Intelligence Conference in London
“One simple chart: Who is interested in Apache Pulsar?”
Ira Cohen: “Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series”
“Got speech? These guidelines will help you get started building voice applications”
“RISELab’s AutoPandas hints at automation tech that will change the nature of software development”
Ameet Talwalkar: “How to train and deploy deep learning at scale”

Machine learning and analytics for time series data

In this episode of the Data Show, I speak with Michael Mahoney, a member of RISELab, the International Computer Science Institute, and the Department of Statistics at UC Berkeley. A physicist by training, Mahoney has been at the forefront of many important problems in large-scale data analysis. On the theoretical side, his work spans algorithmic and statistical methods for matrices, graphs, regression, optimization, and related problems. On the applications side, he has contributed to systems used for internet and social media analysis, social network analysis, as well as for a host of applications in the physical and life sciences. Most recently, he has been working on deep neural networks, specifically developing theoretical methods and practical diagnostic tools that should be helpful to practitioners who use deep learning.
Analyzing deep neural networks with WeightWatcher. Image by Michael Mahoney and Charles Martin, used with permission.
We had a great conversation spanning many topics, including:

The class of problems in big data, machine learning, and data analysis that he has worked on at Yahoo, Stanford, and Berkeley.
The new UC Berkeley FODA (Foundations of Data Analysis) Institute.
HAWQ (Hessian AWare Quantization of Neural Networks with Mixed-Precision), a new framework for addressing problems pertaining to model size and inference speed/power in deep learning.
WeightWatcher: a new open source project for predicting the accuracy of deep neural networks. WeightWatcher stems from a recent series of papers with Charles Martin, of Calculation Consulting.

Related resources:

“Deep learning at scale: Tools and solutions” – a new tutorial at the Artificial Intelligence Conference in San Jose
Ameet Talwalkar on “How to train and deploy deep learning at scale”
Greg Diamos on “How big compute is powering the deep learning rocket ship”
“RISELab’s AutoPandas hints at automation tech that will change the nature of software development”
Reza Zadeh on “Scaling machine learning”
“Becoming a machine learning company means investing in foundational technologies”
“Managing risk in machine learning”
“What are model governance and model operations?”
“Product management in the machine learning era”: a tutorial at the Artificial Intelligence Conference in San Jose

Understanding deep neural networks

In this episode of the Data Show, I speak with Kesha Williams, technical instructor at A Cloud Guru, a training company focused on cloud computing. As a full stack web developer, Williams became intrigued by machine learning and started teaching herself the ML tools on Amazon Web Services. Fast forward to today, Williams has built some well-regarded Alexa skills, mastered ML services on AWS, and has now firmly added machine learning to her developer toolkit.
Anatomy of an Alexa skill. Image by Kesha Williams, used with permission.
We had a great conversation spanning many topics, including:

How she got started and made the transition into a full-fledged machine learning practitioner.
The evolution of ML tools and learning resources, and how accessible they’ve become for developers.
How to build and monetize Alexa skills. Along the way, we took a deep dive and discussed some of the more interesting Alexa skills she has built, as well as one that she really admires.

Related resources:

“Product management in the machine learning era”: a new tutorial session at the Artificial Intelligence Conference in London
Cassie Kozyrkov: “Make data science more useful”
Kartik Hosanagar: “Algorithms are shaping our lives—here’s how we wrest back control”
Francesca Lazzeri and Jaya Mathew: “Lessons learned while helping enterprises adopt machine learning”
Jerry Overton: “Teaching and implementing data science and AI in the enterprise”
“Becoming a machine learning company means investing in foundational technologies”
“Managing risk in machine learning”
“What are model governance and model operations?”

Becoming a machine learning practitioner

In this episode of the Data Show, I speak with Alex Ratner, project lead for Stanford’s Snorkel open source project; Ratner also recently garnered a faculty position at the University of Washington and is currently working on a company supporting and extending the Snorkel project. Snorkel is a framework for building and managing training data. Based on our survey from earlier this year, labeled data remains a key bottleneck for organizations building machine learning applications and services.
Ratner was a guest on the podcast a little over two years ago when Snorkel was a relatively new project. Since then, Snorkel has added more features, expanded into computer vision use cases, and now boasts many users, including Google, Intel, IBM, and other organizations. Along with his thesis advisor, Professor Chris Ré of Stanford, Ratner and his collaborators have long championed the importance of building tools aimed squarely at helping teams build and manage training data. With today’s release of Snorkel version 0.9, we are a step closer to having a framework that enables the programmatic creation of training data sets.
Snorkel pipeline for data labeling. Source: Alex Ratner, used with permission.
We had a great conversation spanning many topics, including:

Why he and his collaborators decided to focus on “data programming” and tools for building and managing training data.
A tour through Snorkel, including its target users and key components.
What’s in the newly released version (v 0.9) of Snorkel.
Common use cases for Snorkel, whose user base has grown quite a bit since we last spoke.
Data lineage, AutoML, and end-to-end automation of machine learning pipelines.
HoloClean and other projects focused on data quality and data programming.
The need for tools that can ease the transition from raw data to derived data (e.g., entities), insights, and even knowledge.
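To make "programmatic creation of training data" more concrete, here is a minimal sketch of the idea behind data programming: instead of labeling examples by hand, you write small heuristic functions that each vote on a label or abstain, then combine the votes. This is plain Python and not Snorkel's API; the labeling functions, label constants, and sample messages are hypothetical, and a simple majority vote stands in for the more sophisticated way Snorkel combines noisy votes.

```python
from collections import Counter
from typing import Callable, List

ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    """Heuristic: messages containing a link are probably spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Heuristic: messages mentioning a prize are probably spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    """Heuristic: very short messages are probably legitimate."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_contains_link,
    lf_mentions_prize,
    lf_short_message,
]


def majority_label(text: str) -> int:
    """Combine the non-abstaining votes; abstain when there are none or on a tie."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


docs = [
    "Claim your prize at http://example.com today",
    "See you at lunch",
    "The quarterly report is attached for your review",
]
labels = [majority_label(d) for d in docs]
print(labels)
```

Examples on which every function abstains stay unlabeled, which is a useful signal in itself: it shows where another heuristic is needed.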

Related resources:

“Product management in the machine learning era”: A tutorial at the Artificial Intelligence Conference in San Jose, September 9-12, 2019.
Chris Ré: “Software 2.0 and Snorkel”
Alex Ratner: “Creating large training data sets quickly”
Ihab Ilyas and Ben Lorica on “The quest for high-quality data”
Roger Chen: “Acquiring and sharing high-quality data”
Jeff Jonas on “Real-time entity resolution made accessible”
“Data collection and data markets in the age of privacy and machine learning”

