Machine Learning Archives - Software Engineering Daily

Reflow: Distributed Incremental Processing with Marius Eriksen


Listen Later

The volume of data in the world is always increasing. The costs of storing that data is always decreasing. And the means for processing that data is always evolving.

Sensors, cameras, and other small computers gather large quantities of data from the physical world around us. User analytics tools gather information about how we are interacting with the Internet. Logging servers collect terabytes of records about how our systems are performing.

From the popularity of MapReduce, to the rise of open source distributed processing frameworks like Spark and Flink, to the wide variety of cloud services like BigQuery: there is an endless set of choices for how to analyze gigantic sets of data.

Machine learning training and inference is another dimension of the modern data engineering stack. Whereas tools like Spark and BigQuery are great for performing ad-hoc queries, systems like TensorFlow are optimized for the model training and deployment process.

Stitching together these tools allows a developer to compose workflows for how data pipelines progress through a data engineering system. One popular tool for this is Apache Airflow, which was created in 2014 and is widely used at companies like Airbnb.

Over the next few years, we will see a proliferation of new tools in the world of data engineering–and for good reason. There is a wealth of opportunity for companies to leverage their data to make better decisions, and potentially to clean and offer their internal data as APIs and pre-trained machine learning models.

Today, there is a vast number of enterprises who are modernizing their software development process with Kubernetes, cloud providers, and continuous delivery. Eventually, these enterprises will improve their complex software architecture, and will move from a defensive position to an offensive one. These enterprises will shift their modernization efforts from “DevOps” to “DataOps”, and thousands of software vendors will be ready to sell them new software for modernizing their data platform.

There is not a consensus for the best way to build and run a “data platform”. Nearly every company we have talked to on the show has a different definition and a different architecture for their “data platform”: Doordash, Dremio, Prisma, Uber, MapR, Snowflake, Confluent, Databricks

We don’t expect to have a concise answer for how to run a data platform any time soon–but on the bright side, data infrastructure seems to be improving. Companies are increasingly able to ask questions about their data and get quick answers, in contrast to the data breadlines that were so prevalent five years ago.

Today we cover yet another approach to large scale data processing.

Reflow is a system for incremental data processing in the cloud. Reflow includes a functional, domain specific language for writing workflow programs, a runtime for evaluating those programs incrementally, and a scheduler for dynamically provisioning resources for those workflows. Reflow was created for large bioinformatics workloads, but should be broadly applicable to scientific and engineering computing workloads.

Reflow  evaluates programs incrementally. Whenever the input data or the program changes, only the outputs that depend on the changes are recomputed. This minimizes the amount of recomputation that needs to be performed across a computational graph.

Marius Eriksen is the creator of Reflow and an engineer at GRAIL. He joins the show to discuss the motivation for a new data processing system–which involves explaining why workloads in bioinformatics are different than in some other domains.

The post Reflow: Distributed Incremental Processing with Marius Eriksen appeared first on Software Engineering Daily.

...more
View all episodesView all episodes
Download on the App Store

Machine Learning Archives - Software Engineering DailyBy Machine Learning Archives - Software Engineering Daily

  • 4.4
  • 4.4
  • 4.4
  • 4.4
  • 4.4

4.4

69 ratings


More shows like Machine Learning Archives - Software Engineering Daily

View all
The Changelog: Software Development, Open Source by Changelog Media

The Changelog: Software Development, Open Source

285 Listeners

Data Skeptic by Kyle Polich

Data Skeptic

474 Listeners

Talk Python To Me by Michael Kennedy

Talk Python To Me

584 Listeners

Software Engineering Daily by Software Engineering Daily

Software Engineering Daily

630 Listeners

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) by Sam Charrington

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

429 Listeners

Super Data Science: ML & AI Podcast with Jon Krohn by Jon Krohn

Super Data Science: ML & AI Podcast with Jon Krohn

295 Listeners

Python Bytes by Michael Kennedy and Brian Okken

Python Bytes

212 Listeners

NVIDIA AI Podcast by NVIDIA

NVIDIA AI Podcast

321 Listeners

Syntax - Tasty Web Development Treats by Wes Bos & Scott Tolinski - Full Stack JavaScript Web Developers

Syntax - Tasty Web Development Treats

985 Listeners

DataFramed by DataCamp

DataFramed

267 Listeners

Practical AI by Practical AI LLC

Practical AI

196 Listeners

Last Week in AI by Skynet Today

Last Week in AI

275 Listeners

AI Chat: ChatGPT & AI News, Artificial Intelligence, OpenAI, Machine Learning by Jaeden Schafer

AI Chat: ChatGPT & AI News, Artificial Intelligence, OpenAI, Machine Learning

143 Listeners

This Day in AI Podcast by Michael Sharkey, Chris Sharkey

This Day in AI Podcast

193 Listeners

The AI Daily Brief (Formerly The AI Breakdown): Artificial Intelligence News and Analysis by Nathaniel Whittemore

The AI Daily Brief (Formerly The AI Breakdown): Artificial Intelligence News and Analysis

420 Listeners