September 15, 2020

Distributed In Memory Processing And Streaming With Hazelcast

44 minutes

Summary

In-memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable, high-throughput data transmission. In this episode, Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first-class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey, and today I’m interviewing Dale Kim about Hazelcast, a distributed in-memory computing platform for data-intensive applications.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing what Hazelcast is and its origins?

What are the benefits and tradeoffs of in-memory computation for data-intensive workloads?

What are some of the common use cases for the Hazelcast in-memory grid?

How is Hazelcast implemented?

How has the architecture evolved since it was first created?

How is the Jet streaming framework architected?

What was the motivation for building it?

How do the capabilities of Jet compare to systems such as Flink or Spark Streaming?

How has the introduction of hardware capabilities such as NVMe drives influenced the market for in-memory systems?

How is the governance of the open source grid and Jet projects handled?

What is the guiding heuristic for which capabilities or features to include in the open source projects vs. the commercial offerings?

What is involved in building an application or workflow on top of Hazelcast?

What are the common patterns for engineers who are building on top of Hazelcast?

What is involved in deploying and maintaining an installation of the Hazelcast grid or Jet streaming?

What are the scaling factors for Hazelcast?

What are the edge cases that users should be aware of?

What are some of the most interesting, innovative, or unexpected ways that you have seen Hazelcast used?

When is Hazelcast Grid or Jet the wrong choice?

What is in store for the future of Hazelcast?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.

If you’ve learned something or tried out a project from the show, then tell us about it! Email [email protected] with your story.

To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Links

Hazelcast

Istanbul

Apache Spark

OrientDB

CAP Theorem

NVMe

Memristors

Intel Optane Persistent Memory

Hazelcast Jet

Kappa Architecture

IBM Cloud Paks

Digital Integration Hub (Gartner)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

By Tobias Macey
