February 19, 2023

The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse

55 minutes

Summary

Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data, they constrain what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that combine the scalability and flexibility of data lakes with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing at his new business, Tabular, to make it even easier to implement and maintain.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.

Your host is Tobias Macey, and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?

Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations?

What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018?

Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?

Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?

For someone who wants to manage their data in Iceberg tables, what does the implementation look like?

How does that change based on the type of query/processing engine being used?

Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance?

What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular?

When is Iceberg/Tabular the wrong choice?

What do you have planned for the future of Iceberg/Tabular?

Contact Info

rdblue on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.

If you've learned something or tried out a project from the show, then tell us about it! Email [email protected] with your story.

To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Iceberg

Podcast Episode

Hadoop

Data Lakehouse

ACID == Atomic, Consistent, Isolated, Durable

Apache Hive

Apache Impala

Bodo

Podcast Episode

StarRocks

Dremio

Podcast Episode

DDL == Data Definition Language

Trino

PrestoDB

Apache Hudi

Podcast Episode

dbt

Apache Flink

TileDB

Podcast Episode

CDC == Change Data Capture

Substrait

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA