Data Engineering Podcast

The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse



Summary

Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because they take complete ownership of your data, they constrain what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses, which combine the scalability and flexibility of data lakes with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business, Tabular, to make it even easier to implement and maintain.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.
  • Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?
    • Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations?
  • What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018?
  • Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?
    • Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?
  • For someone who wants to manage their data in Iceberg tables, what does the implementation look like?
    • How does that change based on the type of query/processing engine being used?
  • Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance?
  • What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular?
  • When is Iceberg/Tabular the wrong choice?
  • What do you have planned for the future of Iceberg/Tabular?

Contact Info
  • LinkedIn
  • rdblue on GitHub

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links
  • Iceberg
    • Podcast Episode
  • Hadoop
  • Data Lakehouse
  • ACID == Atomic, Consistent, Isolated, Durable
  • Apache Hive
  • Apache Impala
  • Bodo
    • Podcast Episode
  • StarRocks
  • Dremio
    • Podcast Episode
  • DDL == Data Definition Language
  • Trino
  • PrestoDB
  • Apache Hudi
    • Podcast Episode
  • dbt
  • Apache Flink
  • TileDB
    • Podcast Episode
  • CDC == Change Data Capture
  • Substrait
  • The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

  • Acryl: The modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next generation multi-cloud metadata management platform. Founded by the leaders who created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Sign up for the SaaS product today at [dataengineeringpodcast.com/acryl](https://www.dataengineeringpodcast.com/acryl)
