Data Engineering Podcast

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59


Summary

Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.

Preamble
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Zookeeper is and how the project got started?
    • What are the main motivations for using a centralized coordination service for distributed systems?
  • What are the distributed systems primitives that are built into Zookeeper?
    • What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
    • What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
  • Can you discuss how Zookeeper is architected and how that design has evolved over time?
    • What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
  • What are the scaling factors for Zookeeper?
    • What are the edge cases that users should be aware of?
    • Where does it fall on the axes of the CAP theorem?
  • What are the main failure modes for Zookeeper?
    • How much of the recovery logic is left up to the end user of the Zookeeper cluster?
  • Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
  • In recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
    • What are some of the cases where Zookeeper is the wrong choice?
  • How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
  • If you were to start the project over today, what would you do differently?
    • Would you still use Java?
  • What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
  • What do you have planned for the future of Zookeeper?

Contact Info
  • @phunt on Twitter

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • Zookeeper
  • Cloudera
  • Google Chubby
  • Sourceforge
  • HBase
  • High Availability
  • Fallacies of distributed computing
  • Falsehoods programmers believe about networking
  • Consul
  • EtcD
  • Apache Curator
  • Raft Consensus Algorithm
  • Zookeeper Atomic Broadcast
  • SSD Write Cliff
  • Apache Kafka
  • Apache Flink
    • Podcast Episode
  • HDFS
  • Kubernetes
  • Netty
  • Protocol Buffers
  • Avro
  • Rust

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Engineering Podcast, by Tobias Macey
