Data Engineering Podcast

Add Version Control To Your Data Lake With LakeFS



Summary

Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits come additional complexities, including how to safely integrate new data sources or test changes to existing pipelines. To address these challenges, the team at Treeverse created LakeFS, which introduces version control capabilities to your storage layer. In this episode, Einat Orr and Oz Katz explain how they implemented branching and merging for object storage, best practices for using versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
  • Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
  • Your host is Tobias Macey and today I’m interviewing Einat Orr and Oz Katz about their work at Treeverse on the LakeFS system for versioning your data lakes the same way you version your code.

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what LakeFS is and why you built it?
    • There are a number of tools and platforms that support data virtualization and data versioning. How does LakeFS compare to the available options? (e.g. Alluxio, Denodo, Pachyderm, DVC, etc.)
  • What are the primary use cases that LakeFS enables?
  • For someone who wants to use LakeFS what is involved in getting it set up?
  • How is LakeFS implemented?
    • How has the design of the system changed or evolved since you began working on it?
    • What assumptions did you have going into it which have since been invalidated or modified?
  • How does the workflow for an engineer or analyst change from working directly against S3 to running against the LakeFS interface?
  • How do you handle merge conflicts and resolution?
    • What are some of the potential edge cases or foot guns that they should be aware of when there are multiple people using the same repository?
  • How do you approach management of the data lifecycle or garbage collection to avoid ballooning the cost of storage for a dataset that is tracking a high number of branches with diverging commits?
  • Given that S3 and GCS are eventually consistent storage layers, how do you handle snapshots/transactionality of the data you are working with?
  • What are the axes for scaling an installation of LakeFS?
    • What are the limitations in terms of size or geographic distribution of the datasets?
  • What are some of the most interesting, unexpected, or innovative ways that you have seen LakeFS being used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while building LakeFS?
  • When is LakeFS the wrong choice?
  • What do you have planned for the future of the project?

Contact Info
  • Einat Orr
  • Oz Katz

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Links
  • Treeverse
  • LakeFS
    • GitHub
    • Documentation
  • lakeFS Slack Channel
  • SimilarWeb
  • Kaggle
  • DagsHub
  • Alluxio
  • Pachyderm
  • DVC
  • ML Ops (Machine Learning Operations)
  • DoltHub
  • Delta Lake
    • Podcast Episode
  • Hudi
  • Iceberg Table Format
    • Podcast Episode
  • Kubernetes
  • PostgreSQL
    • Podcast Episode
  • Git
  • Spark
  • Presto
  • CockroachDB
  • YugabyteDB
  • Citus
  • Hive Metastore
  • Iceberg Table Format
  • Immunai
  • The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
