Data Engineering Podcast

Metadata Management And Integration At LinkedIn With DataHub


Summary

In order to scale the use of data across an organization, there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. In this episode, Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about DataHub, LinkedIn’s metadata management and data catalog platform

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what DataHub is and some of its back story?
    • What were you using at LinkedIn for metadata management prior to the introduction of DataHub?
    • What was lacking in the previous solutions that motivated you to create a new platform?
  • There are a large number of other systems available for building data catalogs and tracking metadata, both open source and proprietary. What are the features of DataHub that would lead someone to use it in place of the other options?
  • Who is the target audience for DataHub?
    • How do the needs of those end users influence or constrain your approach to the design and interfaces provided by DataHub?
  • Can you describe how DataHub is architected?
    • How has it evolved since you first began working on it?
  • What was your motivation for releasing DataHub as an open source project?
    • What have been the benefits of that decision?
    • What are the challenges that you face in maintaining changes between the public repository and your internally deployed instance?
  • What is the workflow for populating metadata into DataHub?
  • What are the challenges that you see in managing the format of metadata and establishing consistent models for the information being stored?
  • How do you handle discovery of data assets for users of DataHub?
  • What are the integration and extension points of the platform?
  • What is involved in deploying and maintaining an instance of the DataHub platform?
  • What are some of the most interesting or unexpected ways that you have seen DataHub used inside or outside of LinkedIn?
  • What are some of the most interesting, unexpected, or challenging lessons that you learned while building and working with DataHub?
  • When is DataHub the wrong choice?
  • What do you have planned for the future of the project?

Contact Info
  • Mars
    • LinkedIn
    • mars-lan on GitHub
  • Pardhu
    • LinkedIn

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
  • DataHub
  • Map/Reduce
  • Apache Flume
  • LinkedIn Blog Post introducing DataHub
  • WhereHows
  • Hive Metastore
  • Kafka
  • CDC == Change Data Capture
    • Podcast Episode
  • PDL LinkedIn language
  • GraphQL
  • Elasticsearch
  • Neo4J
  • Apache Pinot
  • Apache Gobblin
  • Apache Samza
  • Open Sourcing DataHub Blog Post

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

