Data Engineering Podcast

Insights And Advice On Building A Data Lake Platform From Someone Who Learned The Hard Way


Summary

Designing a data platform is a complex and iterative undertaking that requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds further layers of difficulty. Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl.
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
  • Your host is Tobias Macey and today I’m interviewing Srivatsan Sridharan about the technological, staffing, and design considerations for building a data platform

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what your experience has been with designing and implementing data platforms?
  • What are the elements that you have found to be common requirements across organizations and data characteristics?
  • What are the architectural elements that require the most detailed consideration based on organizational needs and data requirements?
  • How has the ecosystem for building maintainable and usable data lakes matured over the past few years?
    • What are the elements that are still cumbersome or intractable?
  • The streaming ecosystem has also gone through substantial changes over the past few years. What is your synopsis of the meaningful differences between today’s options and where we were ~6 years ago?
  • How did your experiences at Yelp inform your current architectural approach at Robinhood?
  • Can you describe your current platform architecture?
    • What are the primary capabilities that you are optimizing for?
  • What is your evaluation process for determining what components to use in your platform?
    • How do you approach the build vs. buy problem and quantify the tradeoffs?
  • What are the most interesting, innovative, or unexpected ways that you have seen your data systems used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing data platforms across your career?
  • When is a data lake architecture the wrong choice?
  • What do you have planned for the future of the data platform at Robinhood?

Contact Info
  • LinkedIn

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Links
  • Robinhood
  • Yelp
  • Kafka
  • Spark
  • Flink
    • Podcast Episode
  • Pulsar
    • Podcast Episode
  • Parquet
  • Change Data Capture
  • Delta Lake
    • Podcast Episode
  • Hudi
    • Podcast Episode
  • Redshift
  • BigQuery
  • Informatica
  • Data Mesh
    • Podcast Episode
  • PrestoDB
  • Trino
  • Airbyte
    • Podcast Episode
  • Meltano
    • Podcast Episode
  • Fivetran
    • Podcast Episode
  • Stitch
  • Pinot
    • Podcast Episode
  • Clickhouse
    • Podcast Episode
  • Druid
  • Iceberg
    • Podcast Episode
  • Looker
    • Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Engineering Podcast, by Tobias Macey