Data Engineering Podcast

Simplify Your Data Architecture With The Presto Distributed SQL Engine


Summary

Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations, which frequently requires cumbersome and time-consuming data integration. To address this problem, Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to query and combine data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premises database, or a collection of flat files, give this episode a listen and then try out Presto today.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What advice do you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host is Tobias Macey and today I’m interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what Presto is and its origin story?
    • What was the motivation for releasing Presto as open source?
  • For someone who is responsible for architecting their organization’s data platform, what are some of the signals that Presto will be a good fit for them?
    • What are the primary ways that Presto is being used?
  • I interviewed your colleague at Starburst, Kamil, 2 years ago. How has Presto changed or evolved in that time, both technically and in terms of community and ecosystem growth?
  • What are some of the deployment and scaling considerations that operators of Presto should be aware of?
  • What are the best practices that have been established for working with data through Presto in terms of centralizing in a data lake vs. federating across disparate storage locations?
  • What are the tradeoffs of using Presto on top of a data lake vs. a vertically integrated warehouse solution?
  • When designing the layout of a data lake that will be interacted with via Presto, what are some of the data modeling considerations that can improve the odds of success?
  • What are some of the most interesting, unexpected, or innovative ways that you have seen Presto used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while building, growing, and supporting the Presto project?
  • When is Presto the wrong choice?
  • What is in store for the future of the Presto project and community?

Contact Info
  • LinkedIn
  • @mtraverso on Twitter
  • martint on GitHub

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
  • Presto
  • Starburst Data
    • Podcast Episode
  • Hadoop
  • Hive
  • Glue Metastore
  • BigQuery
  • Kinesis
  • Apache Pinot
  • Elasticsearch
  • ORC
  • Parquet
  • AWS Redshift
  • Avro
    • Podcast Episode
  • LZ4
  • Zstandard
  • KafkaSQL
  • Flink
    • Podcast Episode
  • PyTorch
    • Podcast.__init__ Episode
  • Tensorflow
  • Spark

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
