Data Engineering Podcast

Data Infrastructure Automation For Private SaaS At Snowplow


Summary

One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers’ cloud accounts.

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of the components in your system architecture and the nature of your managed service?
  • What are some of the challenges that are inherent to the private SaaS nature of your managed service?
  • What elements of your system require the most attention and maintenance to keep them running properly?
  • Which components in the pipeline are most subject to variability in traffic or resource pressure, and what do you do to ensure proper capacity?
  • How do you manage deployment of the full Snowplow pipeline for your customers?
    • How has your strategy for deployment evolved since you first began offering the managed service?
    • How has the architecture of the pipeline evolved to simplify operations?
    • How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?
      • What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?
        • How does that reflect in the tooling that you use to manage their deployments?
  • What types of metrics do you track, and what do you use for monitoring and alerting to ensure that your customers’ pipelines are running smoothly?
  • What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?
  • What are some lessons that you can generalize for management of data infrastructure more broadly?
  • If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?
  • What do you have planned for the future of the Snowplow product and infrastructure management?

Contact Info
  • LinkedIn
  • jbeemster on GitHub
  • @jbeemster1 on Twitter

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
  • Snowplow Analytics
    • Podcast Episode
  • Terraform
  • Consul
  • Nomad
  • Meltdown Vulnerability
  • Spectre Vulnerability
  • AWS Kinesis
  • Elasticsearch
  • SnowflakeDB
  • Indicative
  • S3
  • Segment
  • AWS CloudWatch
  • Stackdriver
  • Apache Kafka
  • Apache Pulsar
  • Google Cloud Pub/Sub
  • AWS SQS
  • AWS SNS
  • AWS Redshift
  • Ansible
  • AWS CloudFormation
  • Kubernetes
  • AWS EMR

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


