Data Engineering Podcast

Build Maintainable And Testable Data Applications With Dagster



Summary

Although businesses have relied on useful and accurate data to succeed for decades, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications, Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows, then Dagster is definitely worth exploring.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time on data transformations and business needs rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Dagster is and the origin story for the project?
  • In the tagline for Dagster you describe it as "a system for building modern data applications". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term "data application" means, and the types of use cases that Dagster is well suited for?
  • Can you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?
    • What do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts?
  • What are some of the initial assumptions that you had which have been challenged or updated in the process of working with users of Dagster?
  • For someone who wants to extend Dagster, or integrate it with other components of their data infrastructure, such as a metadata engine, what interfaces do you provide for extensibility?
  • For someone who wants to get started with Dagster can you describe a typical workflow for writing a data pipeline?
    • Once they have something working, what is involved in deploying it?
  • One of the things that stands out about Dagster is the strong contracts that it enforces between computation nodes, or "solids". Why do you feel that those contracts are necessary, and what benefits do they provide during the full lifecycle of a data application?
  • Another difficult aspect of data applications is testing, both before and after deploying it to a production environment. How does Dagster help in that regard?
  • It is also challenging to keep track of the entirety of a DAG for a given workflow. How does Dagit keep track of the task dependencies, and what are the limitations of that tool?
  • Can you give an overview of where you see Dagster fitting in the overall ecosystem of data tools?
  • What are some of the features or capabilities of Dagster which are often overlooked that you would like to highlight for the listeners?
  • Your recent release of Dagster includes a built-in scheduler, as well as a built-in deployment capability. Why did you feel that those were necessary capabilities to incorporate, rather than continuing to leave that as end-user considerations?
  • You have built a new company around Dagster in the form of Elementl. How are you approaching sustainability and governance of Dagster, and what is your path to sustainability for the business?
  • What should listeners be keeping an eye out for in the near to medium future from Elementl and Dagster?
    • What is on your roadmap that you consider necessary before creating a 1.0 release?

Contact Info
  • @schrockn on Twitter
  • schrockn on GitHub
  • LinkedIn

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • Dagster
  • Elementl
  • ETL
  • GraphQL
  • React
  • Matei Zaharia
  • DataOps Episode
  • Kafka
  • Fivetran
    • Podcast Episode
  • Spark
  • Supervised Learning
  • DevOps
  • Luigi
  • Airflow
  • Dask
    • Podcast Episode
  • Kubernetes
  • Ray
  • Maxime Beauchemin
    • Podcast Interview
  • Dagster Testing Guide
  • Great Expectations
    • Podcast.__init__ Interview
  • Papermill
    • Notebooks At Netflix Episode
  • DBT
    • Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


