Data Engineering Podcast

Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata


Summary

Data engineering systems are complex and interconnected, with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. One approach to making this a tractable problem is to define and enforce contracts between producers and consumers of data. Ananth Packkildurai created Schemata to make the creation of schema contracts a lightweight process, so that dependency chains can be constructed and evolved iteratively and validation of changes can be integrated into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.
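To make the idea of validating a schema contract during delivery more concrete, here is a minimal sketch of a compatibility check between two versions of a schema. It is not Schemata's API or its scoring logic: the `Field` type, the `breaking_changes` function, and the two rules it applies (a removed field or a changed type breaks consumers) are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical description of one field in a schema contract."""
    name: str
    type: str


def breaking_changes(old: list[Field], new: list[Field]) -> list[str]:
    """Return the changes in `new` that could break consumers reading `old`.

    Illustrative rules only: removing a field or changing its type is
    treated as breaking; adding a field is treated as safe.
    """
    new_by_name = {f.name: f for f in new}
    problems = []
    for field in old:
        candidate = new_by_name.get(field.name)
        if candidate is None:
            problems.append(f"removed field: {field.name}")
        elif candidate.type != field.type:
            problems.append(
                f"type changed: {field.name} {field.type} -> {candidate.type}"
            )
    return problems


# Example contract versions (made-up fields).
v1 = [Field("user_id", "string"), Field("amount", "double")]
v2 = [
    Field("user_id", "string"),
    Field("amount", "string"),    # type change: flagged
    Field("currency", "string"),  # new field: allowed
]

report = breaking_changes(v1, v2)
for line in report:
    print(line)
```

A check along these lines could run in a delivery pipeline and fail the build whenever the report is non-empty, which is one way a contract between producer and consumer can be enforced before a change ships.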

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management.

  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

  • Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

  • Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.

  • Your host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modelling framework for decentralised domain-driven ownership of data.

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Schemata is and the story behind it?
    • How does the garbage in/garbage out problem manifest in data warehouse/data lake environments?
  • What are the different places in a data system that schema definitions need to be established?
    • What are the different ways that schema management gets complicated across those various points of interaction?
  • Can you walk me through the end-to-end flow of how Schemata integrates with engineering practices across an organization’s data lifecycle?
    • How does the use of Schemata help with capturing and propagating context that would otherwise be lost or siloed?
  • How is the Schemata utility implemented?
    • What are some of the design and scope questions that you had to work through while developing Schemata?
  • What is the broad vision that you have for Schemata and its impact on data practices?
  • How are you balancing the need for flexibility/adaptability with the desire for ease of adoption and quick wins?
  • The core of the utility is the generation of structured messages. How are those messages propagated, stored, and analyzed?
  • What are the pieces of Schemata and its usage that are still undefined?
  • What are the most interesting, innovative, or unexpected ways that you have seen Schemata used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Schemata?
  • When is Schemata the wrong choice?
  • What do you have planned for the future of Schemata?

Contact Info
  • ananthdurai on GitHub
  • @ananthdurai on Twitter
  • LinkedIn

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links
  • Schemata
  • Data Engineering Weekly
  • Zendesk
  • Ralph Kimball
  • Data Warehouse Toolkit
  • Iteratively
    • Podcast Episode
  • Protocol Buffers (protobuf)
  • Application Tracing
  • OpenTelemetry
  • Django
  • Spring Framework
  • Dependency Injection
  • JSON Schema
  • dbt
    • Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Engineering Podcast, by Tobias Macey