September 12, 2022

Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata

59 minutes

Summary

Data engineering systems are complex and interconnected, with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. To make this tractable, one approach is to define and enforce contracts between producers and consumers of data. Ananth Packkildurai created Schemata to make the creation of schema contracts a lightweight process, allowing dependency chains to be constructed and evolved iteratively and integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.

Your host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modelling framework for decentralised domain-driven ownership of data.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what Schemata is and the story behind it?

How does the garbage in/garbage out problem manifest in data warehouse/data lake environments?

What are the different places in a data system that schema definitions need to be established?

What are the different ways that schema management gets complicated across those various points of interaction?

Can you walk me through the end-to-end flow of how Schemata integrates with engineering practices across an organization’s data lifecycle?

How does the use of Schemata help with capturing and propagating context that would otherwise be lost or siloed?

How is the Schemata utility implemented?

What are some of the design and scope questions that you had to work through while developing Schemata?

What is the broad vision that you have for Schemata and its impact on data practices?

How are you balancing the need for flexibility/adaptability with the desire for ease of adoption and quick wins?

The core of the utility is the generation of structured messages. How are those messages propagated, stored, and analyzed?

What are the pieces of Schemata and its usage that are still undefined?

What are the most interesting, innovative, or unexpected ways that you have seen Schemata used?

What are the most interesting, unexpected, or challenging lessons that you have learned while working on Schemata?

When is Schemata the wrong choice?

What do you have planned for the future of Schemata?

Contact Info

ananthdurai on GitHub

@ananthdurai on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.

If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Schemata

Data Engineering Weekly

Zendesk

Ralph Kimball

Data Warehouse Toolkit

Iteratively (Podcast Episode)

Protocol Buffers (protobuf)

Application Tracing

OpenTelemetry

Django

Spring Framework

Dependency Injection

JSON Schema

dbt (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

By Tobias Macey
