Data Engineering Podcast

Achieving Data Reliability: The Role of Data Contracts in Modern Data Management



Summary
Data contracts are both an enforcement mechanism for data quality and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, their organizational impact, and the potential for standardization in the field.
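To make the "enforcement mechanism" idea concrete, here is a minimal sketch in plain Python of a contract check that gates publication of a dataset. It is not Soda's syntax or API. The names `ColumnRule`, `ContractViolation`, `verify`, `publish_if_valid`, and `ORDERS_CONTRACT` are hypothetical and exist only for illustration, and the contract is assumed to cover nothing more than column presence, type, and nullability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRule:
    """Hypothetical rule for one column: expected Python type and nullability."""
    dtype: type
    nullable: bool = False


class ContractViolation(Exception):
    """Raised to stop the pipeline when a dataset breaks its contract."""


def verify(rows, contract):
    """Return a list of readable violations; an empty list means the contract holds."""
    violations = []
    for i, row in enumerate(rows):
        # Every column named in the contract must be present in the row.
        for name in sorted(contract.keys() - row.keys()):
            violations.append(f"row {i}: missing column '{name}'")
        for name, rule in contract.items():
            if name not in row:
                continue
            value = row[name]
            if value is None:
                if not rule.nullable:
                    violations.append(f"row {i}: '{name}' must not be null")
            elif not isinstance(value, rule.dtype):
                violations.append(
                    f"row {i}: '{name}' expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations


def publish_if_valid(rows, contract, publish):
    """Only hand data to consumers when the contract holds; otherwise fail loudly."""
    violations = verify(rows, contract)
    if violations:
        raise ContractViolation("; ".join(violations))
    publish(rows)


# Illustrative contract for a made-up "orders" dataset.
ORDERS_CONTRACT = {
    "order_id": ColumnRule(int),
    "customer_email": ColumnRule(str),
    "discount_code": ColumnRule(str, nullable=True),
}
```

Calling `publish_if_valid(rows, ORDERS_CONTRACT, publish)` as the last step of a producing job means bad data raises an error instead of reaching downstream consumers, which is the promise side of the contract. A real contract tool would typically read the rules from a declarative file and run the checks against the warehouse rather than against in-memory rows.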

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
  • Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe the scope and purpose of data contracts in the context of this conversation?
  • In what way(s) do they differ from data quality/data observability?
  • Data contracts are also known as the API for data. Can you elaborate on this?
  • What are the types of guarantees and requirements that you can enforce with these data contracts?
  • What are some examples of constraints or guarantees that cannot be represented in these contracts?
  • Are data contracts related to the shift-left approach?
  • The obvious application of data contracts is in pipeline execution flows, where they prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
  • How did you approach the design of the syntax and implementation for Soda's data contracts?
  • Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap with, e.g., dbt and Great Expectations?
  • Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
  • What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
  • When are data contracts the wrong choice?
  • What do you have planned for the future of data contracts?
Contact Info
  • LinkedIn
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
  • Soda
  • Podcast Episode
  • JBoss
  • Data Contract
  • Airflow
  • Unit Testing
  • Integration Testing
  • OpenAPI
  • GraphQL
  • Circuit Breaker Pattern
  • SodaCL
  • Soda Data Contracts
  • Data Mesh
  • Great Expectations
  • dbt Unit Tests
  • Open Data Contracts
  • ODCS == Open Data Contract Standard
  • ODPS == Open Data Product Specification
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Data Engineering Podcast by Tobias Macey
