Data Engineering Podcast

A Primer On Enterprise Data Curation with Todd Walter - Episode 49



Summary

As your data needs scale across an organization, a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode, Todd Walter shares his considerable experience in data curation to clarify the many aspects of a successful data platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations, and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.

Preamble
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • How do you define data curation?
    • What are some of the high-level concerns that are encapsulated in that effort?
  • How does the size and maturity of a company affect the ways that they architect and interact with their data systems?
  • Can you walk through the stages of an ideal lifecycle for data within the context of an organization's uses for it?
  • What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?
  • What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?
  • As “big data” became more widely discussed, the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry reaches a greater degree of maturity and more regulations are implemented, there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep?
  • In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?
    • What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?
  • Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?
  • ETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?
  • What are some of the areas of data architecture and curation that are most often forgotten or ignored?
  • What resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?

Contact Info
  • LinkedIn

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • Teradata
  • Data Architecture
  • Data Curation
  • Data Warehouse
  • Chief Data Officer
  • ETL (Extract, Transform, Load)
  • Data Lake
  • Metadata
  • Data Lineage
    • Data Provenance
  • Strata Conference
  • ELT (Extract, Load, Transform)
  • Map-Reduce
  • Hive
  • Pig
  • Spark
  • Data Governance

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Data Engineering Podcast, by Tobias Macey
