Data Engineering Podcast

A Primer On Enterprise Data Curation with Todd Walter - Episode 49



Summary

As your data needs scale across an organization, a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode, Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations, and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.

Preamble
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • How do you define data curation?
    • What are some of the high level concerns that are encapsulated in that effort?
  • How does the size and maturity of a company affect the ways that they architect and interact with their data systems?
  • Can you walk through the stages of an ideal lifecycle for data within the context of an organization’s uses for it?
  • What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?
  • What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?
  • As “big data” became more widely discussed the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep?
  • In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?
    • What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?
  • Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?
  • ETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?
  • What are some of the areas of data architecture and curation that are most often forgotten or ignored?
  • What resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?

Contact Info
  • LinkedIn

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • Teradata
  • Data Architecture
  • Data Curation
  • Data Warehouse
  • Chief Data Officer
  • ETL (Extract, Transform, Load)
  • Data Lake
  • Metadata
  • Data Lineage
    • Data Provenance
  • Strata Conference
  • ELT (Extract, Load, Transform)
  • Map-Reduce
  • Hive
  • Pig
  • Spark
  • Data Governance

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


