Data Engineering Podcast

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50



Summary

There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time-consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode, Chris Groskopf explains the platform they have built to consume a wide variety and large volume of public data and construct a graph that they serve to their customers. He discusses the challenges they face in scaling the platform and their engineering processes, as well as the workflow they have established for testing their ETL jobs. This is a great episode to listen to for ideas on how to structure a data engineering organization.

Preamble
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
    • How do you define the concept of a knowledge graph?
  • What are the processes involved in constructing a knowledge graph?
  • Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
  • What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
    • How do you manage the software lifecycle for your ETL code?
    • What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
  • What are the current challenges that you are facing in building and scaling your data infrastructure?
    • How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
    • What techniques are you using to manage accuracy and consistency in the data that you ingest?
  • Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
  • What are the weak spots in your platform that you are planning to address in upcoming projects?
    • If you were to start from scratch today, what would you have done differently?
  • What are some of the most interesting or unexpected uses of your product that you have seen?
  • What is in store for the future of Enigma?

Contact Info
  • Email
  • Twitter

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • Enigma
  • Chicago Tribune
  • NPR
  • Quartz
  • CSVKit
  • Agate
  • Knowledge Graph
  • Taxonomy
  • Concourse
  • Airflow
  • Docker
  • S3
  • Data Lake
  • Parquet
    • Podcast Episode
  • Spark
  • AWS Neptune
  • AWS Batch
  • Money Laundering
  • Jupyter Notebook
  • Papermill
  • Jupytext
  • Cauldron: The Un-Notebook
    • Podcast.__init__ Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
