Data Engineering Podcast

Enabling Version Controlled Data Collaboration With TerminusDB


Summary

As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem firsthand while working on the Seshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
  • You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
  • Your host is Tobias Macey and today I’m interviewing Gavin Mendel-Gleason about TerminusDB, an open source, model-driven graph database for knowledge graph representation

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what TerminusDB is and what motivated you to build it?
  • What are the use cases that TerminusDB and TerminusHub are designed for?
  • There are a number of different reasons and methods for versioning data, such as the work being done with Datomic, LakeFS, DVC, etc. Where does TerminusDB fit in relation to those and other data versioning systems that are available today?
  • Can you describe how TerminusDB is implemented?
    • How has the design changed or evolved since you first began working on it?
    • What were the decision process and design considerations that led you to choose Prolog as the implementation language?
    • One of the challenges that have faced other knowledge engines built around RDF is that of scale and performance. How are you addressing those difficulties in TerminusDB?
    • What are the scaling factors and limitations for TerminusDB? (e.g. volumes of data, clustering, etc.)
  • How does the use of RDF triples and JSON-LD impact the audience for TerminusDB?
  • How much overhead is incurred by maintaining a long history of changes for a database?
    • How do you handle garbage collection/compaction of versions?
  • How does the availability of branching and merging strategies change the approach that data teams take when working on a project?
  • What are the edge cases in merging and conflict resolution, and what tools does TerminusDB/TerminusHub provide for working through those situations?
  • What are some useful strategies that teams should be aware of for working effectively with collaborative datasets in TerminusDB?
  • Another interesting element of the TerminusDB platform is the query language. What did you use as inspiration for designing it and how much of a learning curve is involved?
  • What are some of the most interesting, innovative, or unexpected ways that you have seen TerminusDB used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while building TerminusDB and TerminusHub?
  • When is TerminusDB the wrong choice?
  • What do you have planned for the future of the project?

Contact Info
  • @GavinMGleason on Twitter
  • LinkedIn
  • GavinMendelGleason on GitHub

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • TerminusDB
  • TerminusHub
  • Chem Informatics
  • Type Theory
  • Graph Database
  • Trinity College Dublin
  • Seshat Databank: analytics over civilizations in history
  • PostgreSQL
  • DGraph
  • Grakn
  • Neo4J
  • Datomic
  • LakeFS
  • DVC
  • Dolt
  • Persistent Succinct Data Structure
  • Currying
  • Prolog
  • WOQL: TerminusDB query language
  • RDF
  • JSON-LD
  • Semantic Web
  • Property Graph
  • Hypergraph
  • Super Node
  • Bloom Filters
  • Data Curation
    • Podcast Episode
  • CRDT == Conflict-Free Replicated Data Types
    • Podcast Episode
  • SPARQL
  • Datalog
  • AST == Abstract Syntax Tree

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
