Data Engineering Podcast

Exploring The TileDB Universal Data Engine



Summary

Most databases are designed to work with textual data, with some special-purpose engines that support domain-specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shift in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB Cloud to embed authorization into the storage engine while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture, and about how that approach enables a more flexible way to store and interact with data, better data sharing, and new opportunities for blending specialized domains.
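
To make the array idea concrete, here is a minimal sketch of a sparse two-dimensional array in plain Python. It does not use TileDB's API and is not how TileDB is implemented; the class SparseArray2D, its write and slice methods, and the temperature attribute are hypothetical names chosen for illustration. It only shows the general idea of storing non-empty cells by coordinate and answering range queries over the dimensions.

```python
from typing import Any, Dict, List, Tuple

Coord = Tuple[int, int]


class SparseArray2D:
    """Toy sparse 2-D array (hypothetical, for illustration only).

    Only non-empty cells are stored, keyed by their (row, col) coordinates.
    Each cell holds a dict of named attribute values.
    """

    def __init__(self) -> None:
        self._cells: Dict[Coord, Dict[str, Any]] = {}

    def write(self, row: int, col: int, **attrs: Any) -> None:
        # Writing to an existing coordinate overwrites the previous cell.
        self._cells[(row, col)] = attrs

    def slice(self, rows: range, cols: range) -> List[Tuple[Coord, Dict[str, Any]]]:
        # Return the non-empty cells inside the rectangle, in row-major order.
        hits = [
            (coord, attrs)
            for coord, attrs in self._cells.items()
            if coord[0] in rows and coord[1] in cols
        ]
        return sorted(hits, key=lambda item: item[0])


arr = SparseArray2D()
arr.write(2, 5, temperature=21.5)
arr.write(7, 1, temperature=19.0)
arr.write(3, 4, temperature=22.1)

# Query rows 0..4 and columns 0..9; the cell at (7, 1) falls outside.
result = arr.slice(range(0, 5), range(0, 10))
print([coord for coord, _ in result])  # [(2, 5), (3, 4)]
```

The same shape can stand in for a table (dimensions act as the indexed columns, attributes as the remaining columns) or for a matrix, which is one way to picture a single array model covering several kinds of data.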

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What advice do you wish you had received early in your data engineering career? If you were to hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
  • Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
  • Your host is Tobias Macey and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what TileDB is and the problem that you are trying to solve with it?
    • What was your motivation for building it?
  • What are the main use cases or problem domains that you are trying to solve for?
    • What are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?
  • What are the benefits of using matrices for data processing and domain modeling?
    • What are the challenges that you have faced in storing and processing sparse matrices efficiently?
    • How does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?
  • What are the benefits of unbundling the storage engine from the processing layer?
  • Can you describe how TileDB Embedded is architected?
    • How has the design evolved since you first began working on it?
  • What is your approach to integrating with the broader ecosystem of data storage and processing utilities?
  • What does the workflow look like for someone using TileDB?
  • What is required to deploy TileDB in a production context?
  • How is the built-in data versioning implemented?
    • What is the user experience for interacting with different versions of datasets?
    • How do you manage the lifecycle of versioned data to allow garbage collection?
  • How are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?
  • What are the most interesting, unexpected, or innovative ways that you have seen TileDB used?
  • What have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?
  • What features or capabilities are you consciously deciding not to implement?
  • When is TileDB the wrong choice?
  • What do you have planned for the future of TileDB?

Contact Info
  • LinkedIn
  • stavrospapadopoulos on GitHub

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • TileDB
    • GitHub
  • Data Frames
  • TileDB Cloud
  • MIT
  • Intel
  • Sparse Linear Algebra
  • Sparse Matrices
  • HDF5
  • Dask
  • Spark
  • MariaDB
  • PrestoDB
  • GDAL
  • PDAL
  • Turing Complete
  • Clustered Index
  • Parquet File Format
    • Podcast Episode
  • Serializability
  • Delta Lake
    • Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
