August 17, 2020

Exploring The TileDB Universal Data Engine

1 hour 5 minutes

Summary

Most databases are designed to work with textual data, with some special-purpose engines that support domain-specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shift in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB Cloud to embed authorization into the storage engine while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how it enables a more flexible way to store and interact with data, powering better data sharing and new opportunities for blending specialized domains.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.

Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!

Your host is Tobias Macey, and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing what TileDB is and the problem that you are trying to solve with it?

What was your motivation for building it?

What are the main use cases or problem domains that you are trying to solve for?

What are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?

What are the benefits of using matrices for data processing and domain modeling?

What are the challenges that you have faced in storing and processing sparse matrices efficiently?

How does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?

What are the benefits of unbundling the storage engine from the processing layer?

Can you describe how TileDB Embedded is architected?

How has the design evolved since you first began working on it?

What is your approach to integrating with the broader ecosystem of data storage and processing utilities?

What does the workflow look like for someone using TileDB?

What is required to deploy TileDB in a production context?

How is the built-in data versioning implemented?

What is the user experience for interacting with different versions of datasets?

How do you manage the lifecycle of versioned data to allow garbage collection?

How are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?

What are the most interesting, unexpected, or innovative ways that you have seen TileDB used?

What have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?

What features or capabilities are you consciously deciding not to implement?

When is TileDB the wrong choice?

What do you have planned for the future of TileDB?

Contact Info

stavrospapadopoulos on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

TileDB

GitHub

Data Frames

TileDB Cloud

MIT

Intel

Sparse Linear Algebra

Sparse Matrices

HDF5

Dask

Spark

MariaDB

PrestoDB

GDAL

PDAL

Turing Complete

Clustered Index

Parquet File Format

Podcast Episode

Serializability

Delta Lake