August 20, 2018

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

42 minutes

Summary

The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph. However, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition, he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metadata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.

Python has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience as a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and using the discount code PodInit40!

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Your host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database.

Interview

Introduction

How did you get involved in the area of data management?

What is DGraph and what motivated you to build it?

Graph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph oriented storage systems?

The graph space has become crowded in recent years. How does DGraph compare to the current set of offerings?

What are some of the common uses of graph storage systems?

What are some potential uses that are often overlooked?

There are a few ways that graph structures and properties can be implemented, including the ability to store data in the edges connecting nodes and the structures that can be contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose?

How does the query interface and data storage in DGraph differ from other options?

What are your opinions on the graph query languages that have been adopted by other storage systems, such as Gremlin, Cypher, and GSQL?

How is DGraph architected and how has that architecture evolved from when it first started?

How do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write?

In your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition?

What are the limitations of DGraph in terms of scalability or usability?

Where does it fall along the axes of the CAP theorem?

For someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like?

What have been the most challenging aspects of building and growing the DGraph project and community?

What are some of the most interesting or unexpected uses of DGraph that you are aware of?

When is DGraph the wrong choice?

What are your plans for the future of DGraph?

Contact Info

@manishrjain on Twitter

manishrjain on GitHub

Blog

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

DGraph

Badger

Google Knowledge Graph

Graph Theory

Graph Database

SQL

Relational Database

NoSQL

OLTP (On-Line Transaction Processing)

Neo4J

PostgreSQL

MySQL

BigTable

Recommendation System

Fraud Detection

Customer 360

Usenet Express

IPFS

Gremlin

Cypher

GSQL

GraphQL

MetaWeb

RAFT

Spanner

HBase

Elasticsearch

Kubernetes

TLS (Transport Layer Security)

Jepsen Tests

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast


By Tobias Macey

