January 31, 2018

E024 – Will machine learning kill traditional database indexes?

23 minutes

In this episode my friend Vikas Popuri and I chat about Google’s paper comparing ML models to traditional DB indexes.

Background:

Google used learned indexes , machine learning models, to access data and compared these to B-Tree, Hash, and Bloom Filter indices

Trained a model using multiple stages where the earlier stages could approximate a location and later stages would work with a subset to improve accuracy. Each stage could choose a different model to advance the search further.

FYI, the diagram below looks like a decision tree, but it is not. Each stage/model could have different distributions and could repeat the model used above or below.

They achieved access time and space savings across the board, even without using GPUs or TPUs (Tensor Processing Units)

“Retraining the model” – the tests were performed on a static data set, so no retraining or index maintenance was required.

Observations / Questions:

Used Tensorflow with Python as the front end — apparently a lot of initial overhead with this as a test stack.

B-Tree indexes to some extent are a model, especially if they don’t store every key and instead store the first key in a page.

The paper made some rudimentary assumptions, such as using a random hash function.

What if the data is not static? How long would it take to retrain the model vs. maintain an index?

What if data profiling caused you to index certain attributes and not others?

What are the best practices with this newer approach

The power of being able to use different models at different stages is intriguing. You could also potentially maintain traditional indexes as a backup / failsafe that would upper bound to the performance of a B-Tree.

Load times – The folks from Google commented that they could retrain a simple model on a 200M data set in “just [a] few seconds if implemented in C++”

Recursive question: do you need an optimizer to optimize the optimization path?

Room for improvement:

GPUs/TPUs

Incorporating common queries into the model to know what questions people are asking

Music

Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

Sources:

http://learningsys.org/nips17/assets/papers/paper_22.pdf

http://blog.ezyang.com/2017/12/a-machine-learning-approach-to-database-indexes-alex-beutel/

https://towardsdatascience.com/what-if-i-told-you-database-indexes-could-be-learned-6cf8f59bff94

https://www.arxiv-vanity.com/papers/1712.01208/

https://www.oreilly.com/ideas/how-machine-learning-will-accelerate-data-management-systems

...more

View all episodes

By For the Love of Data

January 31, 2018

E024 – Will machine learning kill traditional database indexes?

23 minutes

In this episode my friend Vikas Popuri and I chat about Google’s paper comparing ML models to traditional DB indexes.

Background:

Google used learned indexes , machine learning models, to access data and compared these to B-Tree, Hash, and Bloom Filter indices

FYI, the diagram below looks like a decision tree, but it is not. Each stage/model could have different distributions and could repeat the model used above or below.

They achieved access time and space savings across the board, even without using GPUs or TPUs (Tensor Processing Units)

“Retraining the model” – the tests were performed on a static data set, so no retraining or index maintenance was required.

Observations / Questions:

Used Tensorflow with Python as the front end — apparently a lot of initial overhead with this as a test stack.

B-Tree indexes to some extent are a model, especially if they don’t store every key and instead store the first key in a page.

The paper made some rudimentary assumptions, such as using a random hash function.

What if the data is not static? How long would it take to retrain the model vs. maintain an index?

What if data profiling caused you to index certain attributes and not others?

What are the best practices with this newer approach

Load times – The folks from Google commented that they could retrain a simple model on a 200M data set in “just [a] few seconds if implemented in C++”

Recursive question: do you need an optimizer to optimize the optimization path?

Room for improvement:

GPUs/TPUs

Incorporating common queries into the model to know what questions people are asking

Music

Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

Sources:

http://learningsys.org/nips17/assets/papers/paper_22.pdf

http://blog.ezyang.com/2017/12/a-machine-learning-approach-to-database-indexes-alex-beutel/

https://towardsdatascience.com/what-if-i-told-you-database-indexes-could-be-learned-6cf8f59bff94

https://www.arxiv-vanity.com/papers/1712.01208/

https://www.oreilly.com/ideas/how-machine-learning-will-accelerate-data-management-systems

...more

Share E024 – Will machine learning kill traditional database indexes?

Sign up to save your podcasts

E024 – Will machine learning kill traditional database indexes?

E024 – Will machine learning kill traditional database indexes?