In this episode my friend Vikas Popuri and I chat about Google’s paper comparing ML models to traditional DB indexes.
Background:
Google used learned indexes , machine learning models, to access data and compared these to B-Tree, Hash, and Bloom Filter indicesTrained a model using multiple stages where the earlier stages could approximate a location and later stages would work with a subset to improve accuracy. Each stage could choose a different model to advance the search further.FYI, the diagram below looks like a decision tree, but it is not. Each stage/model could have different distributions and could repeat the model used above or below.They achieved access time and space savings across the board, even without using GPUs or TPUs (Tensor Processing Units)“Retraining the model” – the tests were performed on a static data set, so no retraining or index maintenance was required.Observations / Questions:
Used Tensorflow with Python as the front end — apparently a lot of initial overhead with this as a test stack.B-Tree indexes to some extent are a model, especially if they don’t store every key and instead store the first key in a page.The paper made some rudimentary assumptions, such as using a random hash function.What if the data is not static? How long would it take to retrain the model vs. maintain an index?What if data profiling caused you to index certain attributes and not others?What are the best practices with this newer approachThe power of being able to use different models at different stages is intriguing. You could also potentially maintain traditional indexes as a backup / failsafe that would upper bound to the performance of a B-Tree.Load times – The folks from Google commented that they could retrain a simple model on a 200M data set in “just [a] few seconds if implemented in C++”Recursive question: do you need an optimizer to optimize the optimization path?Room for improvement:GPUs/TPUsIncorporating common queries into the model to know what questions people are askingMusic
Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org
Sources:
http://learningsys.org/nips17/assets/papers/paper_22.pdf http://blog.ezyang.com/2017/12/a-machine-learning-approach-to-database-indexes-alex-beutel/ https://towardsdatascience.com/what-if-i-told-you-database-indexes-could-be-learned-6cf8f59bff94 https://www.arxiv-vanity.com/papers/1712.01208/ https://www.oreilly.com/ideas/how-machine-learning-will-accelerate-data-management-systems