For the Love of Data

E024 – Will machine learning kill traditional database indexes?


Listen Later

In this episode my friend Vikas Popuri and I chat about Google’s paper comparing ML models to traditional DB indexes.

Background:
  • Google used learned indexes , machine learning models, to access data and compared these to B-Tree, Hash, and Bloom Filter indices
  • Trained a model using multiple stages where the earlier stages could approximate a location and later stages would work with a subset to improve accuracy. Each stage could choose a different model to advance the search further.
  • FYI, the diagram below looks like a decision tree, but it is not. Each stage/model could have different distributions and could repeat the model used above or below.
    • They achieved access time and space savings across the board, even without using GPUs or TPUs (Tensor Processing Units)
    • “Retraining the model” – the tests were performed on a static data set, so no retraining or index maintenance was required.
    • Observations / Questions:
      • Used Tensorflow with Python as the front end — apparently a lot of initial overhead with this as a test stack.
      • B-Tree indexes to some extent are a model, especially if they don’t store every key and instead store the first key in a page.
      • The paper made some rudimentary assumptions, such as using a random hash function.
        • What if the data is not static? How long would it take to retrain the model vs. maintain an index?
        • What if data profiling caused you to index certain attributes and not others?
        • What are the best practices with this newer approach
        • The power of being able to use different models at different stages is intriguing. You could also potentially maintain traditional indexes as a backup / failsafe that would upper bound to the performance of a B-Tree.
        • Load times – The folks from Google commented that they could retrain a simple model on a 200M data set in “just [a] few seconds if implemented in C++”
        • Recursive question: do you need an optimizer to optimize the optimization path?
        • Room for improvement:
          • GPUs/TPUs
          • Incorporating common queries into the model to know what questions people are asking
          • Music

            Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org

            Sources:
            • http://learningsys.org/nips17/assets/papers/paper_22.pdf
            • http://blog.ezyang.com/2017/12/a-machine-learning-approach-to-database-indexes-alex-beutel/
            • https://towardsdatascience.com/what-if-i-told-you-database-indexes-could-be-learned-6cf8f59bff94
            • https://www.arxiv-vanity.com/papers/1712.01208/
            • https://www.oreilly.com/ideas/how-machine-learning-will-accelerate-data-management-systems
            • ...more
              View all episodesView all episodes
              Download on the App Store

              For the Love of DataBy For the Love of Data