Data Engineering Podcast

Accelerate Your Machine Learning With The StreamSQL Feature Store



Summary

Machine learning is a process driven by iteration and experimentation, which requires fast and easy access to relevant features of the data being processed. To reduce friction in developing and delivering models, there has been a recent trend toward building a dedicated feature store. In this episode, Simba Khadder discusses his work at StreamSQL building a feature store that makes the creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host is Tobias Macey and today I’m interviewing Simba Khadder about his views on the importance of ML feature stores, and his experience implementing one at StreamSQL

Interview
  • Introduction
  • How did you get involved in the areas of machine learning and data management?
  • What is StreamSQL and what motivated you to start the business?
  • Can you describe what a machine learning feature is?
  • What is the difference between generating features for training a model and generating features for serving?
  • How is feature management typically handled today?
  • What is a feature store and how is it different from the status quo?
  • What is the overall lifecycle of identifying useful features, defining and generating them, using them for training, and then serving them in production?
  • How does the usage of a feature store impact the workflow of ML engineers/data scientists and data engineers?
  • What are the general requirements of a feature store?
  • What additional capabilities or tangential services are necessary for providing a pleasant UX for a feature store?
    • How is discovery and documentation of features handled?
  • What is the current landscape of feature stores and how does StreamSQL compare?
  • How is the StreamSQL feature store implemented?
    • How is the supporting infrastructure architected and how has it evolved since you first began working on it?
  • Why is streaming data such a focal point of feature stores?
  • How do you generate features for training?
  • How do you approach monitoring of features and what does remediation look like for a feature that is no longer valid?
  • How do you handle versioning and deploying features?
  • What’s the process for integrating data sources into StreamSQL for processing into features?
  • How are the features materialized?
  • What are the most challenging or complex aspects of working on or with a feature store?
  • When is StreamSQL the wrong choice for a feature store?
  • What are the most interesting, challenging, or unexpected lessons that you have learned in the process of building StreamSQL?
  • What do you have planned for the future of the product?

Contact Info
  • LinkedIn
  • @simba_khadder on Twitter

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
  • StreamSQL
  • Feature Stores for ML
  • Distributed Systems
  • Google Cloud Datastore
  • Triton
  • Uber Michelangelo
  • AirBnB Zipline
  • Lyft Dryft
  • Apache Flink
    • Podcast Episode
  • Apache Kafka
  • Spark Streaming
  • Apache Cassandra
  • Redis
  • Apache Pulsar
    • Podcast Episode
    • StreamNative Episode
  • TDD == Test Driven Development
  • Lyft presentation – Bootstrapping Flink
  • Go-Jek Feast
  • Hopsworks
  • The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
