Data Engineering Podcast

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection


Summary

Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service by benchmarking an application that has real-world utility.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?
    • What are some example cases where anomaly detection is useful or necessary?
  • Once you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?
  • What were your selection criteria for the various components of your system design?
    • What tools and technologies did you consider in your initial assessment and which did you ultimately converge on?
      • If you were to start over today would you do any of it differently?
  • Can you talk through the algorithm that you used for detecting anomalous activity?
    • What is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?
    • What were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?
  • What were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?
    • How did those bottlenecks differ as you moved through different levels of scale?
  • What were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?
  • What were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?
  • How have those lessons fed back to your work at Instaclustr?

Contact Info
  • LinkedIn
  • @paulbrebner_ on Twitter

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
  • Instaclustr
  • Kafka
  • Cassandra
  • Canberra, Australia
  • Spark
  • Anomaly Detection
  • Kubernetes
  • Prometheus
  • OpenTracing
  • Jaeger
  • The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
