Data Engineering Podcast

Build A Data Lake For Your Security Logs With Scanner



Summary

Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. Most available products either require too much effort to structure the logs or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high-scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Scanner is and the story behind it?
    • What were the shortcomings of other tools that are available in the ecosystem?
  • What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)
  • A query engine is useless without data to analyze. What are the data acquisition paths/sources that you are designed to work with? (e.g. CloudTrail logs, app logs, etc.)
    • What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?
  • Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?
  • Can you describe the architecture of the Scanner platform?
    • What were the motivating constraints that led you to your current implementation?
    • How have the design and goals of the product changed since you first started working on it?
  • Given the security-oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?
  • What are the personas of the end-users for Scanner?
    • How has that influenced the way that you think about the query formats, APIs, user experience, etc. for the product?
  • For teams who are working with Scanner, can you describe how it fits into their workflow?
  • What are the most interesting, innovative, or unexpected ways that you have seen Scanner used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?
  • When is Scanner the wrong choice?
  • What do you have planned for the future of Scanner?
Contact Info
  • LinkedIn
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
  • Scanner
  • cURL
  • Rust
  • Splunk
  • S3
  • AWS Athena
  • Loki
  • Snowflake
    • Podcast Episode
  • Presto
  • [Trino](https://trino.io/)
  • AWS CloudTrail
  • GitHub Audit Logs
  • Okta
  • Cribl
  • Vector.dev
  • Tines
  • Torq
  • Jira
  • Linear
  • ECS Fargate
  • SQS
  • Monoid
  • Group Theory
  • Avro
  • Parquet
  • OCSF
  • VPC Flow Logs
  • The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

  • Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png)
    This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics.
    Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
