Data Engineering Podcast

Building Auditable Spark Pipelines At Capital One


Summary

Spark is a powerful, battle-tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data, Capital One has invested in it for their business needs. In this episode, Gokul Prabagaren shares how he uses it to calculate your rewards points, including the auditing requirements and how he designed his pipeline to retain all of the necessary information through a pattern of data enrichment.
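
The enrichment idea described in this summary can be illustrated with a minimal sketch. It uses plain Python rather than Spark so that it runs on its own, and every name in it (`Txn`, `flag_non_positive`, `flag_excluded_category`, the `cash_advance` category, the sample rows) is a hypothetical stand-in, not Capital One's actual pipeline or rules. Instead of dropping ineligible transactions at each stage, each stage annotates the row with an eligibility flag and a reason, so every input row survives to the end and can be audited.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Txn:
    txn_id: str
    amount: float
    category: str
    # Enrichment columns: each stage fills these in instead of dropping the row.
    eligible: bool = True
    reasons: List[str] = field(default_factory=list)


def flag_non_positive(txns):
    """Mark transactions with a non-positive amount as ineligible (hypothetical rule)."""
    for t in txns:
        if t.amount <= 0:
            t.eligible = False
            t.reasons.append("non_positive_amount")
    return txns


def flag_excluded_category(txns, excluded=frozenset({"cash_advance"})):
    """Mark transactions in an excluded category as ineligible (hypothetical rule)."""
    for t in txns:
        if t.category in excluded:
            t.eligible = False
            t.reasons.append("excluded_category")
    return txns


txns = [
    Txn("t1", 25.0, "grocery"),
    Txn("t2", -10.0, "grocery"),
    Txn("t3", 40.0, "cash_advance"),
]

# Every stage returns all rows; nothing is filtered out along the way.
enriched = flag_excluded_category(flag_non_positive(txns))

# The final selection happens once, at the end.
rewarded = [t.txn_id for t in enriched if t.eligible]

# The audit view explains why each remaining row earned nothing.
audit = {t.txn_id: t.reasons for t in enriched if not t.eligible}
```

With a filtering design, `t2` and `t3` would have disappeared after the stage that rejected them; here they remain in `enriched` together with the reason for their rejection.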

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
  • Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
  • Your host is Tobias Macey and today I’m interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One

Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of the types of data and workflows that you are responsible for at Capital One?
    • In terms of the three "V"s (Volume, Variety, Velocity), what is the magnitude of the data that you are working with?
    • What are some of the business and regulatory requirements that have to be factored into the solutions that you design?
    • Who are the consumers of the data assets that you are producing?
  • Can you describe the technical elements of the platform that you use for managing your data pipelines?
  • What are the various ways that you are using Spark at Capital One?
  • You wrote a post and presented at the Databricks conference about your experience moving from a data filtering to a data enrichment pattern for segmenting transactions. Can you give some context as to the use case and what your design process was for the initial implementation?
    • What were the shortcomings to that approach/business requirements which led you to refactoring the approach to one that maintained all of the data through the different processing stages?
    • What are some of the impacts on data volumes and processing latencies working with enriched data frames persisted between task steps?
    • What are some of the other optimizations or improvements that you have made to that pipeline since you wrote the post?
  • What are some of the limitations of Spark that you have experienced during your work at Capital One?
    • How have you worked around them?
  • What are the most interesting, innovative, or unexpected ways that you have seen Spark used at Capital One?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering at Capital One?
  • What are some of the upcoming projects that you are focused on/excited for?
    • How has your experience with the filtering vs. enrichment approach influenced your thinking on other projects that you work on?

Contact Info
  • @gocool_p on Twitter

Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links
  • Apache Spark
  • Blog Post
  • Databricks Presentation
  • Delta Lake
    • Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
