The Python Podcast.__init__

Unleash The Power Of Dataframes At Any Scale With Modin


Summary

When you start working on a data project, there are always unknown factors that you have to explore. Among them are the total volume of data that you will eventually need to handle and the speed and scale at which it will need to be processed. If you optimize for scale too early, the complexity of distributed systems raises the barrier to entry, but if you invest in a lot of engineering up front without accounting for scale, it can be challenging to refactor later. Modin is a project that aims to remove that decision by serving as a drop-in replacement for Pandas in your existing code, scaling across CPU cores or across a cluster of machines. In this episode Devin Petersohn explains why he started working on this problem, how Modin is architected to allow for a smooth escalation from small to large volumes of data and compute, and how you can start using it today to accelerate your Pandas workflows.

Announcements
  • Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Devin Petersohn about Modin, a Pandas-compatible dataframe library for datasets from 1MB to 1TB+.

Interview
  • Introductions
  • How did you get introduced to Python?
  • Can you describe what Modin is and the story behind it?
    • Why study dataframes?
    • How do dataframes compare to databases?
      • What can you do in a dataframe that you couldn’t in a database?
  • What are your overall goals for the Modin project?
  • Who are the target users of Modin and how does that influence your prioritization of features?
  • What are some of the API inconsistencies that you have had to abstract and work around between Pandas, Ray, and Dask to give users a seamless experience?
  • What are some of the considerations in terms of capabilities or user experience that will influence whether to use Ray or Dask as the execution engine?
  • Can you describe how Modin is implemented?
    • How has the constraint of replicating the Pandas API influenced your architectural choices?
  • What are the most complex or challenging Pandas APIs to replicate in Modin?
  • In addition to the core Pandas API you have also added experimental features such as SQL support and a spreadsheet interface. How have those capabilities affected the range of potential use cases and end users?
  • What are some of the complexities that come from acting as a middleware between the Pandas API and the Ray and Dask frameworks?
  • What are some of the initial ideas or assumptions that you had about the design or utility of Modin that have been challenged as you worked through building and releasing it?
  • What are the most interesting, innovative, or unexpected ways that you have seen Modin used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Modin?
  • When is Modin the wrong choice?
  • What do you have planned for the future of Modin?

Keep In Touch
  • devin-petersohn on GitHub
  • LinkedIn

Picks
  • Tobias
    • xxh
  • Devin
    • Lux
      • Podcast Episode

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links
  • Modin
  • UC Berkeley
  • RISELAB
  • XArray
  • Pandas
    • Podcast Episode
  • Dask
    • Podcast Episode
  • Ray
    • Podcast Episode
  • Spark

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

The Python Podcast.__init__ by Tobias Macey
