The Python Podcast.__init__

An Open Source Toolchain For Natural Language Processing From Explosion AI


Summary

The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting edge techniques have given way to robust language models. Through it all, the team at Explosion AI have built a strong presence with the trifecta of SpaCy, Thinc, and Prodigy, which together support fast and flexible data labeling to feed deep learning models, along with performant and scalable text processing. In this episode, founder and open source author Matthew Honnibal shares his experience growing a business around cutting edge open source libraries for the machine learning development process.

Announcements
  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East, which has also gone virtual, starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey, and today I’m interviewing Matthew Honnibal about the Thinc and Prodigy tools and an update on SpaCy.

Interview
  • Introductions
  • How did you get introduced to Python?
  • Can you start by giving an overview of your mission at Explosion?
  • We spoke previously about your work on SpaCy. What has changed in the past 3 1/2 years?
    • How have recent innovations in language models such as BERT and GPT-2 influenced the direction or implementation of the project?
  • When I last looked SpaCy only supported English and German, but you have added several new languages. What are the most challenging aspects of building the additional models?
    • What would be required for supporting symbolic or right-to-left languages?
  • How has the ecosystem for language processing in Python shifted or evolved since you first introduced SpaCy?
  • Another project that you have released is Prodigy to support labelling of datasets. Can you talk through the motivation for creating it and describe the workflow for someone using it?
    • What was lacking in the other annotation tools that you have worked with that you are trying to solve for in Prodigy?
  • What are some of the most challenging or problematic aspects of labelling data sets for use in machine learning projects?
    • What is a typical scale of data that can be reasonably handled by an individual or small team working with Prodigy?
      • At what point do you find that it makes sense to use a labeling service rather than generating the labels yourself?
  • Your most recent project is Thinc for building and using deep learning models. What was the motivation for creating it and what problem does it solve in the ecosystem?
    • How does its design and usage compare to other deep learning frameworks such as PyTorch and Tensorflow?
    • How does it compare to projects such as Keras that abstract across those frameworks?
  • How do the SpaCy, Prodigy, and Thinc libraries work together?
  • What are some of the biggest challenges that you are facing in building open source tools to meet the needs of data scientists and machine learning engineers?
  • What are some of the most interesting or impressive projects that you have seen built with the tools your team is creating?
  • What do you have planned for the future of Explosion, SpaCy, Prodigy, and Thinc?

Keep In Touch
  • LinkedIn
  • @honnibal on Twitter
  • honnibal on GitHub

Picks
  • Tobias
    • Onward movie
  • Matthew
    • Coronavirus Preparedness
    • Ray

Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast, for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links
  • Explosion AI
  • SpaCy
    • Podcast Episode
  • Thinc
  • Prodigy
  • Natural Language Processing
  • Perl
  • NLTK
  • GPU == Graphics Processing Unit
  • TPU == Tensor Processing Unit
  • Transfer Learning
  • Airflow
  • Luigi
  • Perceptron
  • PyTorch
  • Tensorflow
  • Functional Programming
  • MxNet
  • Keras
  • Cuda
  • C Language
  • Continuous Integration
  • Blackstone
  • Allen AI Institute
  • SciSpaCy
  • Holmes
  • Sense2Vec
  • FastAPI

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA


