Summary
Understanding the context of a piece of text is generally thought to be the domain of human intelligence. However, topic modeling and semantic analysis allow a computer to determine whether different messages and articles are about the same thing. This week we spoke with Radim Řehůřek about his work on Gensim, a Python library for performing unsupervised analysis of unstructured text and applying machine learning models to the problem of natural language understanding.
Brief Introduction
Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project.
We are also sponsored by Sentry this week. Stop hoping your users will report bugs. Sentry’s real-time tracking gives you insight into production deployments and information to reproduce and fix crashes. Check them out at getsentry.com and use the code podcastinit at signup to get a $50 credit on your account.
Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers.
Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
Your hosts as usual are Tobias Macey and Chris Patti.
Today we’re interviewing Radim Řehůřek about Gensim, a library for topic modeling and semantic analysis of natural language.
Interview with Radim Řehůřek
Introductions
How did you get introduced to Python? – Chris
Can you start by giving us an explanation of topic modeling and semantic analysis? – Tobias
What is Gensim and what inspired you to create it? – Tobias
What facilities does Gensim provide to simplify the work of this kind of language analysis? – Tobias
Can you describe the features that set it apart from other projects such as the NLTK or Spacy? – Tobias
What are some of the practical applications that Gensim can be used for? – Tobias
One of the features that stuck out to me is the fact that Gensim can process corpora on disk that would be too large to fit into memory. Can you explain some of the algorithmic work that was necessary to allow for this streaming process to be possible? – Tobias
Given that it can handle streams of data, could it also be used in the context of something like Spark? – Tobias
Gensim also supports unsupervised model building. What kinds of limitations does this have and when would you need a human in the loop? – Tobias
Once a model has been trained, how does it get saved and reloaded for subsequent use? – Tobias
What are some of the more unorthodox or interesting uses people have put Gensim to that you’ve heard about? – Chris
In addition to your work on Gensim, and partly due to its popularity, you have started a consultancy for customers who are interested in improving their data analysis capabilities. How does that feed back into Gensim? – Tobias
Are there any improvements in Gensim or other libraries that you have made available as a result of issues that have come up during client engagements? – Tobias
Is it difficult to find contributors to Gensim because of its advanced nature? – Tobias
Are there any resources you’d like to recommend our listeners explore to get a more in depth understanding of topic modeling and related techniques? – Chris
Keep In Touch
RaRe Technologies
Twitter
Email
Github
Mailing List
Picks
Tobias
Dark Matter and the Dinosaurs by Lisa Randall
m-cli
1177 BC: The Year Civilization Collapsed
Links
Nadia Eghbal
Gensim
SQL Addict
NLTK
Spacy
Latent Dirichlet Allocation (LDA)
LSI
Keynote in Italy on distributed processing
Google Scholar references for Gensim
Stylometric analysis
On Writing Well
Student Incubator
Wikipedia on topic modeling
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA