How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, talks about how machine learning (ML) models understand text.
Jodie explains how ML models require data in a structured format, which involves transforming text documents into columns and rows. She covers the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the tools of stemming, lemmatization, and count vectorization.
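As a rough illustration of the rows-and-columns idea described above, here is a minimal pure-Python sketch of binary and count vectorization. The toy corpus, the whitespace tokenizer, and every function name are assumptions made up for this example; they are not taken from the episode, and a real project would more likely use a library vectorizer.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(docs):
    """Return the sorted list of unique tokens across all documents."""
    return sorted({token for doc in docs for token in tokenize(doc)})


def count_vectorize(docs, vocabulary):
    """One row per document, one column per vocabulary term, cells are counts."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return rows


def binary_vectorize(docs, vocabulary):
    """Same layout as count_vectorize, but cells only record presence (1) or absence (0)."""
    return [
        [1 if count > 0 else 0 for count in row]
        for row in count_vectorize(docs, vocabulary)
    ]


vocabulary = build_vocabulary(documents)
count_matrix = count_vectorize(documents, vocabulary)
binary_matrix = binary_vectorize(documents, vocabulary)

print(vocabulary)
print(count_matrix)
print(binary_matrix)
```

Each document becomes a row and each vocabulary term becomes a column, which is the structured format a model can consume. scikit-learn's CountVectorizer, listed in the show links, covers both variants; its binary parameter switches between counts and presence flags.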
We jump into word embedding models next. Jodie talks about WordNet, Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for starting with text classification, implementing sentiment analysis, and building projects using these tools. Jodie also shares multiple resources to help you continue exploring NLP and modeling.
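To give a feel for how word embeddings are typically compared, the sketch below measures cosine similarity between word vectors. The three-dimensional vectors are made-up values chosen only for illustration, not output from word2vec or Gensim, and the helper names are hypothetical. Trained embeddings usually have far more dimensions.

```python
import math

# Made-up 3-dimensional vectors, invented purely for illustration.
# Real word2vec embeddings are learned from a corpus and are much larger.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings):
    """Return the other word whose vector has the highest cosine similarity."""
    scores = {
        other: cosine_similarity(embeddings[word], vector)
        for other, vector in embeddings.items()
        if other != word
    }
    return max(scores, key=scores.get)


print(most_similar("cat", toy_embeddings))
```

With these toy values, "cat" sits closer to "dog" than to "car", which mirrors the idea that words used in similar contexts end up with similar vectors.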
Course Spotlight: Learn Text Classification With Python and Keras
In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.
Topics:
00:00:00 – Introduction
00:02:47 – Exploring the topic
00:06:00 – Perceived sentience of LaMDA
00:10:24 – How do we get started?
00:11:16 – What are classification and sentiment analysis?
00:13:03 – Transforming text into rows and columns
00:14:47 – Sponsor: Snyk
00:15:27 – Bag-of-words approach
00:19:12 – Stemming and lemmatization
00:22:05 – Capturing N-grams
00:25:34 – Count vectorization
00:27:14 – Stop words
00:28:46 – Term Frequency / Inverse Document Frequency (TF-IDF) vectorization
00:32:28 – Potential projects for bag-of-words techniques
00:34:07 – Video Course Spotlight
00:35:20 – WordNet and NLTK package
00:37:27 – Word embeddings and word2vec
00:45:30 – Previous training and too many dimensions
00:50:07 – How to use word2vec and Gensim?
00:51:26 – What types of projects for word2vec and Gensim?
00:54:41 – Getting into GPT and BERT in another episode
00:56:11 – How to follow Jodie’s work?
00:57:36 – Thanks and goodbye

Show Links:
Why Google’s “sentient” AI LaMDA is nothing like a person.
On NYT Magazine on AI: Resist the Urge to be Impressed | Emily M. Bender | Medium
ELIZA – Wikipedia
eliza.py – Python 2 version by Daniel Connelly
dabraude/Pyliza: Python3 Implementation of Eliza
magneticpoetry.com
Natural Language Processing With Python’s NLTK Package – Real Python
Practical Text Classification With Python and Keras – Real Python
Sentiment Analysis: First Steps With Python’s NLTK Library – Real Python
NLTK: Natural Language Toolkit
spaCy · Industrial-strength Natural Language Processing in Python
Natural Language Processing With spaCy in Python – Real Python
Stemming – Wikipedia
Lemmatization – Wikipedia
Binary/Count Vectorization: sklearn.feature_extraction.text.CountVectorizer – scikit-learn
TF-IDF: sklearn.feature_extraction.text.TfidfVectorizer – scikit-learn
Porter Stemmer: nltk.stem.porter module – NLTK
Snowball Stemmer: nltk.stem.snowball module – NLTK
WordNet Lemmatizer: nltk.stem.wordnet module – NLTK
Lemmatizer · spaCy API Documentation
Applying Bag of Words and Word2Vec models on Reuters-21578 Dataset – Elvin Ouyang’s Blog
UCI Machine Learning Repository: Reuters-21578 Text Categorization Collection Data Set
The Illustrated Word2vec – Jay Alammar
A Complete Guide to Using WordNET in NLP Applications
Gensim: Topic modeling for humans
Core Tutorials – gensim
Find Open Datasets and Machine Learning Projects | Kaggle
Engineering All Hands: Vectorise all the things! – YouTube
PyCon Portugal 2022
NDC Oslo 2022 | Conference for Software Developers
Jodie Burchell’s Blog – Standard error
Jodie Burchell 🇦🇺🇩🇪 (@t_redactyl) / Twitter
JetBrains: Essential tools for software developers and teams

Level up your Python skills with our expert-led courses:
Data Cleaning With pandas and NumPy
Reading and Writing Files With pandas
Learn Text Classification With Python and Keras

Support the podcast & join our community of Pythonistas