July 29, 2022

Natural Language Processing and How ML Models Understand Text

58 minutes

How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, talks about how machine learning (ML) models understand text.

Jodie explains how ML models require data in a structured format, which involves transforming text documents into columns and rows. She covers the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the tools of stemming, lemmatization, and count vectorization.
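As a rough illustration of turning documents into rows and columns, here is a minimal plain-Python sketch of binary and count vectorization. The helper names (`tokenize`, `build_vocabulary`, `vectorize`) and the two sample sentences are assumptions made for this example. They are not taken from the episode, and this is not how scikit-learn's `CountVectorizer` is implemented.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]


def build_vocabulary(documents):
    """Collect every distinct token; each one becomes a column."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(documents, binary=False):
    """Return (vocabulary, rows), with one row per document.

    binary=True records only presence (1) or absence (0) of a word.
    binary=False records how many times each word occurs.
    """
    vocabulary = build_vocabulary(documents)
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        if binary:
            rows.append([1 if counts[term] else 0 for term in vocabulary])
        else:
            rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical sample documents, chosen only for illustration.
docs = ["The cat sat on the mat.", "The dog chased the cat!"]
vocab, count_rows = vectorize(docs)
_, binary_rows = vectorize(docs, binary=True)

print(vocab)
print(count_rows)
print(binary_rows)
```

Word order is discarded in both representations, which is why the approach is called a bag of words. The only difference between the two outputs is the column for "the", which appears twice in each sentence.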

We jump into word embedding models next. Jodie talks about WordNet, Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for starting with text classification, implementing sentiment analysis, and building projects using these tools. Jodie also shares multiple resources to help you continue exploring NLP and modeling.
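To give a feel for how word embeddings are typically compared, the sketch below measures cosine similarity between small vectors. The three-dimensional vectors are made-up values for illustration only. Real word2vec embeddings are learned from a corpus and usually have far more dimensions, and the `most_similar` helper here is a hypothetical stand-in, not Gensim's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors (an assumption, not trained embeddings).
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word`."""
    target = embeddings[word]
    candidates = (other for other in embeddings if other != word)
    return max(
        candidates,
        key=lambda other: cosine_similarity(target, embeddings[other]),
    )


print(most_similar("king", toy_embeddings))
```

With these toy values, "king" sits much closer to "queen" than to "apple". Trained embedding models aim to produce that kind of geometry from the contexts in which words appear.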

Course Spotlight: Learn Text Classification With Python and Keras

In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.

Topics:

00:00:00 – Introduction

00:02:47 – Exploring the topic

00:06:00 – Perceived sentience of LaMDA

00:10:24 – How do we get started?

00:11:16 – What are classification and sentiment analysis?

00:13:03 – Transforming text into rows and columns

00:14:47 – Sponsor: Snyk

00:15:27 – Bag-of-words approach

00:19:12 – Stemming and lemmatization

00:22:05 – Capturing N-grams

00:25:34 – Count vectorization

00:27:14 – Stop words

00:28:46 – Term Frequency / Inverse Document Frequency (TFIDF) vectorization

00:32:28 – Potential projects for bag-of-words techniques

00:34:07 – Video Course Spotlight

00:35:20 – WordNet and NLTK package

00:37:27 – Word embeddings and word2vec

00:45:30 – Previous training and too many dimensions

00:50:07 – How to use word2vec and Gensim?

00:51:26 – What types of projects for word2vec and Gensim?

00:54:41 – Getting into GPT and BERT in another episode

00:56:11 – How to follow Jodie’s work?

00:57:36 – Thanks and goodbye
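
Several of the topics above (N-grams, stop words, and TFIDF) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the stop-word list is a tiny hypothetical one, the sample documents are invented, and the weighting uses the textbook form tf × log(N / df). scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "on", "is"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = (word.strip(".,!?").lower() for word in text.split())
    return [t for t in tokens if t and t not in stop_words]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Invented sample documents, for illustration only.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The dog is on the mat.",
]
scores = tf_idf(docs)
bigrams = ngrams(tokenize(docs[0]), 2)

print(bigrams)
print(scores[0])
```

In the first document, "sat" receives a higher weight than "cat" because "sat" occurs in only one document while "cat" occurs in two. Down-weighting common terms in this way is the point of the inverse document frequency factor.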

Show Links:

Why Google’s “sentient” AI LaMDA is nothing like a person.

On NYT Magazine on AI: Resist the Urge to be Impressed | Emily M. Bender | Medium

ELIZA - Wikipedia

eliza.py - Python 2 version by Daniel Connelly

dabraude/Pyliza: Python3 Implementation of Eliza

magneticpoetry.com

Natural Language Processing With Python’s NLTK Package – Real Python

Practical Text Classification With Python and Keras – Real Python

Sentiment Analysis: First Steps With Python’s NLTK Library – Real Python

NLTK: Natural Language Toolkit

spaCy · Industrial-strength Natural Language Processing in Python

Natural Language Processing With spaCy in Python - Real Python

Stemming - Wikipedia

Lemmatization - Wikipedia

Binary/Count Vectorization: sklearn.feature_extraction.text.CountVectorizer — scikit-learn

TFIDF: sklearn.feature_extraction.text.TfidfVectorizer — scikit-learn

Porter Stemmer: nltk.stem.porter module — NLTK

Snowball Stemmer: nltk.stem.snowball module — NLTK

WordNet Lemmatizer: nltk.stem.wordnet module — NLTK

Lemmatizer · spaCy API Documentation

Applying Bag of Words and Word2Vec models on Reuters-21578 Dataset – Elvin Ouyang’s Blog

UCI Machine Learning Repository: Reuters-21578 Text Categorization Collection Data Set

The Illustrated Word2vec – Jay Alammar

A Complete Guide to Using WordNET in NLP Applications

Gensim: Topic modeling for humans

Core Tutorials — gensim

Find Open Datasets and Machine Learning Projects | Kaggle

Engineering All Hands: Vectorise all the things! - YouTube

PyCon Portugal 2022

NDC Oslo 2022 | Conference for Software Developers

Jodie Burchell’s Blog - Standard error

Jodie Burchell 🇦🇺🇩🇪 (@t_redactyl) / Twitter

JetBrains: Essential tools for software developers and teams

Level up your Python skills with our expert-led courses:

Data Cleaning With pandas and NumPy

Reading and Writing Files With pandas

Learn Text Classification With Python and Keras

Support the podcast & join our community of Pythonistas


By Real Python
