The Real Python Podcast

Natural Language Processing and How ML Models Understand Text


How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, talks about how machine learning (ML) models understand text.

Jodie explains how ML models require data in a structured format, which involves transforming text documents into columns and rows. She covers the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the tools of stemming, lemmatization, and count vectorization.
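
To make the "columns and rows" idea concrete, here is a minimal sketch of count and binary vectorization in plain Python. The function names and the two sample sentences are made up for illustration and do not come from the episode. In practice you would more likely reach for scikit-learn's CountVectorizer, which is linked in the show links below.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Collect every distinct token across all documents, sorted for stable columns."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def count_vectorize(documents, vocabulary):
    """One row per document, one column per vocabulary term, cells hold term counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return rows


def binary_vectorize(documents, vocabulary):
    """Same layout as count_vectorize, but cells only record presence (1) or absence (0)."""
    return [
        [1 if count > 0 else 0 for count in row]
        for row in count_vectorize(documents, vocabulary)
    ]


# Hypothetical sample documents, chosen only to keep the output small.
docs = ["The cat sat on the mat.", "The dog chased the cat!"]
vocab = build_vocabulary(docs)
counts = count_vectorize(docs, vocab)
binary = binary_vectorize(docs, vocab)

print(vocab)   # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(counts)  # [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
print(binary)  # [[1, 0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 1]]
```

Each row is a document and each column is a word, so word order is discarded. That is why the approach is called a bag of words. This sketch skips stemming and lemmatization, which would merge related word forms into a single column before counting.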

We jump into word embedding models next. Jodie talks about WordNet, Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for starting with text classification, implementing sentiment analysis, and building projects using these tools. Jodie also shares multiple resources to help you continue exploring NLP and modeling.
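
As a rough illustration of what word embeddings make possible, the sketch below compares words by the cosine similarity of their vectors. The three-dimensional vectors are invented for this example. Real word2vec vectors are learned from a large corpus and usually have far more dimensions, and the helper names here are hypothetical rather than part of Gensim.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, NOT output from a trained model.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def most_similar(word, embeddings):
    """Return the other word whose vector has the highest cosine similarity."""
    target = embeddings[word]
    candidates = (other for other in embeddings if other != word)
    return max(candidates, key=lambda other: cosine_similarity(target, embeddings[other]))


print(most_similar("king", toy_embeddings))  # queen
```

Unlike the bag-of-words columns, which only record whether or how often a word appears, embedding vectors place related words close together, so similarity between words can be measured directly.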

Course Spotlight: Learn Text Classification With Python and Keras

In this course, you’ll learn about Python text classification with Keras, working your way from a bag-of-words model with logistic regression to more advanced methods, such as convolutional neural networks. You’ll see how you can use pretrained word embeddings, and you’ll squeeze more performance out of your model through hyperparameter optimization.

Topics:

  • 00:00:00 – Introduction
  • 00:02:47 – Exploring the topic
  • 00:06:00 – Perceived sentience of LaMDA
  • 00:10:24 – How do we get started?
  • 00:11:16 – What are classification and sentiment analysis?
  • 00:13:03 – Transforming text into rows and columns
  • 00:14:47 – Sponsor: Snyk
  • 00:15:27 – Bag-of-words approach
  • 00:19:12 – Stemming and lemmatization
  • 00:22:05 – Capturing N-grams
  • 00:25:34 – Count vectorization
  • 00:27:14 – Stop words
  • 00:28:46 – Term Frequency / Inverse Document Frequency (TF-IDF) vectorization
  • 00:32:28 – Potential projects for bag-of-words techniques
  • 00:34:07 – Video Course Spotlight
  • 00:35:20 – WordNet and NLTK package
  • 00:37:27 – Word embeddings and word2vec
  • 00:45:30 – Previous training and too many dimensions
  • 00:50:07 – How to use word2vec and Gensim?
  • 00:51:26 – What types of projects for word2vec and Gensim?
  • 00:54:41 – Getting into GPT and BERT in another episode
  • 00:56:11 – How to follow Jodie’s work?
  • 00:57:36 – Thanks and goodbye

Show Links:

    • Why Google’s “sentient” AI LaMDA is nothing like a person.
    • On NYT Magazine on AI: Resist the Urge to be Impressed | Emily M. Bender | Medium
    • ELIZA - Wikipedia
    • eliza.py - Python 2 version by Daniel Connelly
    • dabraude/Pyliza: Python3 Implementation of Eliza
    • magneticpoetry.com
    • Natural Language Processing With Python’s NLTK Package – Real Python
    • Practical Text Classification With Python and Keras – Real Python
    • Sentiment Analysis: First Steps With Python’s NLTK Library – Real Python
    • NLTK: Natural Language Toolkit
    • spaCy · Industrial-strength Natural Language Processing in Python
    • Natural Language Processing With spaCy in Python - Real Python
    • Stemming - Wikipedia
    • Lemmatization - Wikipedia
    • Binary/Count Vectorization: sklearn.feature_extraction.text.CountVectorizer — scikit-learn
    • TFIDF: sklearn.feature_extraction.text.TfidfVectorizer — scikit-learn
    • Porter Stemmer: nltk.stem.porter module — NLTK
    • Snowball Stemmer: nltk.stem.snowball module — NLTK
    • WordNet Lemmatizer: nltk.stem.wordnet module — NLTK
    • Lemmatizer · spaCy API Documentation
    • Applying Bag of Words and Word2Vec models on Reuters-21578 Dataset – Elvin Ouyang’s Blog
    • UCI Machine Learning Repository: Reuters-21578 Text Categorization Collection Data Set
    • The Illustrated Word2vec – Jay Alammar
    • A Complete Guide to Using WordNET in NLP Applications
    • Gensim: Topic modeling for humans
    • Core Tutorials — gensim
    • Find Open Datasets and Machine Learning Projects | Kaggle
    • Engineering All Hands: Vectorise all the things! - YouTube
    • PyCon Portugal 2022
    • NDC Oslo 2022 | Conference for Software Developers
    • Jodie Burchell’s Blog - Standard error
    • Jodie Burchell 🇦🇺🇩🇪 (@t_redactyl) / Twitter
    • JetBrains: Essential tools for software developers and teams

Level up your Python skills with our expert-led courses:

  • Data Cleaning With pandas and NumPy
  • Reading and Writing Files With pandas
  • Learn Text Classification With Python and Keras

Support the podcast & join our community of Pythonistas
