The Real Python Podcast

Web Scraping in Python: Tools, Techniques, and Legality



Do you want to get started with web scraping using Python? Are you concerned about the potential legal implications? What tools do you need, and what are some of the best practices? This week on the show, we have Kimberly Fessel to discuss her excellent tutorial, created for PyCon 2020 online, titled “It’s Officially Legal so Let’s Scrape the Web.”

We discuss getting started with web scraping and cover tools and techniques. Kimberly gives advice on finding elements inside the HTML and shares techniques for cleaning your data. She also notes a recent change to the legal landscape regarding scraping the web.
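As a rough illustration of those two steps (finding elements in the HTML, then cleaning the extracted text), here is a minimal sketch. It is not taken from the episode or from Kimberly's tutorial. It uses only the standard library's `html.parser` so that it runs without extra installs; Beautiful Soup, linked in the show notes, offers a friendlier API for the same job. The sample HTML, the `title` and `gross` class names, and the helpers `CellCollector`, `clean_money`, and `scrape_rows` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a page you have already downloaded.
SAMPLE_HTML = """
<table id="movies">
  <tr><td class="title"> Example Movie A </td><td class="gross">$1,234,567</td></tr>
  <tr><td class="title">Example Movie B</td><td class="gross">$89,000</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class of the <td> we are currently inside, if any
        self.cells = []      # (class_name, raw_text) pairs in document order

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "td" and css_class in self.wanted:
            self.current = css_class
            self.cells.append((css_class, ""))

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open cell.
        if self.current is not None:
            name, text = self.cells[-1]
            self.cells[-1] = (name, text + data)

    def handle_endtag(self, tag):
        if tag == "td":
            self.current = None


def clean_money(raw):
    """Turn a string such as '$1,234,567' into the integer 1234567."""
    return int(raw.strip().replace("$", "").replace(",", ""))


def scrape_rows(html):
    """Return one dict per table row with a stripped title and integer gross."""
    parser = CellCollector(wanted=["title", "gross"])
    parser.feed(html)
    parser.close()
    rows, row = [], {}
    for name, text in parser.cells:
        row[name] = text
        if len(row) == 2:  # both columns seen, so the row is complete
            rows.append({"title": row["title"].strip(),
                         "gross": clean_money(row["gross"])})
            row = {}
    return rows


rows = scrape_rows(SAMPLE_HTML)
print(rows)
```

In practice you would find the right tag and class names by inspecting the page in your browser's developer tools, which the episode also covers.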

Kimberly is a Senior Data Scientist at Metis Data Science Bootcamp in New York City. She holds a Ph.D. in applied mathematics. We talk about her switch from academia to data science, and discuss her passion for data storytelling and visualizations.

Course Spotlight: Defining Main Functions in Python

This course will get you up to speed with defining a starting point for the execution of a program and help you understand what goes into the main() function. Prepare for a deep dive as you go through the sections. It’s a worthy investment of your time to understand this vital entry point for your Python scripts and applications!

Topics:

  • 00:00:00 – Introduction
  • 00:01:31 – Kimberly’s background and Metis Data Science Bootcamp
  • 00:02:19 – NLP and work in advertising
  • 00:03:27 – Changes in the legality of web scraping
  • 00:06:12 – What are good projects for web scraping?
  • 00:06:56 – Tools to start web scraping
  • 00:07:51 – How to find the elements you want?
  • 00:09:00 – How much HTML should you know?
  • 00:10:49 – Inspecting elements in the browser
  • 00:14:30 – What are good sites to practice on?
  • 00:16:20 – Pausing between requests
  • 00:19:02 – Saving as you go
  • 00:20:54 – Real Python Video Course Spotlight
  • 00:21:55 – Navigating the DOM
  • 00:23:10 – Data cleaning and formatting
  • 00:28:26 – Dynamic sites and Selenium
  • 00:32:16 – Scrapy
  • 00:33:55 – PyOhio 2020
  • 00:35:40 – Transition out of academia
  • 00:38:40 – What are you excited about in the world of Python?
  • 00:41:05 – What do you want to learn next in Python?
  • 00:48:00 – What is a less known Python tip or trick?
  • 00:49:17 – Thanks and Goodbye

Show Links:

    • Kimberly Fessel, PHD - Blog
    • Metis: Data Science Training
    • It’s Officially Legal so Let’s Scrape the Web: PyCon 2020 online - Tutorial
    • Victory! Ruling in hiQ v. Linkedin Protects Scraping of Public Data: EFF.org
    • Computer Fraud and Abuse Act - Wikipedia Article
    • Box Office Mojo
    • Sports Reference | Sports Stats, fast, easy, and up-to-date
    • Springfield! Springfield! - TV & Movie Scripts - Archive.org
    • Jupyter Notebook: An Introduction - Real Python Article
    • The Python pickle Module: How to Persist Objects in Python - Real Python Article
    • A Practical Introduction to Web Scraping in Python - Real Python Article
    • Beautiful Soup: Build a Web Scraper With Python - Real Python Article
    • Making HTTP Requests With Python - Real Python Video Course
    • Natural Language Processing With spaCy in Python - Real Python Article
    • Delorean: Time Travel Made Easy
    • Maya: Datetimes for Humans
    • Regular Expressions: Regexes in Python (Part 1) - Real Python Article
    • Selenium: Automates browsers. That’s it!
    • Scrapy: Framework for extracting the data you need from websites
    • PyOhio 2020
    • ODSC: Open Data Science Conference
    • Slides from Kimberly’s talk - Level Up: Fancy NLP with Straightforward Tools
    • Tonks: A general purpose deep learning library
    • Tonks: Building One (Multi-Task) Model to Rule Them All! - Medium Article
    • Plotly | Dash
    • geoplotlib: Python toolbox for visualizing geographical data and making maps
    • GeoPandas: Make working with geospatial data in Python easier
    • Altair: Declarative Visualization in Python
    • Understanding the Transform Function in Pandas: Practical Business Python
    • JavaScript charting detour:

      • Down and Up: A Puzzle Illustrated with D3.js - Kimberly’s blog
      • d3js - Data-Driven Documents
      • Crossfilter: Fast Multidimensional Filtering for Coordinated Views
      • dc.js - Dimensional Charting JavaScript Library

Level up your Python skills with our expert-led courses:

  • Defining Main Functions in Python
  • Making HTTP Requests With Python
  • Strings and Character Data in Python

Support the podcast & join our community of Pythonistas
