Do you want to get started with web scraping using Python? Are you concerned about the potential legal implications? What are the tools required and what are some of the best practices? This week on the show we have Kimberly Fessel to discuss her excellent tutorial created for PyCon 2020 online titled “It’s Officially Legal so Let’s Scrape the Web.”
We discuss getting started with web scraping and cover tools and techniques. Kimberly gives advice on finding elements inside the HTML and on techniques for cleaning your data. She also notes a recent change to the legal landscape regarding scraping the web.
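To give a flavor of what finding elements and cleaning data can look like, here is a minimal sketch that uses only the Python standard library. It is not taken from Kimberly's tutorial. The HTML snippet, the `gross` class name, and the helper names `CellCollector`, `clean_dollars`, and `extract_gross` are all made up for illustration, and a real project would more likely use a library such as Beautiful Soup.

```python
from html.parser import HTMLParser
import re

# Hypothetical HTML standing in for a page you have already downloaded.
SAMPLE_HTML = """
<table>
  <tr><td class="title">Movie A</td><td class="gross"> $1,234,567 </td></tr>
  <tr><td class="title">Movie B</td><td class="gross">$89,000</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute matches."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Start a new cell when we enter a matching <td>.
        if tag == "td" and dict(attrs).get("class") == self.wanted_class:
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching cell.
        if self.in_cell:
            self.cells[-1] += data


def clean_dollars(raw):
    """Strip everything except digits, e.g. ' $1,234,567 ' becomes 1234567."""
    return int(re.sub(r"[^\d]", "", raw))


def extract_gross(html):
    """Find the 'gross' cells in the HTML and return them as integers."""
    parser = CellCollector("gross")
    parser.feed(html)
    return [clean_dollars(cell) for cell in parser.cells]


print(extract_gross(SAMPLE_HTML))
```

The two halves mirror the topics in the episode: first locate the elements you care about by tag and class, then clean the raw strings into values you can analyze.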
Kimberly is a Senior Data Scientist at Metis Data Science Bootcamp in New York City. She holds a Ph.D. in applied mathematics. We talk about her switch from academia to data science, and discuss her passion for data storytelling and visualizations.
Course Spotlight: Defining Main Functions in Python
This course will get you up to speed with defining a starting point for the execution of a program and help you understand what goes into the main() function. Prepare for a deep dive as you go through the sections. It’s a worthy investment of your time to understand this vital entry point for your Python scripts and applications!
00:00:00 – Introduction
00:01:31 – Kimberly’s background and Metis Data Science Bootcamp
00:02:19 – NLP and work in advertising
00:03:27 – Changes in the legality of web scraping
00:06:12 – What are good projects for web scraping?
00:06:56 – Tools to start web scraping
00:07:51 – How to find the elements you want?
00:09:00 – How much HTML should you know?
00:10:49 – Inspecting elements in the browser
00:14:30 – What are good sites to practice on?
00:16:20 – Pausing between requests
00:19:02 – Saving as you go
00:20:54 – Real Python Video Course Spotlight
00:21:55 – Navigating the DOM
00:23:10 – Data cleaning and formatting
00:28:26 – Dynamic sites and Selenium
00:32:16 – Scrapy
00:33:55 – PyOhio 2020
00:35:40 – Transition out of academia
00:38:40 – What are you excited about in the world of Python?
00:41:05 – What do you want to learn next in Python?
00:48:00 – What is a less known Python tip or trick?
00:49:17 – Thanks and Goodbye

Kimberly Fessel, PHD - Blog
Metis: Data Science Training
It’s Officially Legal so Let’s Scrape the Web: PyCon 2020 online - Tutorial
Victory! Ruling in hiQ v. Linkedin Protects Scraping of Public Data: EFF.org
Computer Fraud and Abuse Act - Wikipedia Article
Box Office Mojo
Sports Reference | Sports Stats, fast, easy, and up-to-date
Springfield! Springfield! - TV & Movie Scripts - Archive.org
Jupyter Notebook: An Introduction - Real Python Article
The Python pickle Module: How to Persist Objects in Python - Real Python Article
A Practical Introduction to Web Scraping in Python - Real Python Article
Beautiful Soup: Build a Web Scraper With Python - Real Python Article
Making HTTP Requests With Python - Real Python Video Course
Natural Language Processing With spaCy in Python - Real Python Article
Delorean: Time Travel Made Easy
Maya: Datetimes for Humans
Regular Expressions: Regexes in Python (Part 1) - Real Python Article
Selenium: Automates browsers. That’s it!
Scrapy: Framework for extracting the data you need from websites
PyOhio 2020
ODSC: Open Data Science Conference
Slides from Kimberly’s talk - Level Up: Fancy NLP with Straightforward Tools
Tonks: A general purpose deep learning library
Tonks: Building One (Multi-Task) Model to Rule Them All! - Medium Article
Plotly | Dash
geoplotlib: Python toolbox for visualizing geographical data and making maps
GeoPandas: Make working with geospatial data in Python easier
Altair: Declarative Visualization in Python
Understanding the Transform Function in Pandas: Practical Business Python

JavaScript charting detour:
Down and Up: A Puzzle Illustrated with D3.js - Kimberly’s blog
d3js - Data-Driven Documents
Crossfilter: Fast Multidimensional Filtering for Coordinated Views
dc.js - Dimensional Charting JavaScript Library

Level up your Python skills with our expert-led courses:
Defining Main Functions in Python
Making HTTP Requests With Python
Strings and Character Data in Python

Support the podcast & join our community of Pythonistas