Week 6
Web Scraping

Oct 12, 2022

Housekeeping¶

  • Homework #3 (required) due on Monday (10/17)
  • Homework #4 (optional) assigned 10/17, due in two weeks
  • You must complete one of homeworks #4, #5, or #6
  • Final project due at the end of the finals period...more details coming soon

Week 6 agenda: web scraping¶

Last time:

  • Why web scraping?
  • Getting familiar with the Web
  • Web scraping: extracting data from static sites

Today:

  • Practice with web scraping
  • How to deal with dynamic content
In [1]:
# Start with the usual imports
# We'll use these throughout
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

from bs4 import BeautifulSoup
import requests

Scraping: Adding the User-Agent Header¶

Many websites will reject a requests.get() function call if you do not specify the User-Agent header as part of your GET request. This lets the website identify who is making the GET request. You can find your browser's User-Agent value in the "Network" tab of your browser's developer tools. If you click on any request listed on this tab, and go to the "Headers" tab, you should see the "user-agent" value listed:

[Screenshot: the "user-agent" value shown in the Headers tab of the browser's developer tools]
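For example, the same page will often return a 403 Forbidden status for a bare request but a 200 OK once the header is set. A minimal sketch (whether the bare request is actually rejected depends on the site):

# Compare the status codes with and without the User-Agent header
import requests

url = "https://www.phila.gov/programs/coronavirus-disease-2019-covid-19/updates/"
user_agent = "Mozilla/5.0"  # any realistic browser string; copy yours from the Network tab

bare = requests.get(url)
with_header = requests.get(url, headers={"User-Agent": user_agent})

print(bare.status_code, with_header.status_code)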

Example: Let's get COVID-19 stats in Philadelphia¶

In [2]:
url = "https://www.phila.gov/programs/coronavirus-disease-2019-covid-19/updates/"

Get this from the browser:

In [3]:
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.37"
In [4]:
result = requests.get(url, headers={"User-Agent": user_agent}) # NEW: Specify the "User-Agent" header
soup = BeautifulSoup(result.content, "html.parser")

Get the average case count¶

Use the web inspector to identify the correct CSS selector (right click -> Inspect and then right click -> Copy -> Copy Selector)

In [5]:
selector = "#post-263624 > div.one-quarter-layout > div:nth-child(1) > div.medium-18.columns.pbxl > ul > li:nth-child(1)"

Select the element using the CSS selector and get the text:

In [6]:
avg = soup.select_one(selector).text
In [7]:
avg
Out[7]:
'Average new cases per day: 177'

Split the string into words:

In [8]:
words = avg.split()

words
Out[8]:
['Average', 'new', 'cases', 'per', 'day:', '177']

Get the last element and convert to an integer:

In [9]:
int(words[-1])
Out[9]:
177
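Putting the pieces together, a small helper function makes this reusable (a sketch; the function name is ours, and the selector is the same one copied from the Web Inspector, so it will break if the page layout changes):

def get_avg_case_count(url, selector, user_agent):
    """Scrape the page and return the average daily case count as an integer."""
    # Request the page with the User-Agent header and parse the HTML
    r = requests.get(url, headers={"User-Agent": user_agent})
    page = BeautifulSoup(r.content, "html.parser")

    # Grab the text of the matching element and keep the trailing number
    text = page.select_one(selector).text
    return int(text.split()[-1])

# Usage, with the url, selector, and user_agent defined above
get_avg_case_count(url, selector, user_agent)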

Get the last updated date¶

In [10]:
selector = "#post-263624 > div.one-quarter-layout > div:nth-child(1) > div.medium-18.columns.pbxl > p:nth-child(3) > em"
In [11]:
last_updated = soup.select_one(selector).text

last_updated
Out[11]:
'Cases last updated: October 4, 2022\nHospitalizations last updated: September 28, 2022'
In [12]:
print(last_updated)
Cases last updated: October 4, 2022
Hospitalizations last updated: September 28, 2022

Break into lines:

In [13]:
lines = last_updated.splitlines()

lines
Out[13]:
['Cases last updated: October 4, 2022',
 'Hospitalizations last updated: September 28, 2022']

Split by the colon:

In [14]:
lines[0].split(":")
Out[14]:
['Cases last updated', ' October 4, 2022']
In [15]:
last_updated_date = lines[0].split(":")[-1]

last_updated_date
Out[15]:
' October 4, 2022'

Convert to a timestamp:

In [16]:
timestamp = pd.to_datetime(last_updated_date)

timestamp
Out[16]:
Timestamp('2022-10-04 00:00:00')
In [17]:
timestamp.strftime("%B %-d, %Y")
Out[17]:
'October 4, 2022'
In [18]:
timestamp.strftime("%m/%d/%y")
Out[18]:
'10/04/22'

Part 1: Web scraping exercises¶

Even more: 101 Web Scraping Exercises

For each of the exercises, use the Web Inspector to inspect the structure of the relevant web page, and identify the HTML content you will need to scrape with Python.

1. The number of days until the General Election¶

  • Relevant URL: https://vote.phila.gov/

Hints:

  • Select the element that holds the number of days
  • You will need to specify the "User-Agent" header, otherwise you will get a 403 Forbidden error
In [52]:
# Initialize the soup for this page
url = "https://vote.phila.gov"
r = requests.get(
    url,
    headers={"User-Agent": user_agent},
)  # Add the user-agent

soup2 = BeautifulSoup(r.content, "html.parser")
In [54]:
# Select the h1 element holding the number of days
# You can get this via the Web Inspector
selector = ".day-count"
In [55]:
# Select the element (use select_one to get only the first match)
element = soup2.select_one(selector)
In [56]:
# Get the text of the element
days = element.text
print("Raw days value = ", days)
Raw days value =  26
In [57]:
# Convert to float
days = float(days)
print(f"Number of days until General Election = {days}")
Number of days until General Election = 26.0
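If the element's text ever contains more than just the number (extra labels or whitespace), a regular expression is a more defensive way to pull out the digits. A sketch, assuming the count is the first run of digits in the element's text:

import re

# Find the first run of digits in the element's text and convert to an integer
match = re.search(r"\d+", element.text)
if match is not None:
    days = int(match.group())
    print(f"Number of days until General Election = {days}")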

2. Philadelphia City Council¶

A number of councilmembers have resigned in order to run for mayor in the spring. Let's find out how many seats are on Council and how many are currently vacant!

Determine two things:

  • The total number of City Council seats
  • The total number of vacant City Council seats

Relevant URL: https://phlcouncil.com/council-members/

Hints:

  • You will need to specify the "User-Agent" header, otherwise you will get a 403 Forbidden error
  • The cards on the page flip, and the Councilmember names are listed on the front AND the back. The front and back content are separate; you should see "div" elements with "front" and "back" classes.
  • When creating your CSS selector, use a nested selector that first selects the front content and then selects the name displayed on the card
In [58]:
# Make the request
url = "https://phlcouncil.com/council-members/"
r = requests.get(
    url, headers={"User-Agent": user_agent}
)  # NOTE: include the user agent! Otherwise you get a 403 Forbidden error


# Parse the html
soup3 = BeautifulSoup(r.content, "html.parser")

If you select just .x-face-title, you get duplicates from the front and the back of each card!

In [61]:
soup3.select(".x-face-title")
Out[61]:
[<h4 class="x-face-title"><strong>Darrell L. Clarke</strong></h4>,
 <h4 class="x-face-title">Council President Darrell L. Clarke</h4>,
 <h4 class="x-face-title"><strong>Mark Squilla</strong></h4>,
 <h4 class="x-face-title">Mark Squilla</h4>,
 <h4 class="x-face-title"><strong>Kenyatta Johnson</strong></h4>,
 <h4 class="x-face-title">Kenyatta Johnson</h4>,
 <h4 class="x-face-title"><strong>Jamie Gauthier</strong></h4>,
 <h4 class="x-face-title">Jamie Gauthier</h4>,
 <h4 class="x-face-title"><strong>Curtis Jones, Jr.</strong></h4>,
 <h4 class="x-face-title">Curtis Jones, Jr.</h4>,
 <h4 class="x-face-title"><strong>Michael Driscoll</strong></h4>,
 <h4 class="x-face-title">MICHAEL DRISCOLL</h4>,
 <h4 class="x-face-title"><strong>Vacant</strong></h4>,
 <h4 class="x-face-title">Vacant</h4>,
 <h4 class="x-face-title"><strong>Cindy Bass</strong></h4>,
 <h4 class="x-face-title">Cindy Bass</h4>,
 <h4 class="x-face-title"><strong>vacant</strong></h4>,
 <h4 class="x-face-title">vacant</h4>,
 <h4 class="x-face-title"><strong>Brian J. O’Neill</strong></h4>,
 <h4 class="x-face-title">Brian J. O’Neill</h4>,
 <h4 class="x-face-title"><strong>Kendra Brooks</strong></h4>,
 <h4 class="x-face-title">Kendra Brooks</h4>,
 <h4 class="x-face-title"><strong>VACANt</strong></h4>,
 <h4 class="x-face-title">Vacant</h4>,
 <h4 class="x-face-title"><strong>Vacant</strong></h4>,
 <h4 class="x-face-title">Vacant</h4>,
 <h4 class="x-face-title"><strong>Katherine Gilmore Richardson</strong></h4>,
 <h4 class="x-face-title">Katherine Gilmore Richardson</h4>,
 <h4 class="x-face-title"><strong>Helen Gym</strong></h4>,
 <h4 class="x-face-title">Helen Gym</h4>,
 <h4 class="x-face-title"><strong>David Oh</strong></h4>,
 <h4 class="x-face-title">David Oh</h4>,
 <h4 class="x-face-title"><strong>Isaiah Thomas</strong></h4>,
 <h4 class="x-face-title">Isaiah Thomas</h4>]

Add the ".x-face-outer.front" classes to select just the x-face-title elements on the front of the card!

In [62]:
selector = '.x-face-outer.front .x-face-title'
In [63]:
name_elements = soup3.select(selector)
In [64]:
name_elements
Out[64]:
[<h4 class="x-face-title"><strong>Darrell L. Clarke</strong></h4>,
 <h4 class="x-face-title"><strong>Mark Squilla</strong></h4>,
 <h4 class="x-face-title"><strong>Kenyatta Johnson</strong></h4>,
 <h4 class="x-face-title"><strong>Jamie Gauthier</strong></h4>,
 <h4 class="x-face-title"><strong>Curtis Jones, Jr.</strong></h4>,
 <h4 class="x-face-title"><strong>Michael Driscoll</strong></h4>,
 <h4 class="x-face-title"><strong>Vacant</strong></h4>,
 <h4 class="x-face-title"><strong>Cindy Bass</strong></h4>,
 <h4 class="x-face-title"><strong>vacant</strong></h4>,
 <h4 class="x-face-title"><strong>Brian J. O’Neill</strong></h4>,
 <h4 class="x-face-title"><strong>Kendra Brooks</strong></h4>,
 <h4 class="x-face-title"><strong>VACANt</strong></h4>,
 <h4 class="x-face-title"><strong>Vacant</strong></h4>,
 <h4 class="x-face-title"><strong>Katherine Gilmore Richardson</strong></h4>,
 <h4 class="x-face-title"><strong>Helen Gym</strong></h4>,
 <h4 class="x-face-title"><strong>David Oh</strong></h4>,
 <h4 class="x-face-title"><strong>Isaiah Thomas</strong></h4>]
In [65]:
print(f"Total number of city councilmembers is {len(name_elements)}")
Total number of city councilmembers is 17
In [66]:
names = [el.text.strip().lower() for el in name_elements]
In [67]:
names
Out[67]:
['darrell l. clarke',
 'mark squilla',
 'kenyatta johnson',
 'jamie gauthier',
 'curtis jones, jr.',
 'michael driscoll',
 'vacant',
 'cindy bass',
 'vacant',
 'brian j. o’neill',
 'kendra brooks',
 'vacant',
 'vacant',
 'katherine gilmore richardson',
 'helen gym',
 'david oh',
 'isaiah thomas']

Find which names equal "vacant":

In [68]:
[name == 'vacant' for name in names]
Out[68]:
[False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 False,
 False,
 True,
 True,
 False,
 False,
 False,
 False]

Count the number of vacants!

In [69]:
sum([name == 'vacant' for name in names])
Out[69]:
4
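Equivalently, names.count("vacant") returns the same count, since the names were already lower-cased above.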

Derek Green, Maria Quiñones-Sánchez, Allan Domb, and Cherelle Parker have all resigned with the expectation that they will run for mayor in the spring primary.

More: https://www.inquirer.com/politics/philadelphia/philadelphia-city-council-fills-open-seats-allan-domb-20220816.html

3. Food inspections in Philadelphia¶

Extract the following:

  • the names and number of violations per inspection for food-borne risk factors (as a DataFrame)
  • the total number of violations

Note: we are looking for food-borne violations only; not all of the restaurants on the page will have them listed

Relevant URL: http://data.inquirer.com/inspections

In [70]:
# Parse the HTML
url = "http://data.inquirer.com/inspections/"
soup4 = BeautifulSoup(requests.get(url).content, 'html.parser')
In [71]:
# This will select all rows of the table
rows = soup4.select(".inspectionUnitInner")

len(rows)
Out[71]:
50
In [73]:
# The first row
rows[0]
Out[73]:
<div class="inspectionUnitInner"><div class="inspectionNameWrapper"><div class="inspectionUnitName transitionAll">7-Eleven #2408-35275J</div><div class="inspectionUnitDate"><span class="inspectionUnitDateTitle">Inspection date:</span> Oct 14, 2022</div><div class="clearAll"></div></div><div class="inspectionUnitInfoWrapper"><div class="inspectionUnitAddress">1084 N DELAWARE AVE 19125</div><div class="inspectionUnitNeigborhood"></div><div class="clearAll"></div></div><div class="inspectionUnitCountWrapper"><span class="inspectionCountLabel">Violations</span><li class="inspectionUnitCount inspectionUnitCountFoodborne inspectionUnitCountFirst"><span class="inspectionCountNumber">4</span><span class="inspectionUnitInfoItemTitle"><span class="inspectionUnitInfoItemTitleLabel">Foodborne Illness Risk Factors</span></span></li><li class="inspectionUnitCount inspectionUnitCountRetail"><span class="inspectionCountNumber">7</span><span class="inspectionUnitInfoItemTitle"><span class="inspectionUnitInfoItemTitleLabel">Lack of Good Retail Practices</span></span></li><div class="clearAll"></div></div><div class="clearAll"></div></div>
In [74]:
# Keep track of the restaurant names and violations
names = []
violations = []

# Loop over each row
for row in rows:
    
    # The name of the restaurant
    name_tag = row.select_one(".inspectionUnitName")
    name = name_tag.text
    
    # The number of foodborne violations
    count = row.select_one(".inspectionUnitCountFoodborne .inspectionCountNumber")
    
    # Only save it if count was listed (0 violations will show up as None)
    if count is not None:
        names.append(name)
        violations.append(int(count.text))

df = pd.DataFrame({"name" : names, "violations" : violations})

df.sort_values("violations", ascending=False)
Out[74]:
name violations
7 Cerda Grocery Inc. 7
8 Fairmount Pizzeria 6
0 7-Eleven #2408-35275J 4
1 B&R Grocery 4
10 Care to Learn Child Development Center 3
15 Girard Neighborhood Food Market 3
13 Delicias Meat & Produce 3
12 Delianny Mini Market 3
11 Cousin's Fresh Market 3 3
14 G & J 1526 Tasker Grocery 3
9 Aid For Friends 3
2 Great Valu 2
18 Cabrera,Javier/Tacos La Charreada Inc/V07250 2
17 Brunch N 2
16 Abi's Bargain Outlet 2
19 Charles Audenreid Charter Private School 2
3 52 Kings Food Market Inc 1
4 Gilbert Spruance School 1
5 Haydee Mini Market 1
6 James J. Sullivan Elementary School 1
20 3J's Food Market 1
21 A S Jenks School 1
22 AM Deli Grocery II Inc. 1
23 Burger Fi 1
24 Discount Store 2 1
25 Germantown Home 1
26 Holy Cross Parish School 1
27 Julia DeBurgos Bilingual School 1
28 Knorr Street Shoprite Inc 1
In [76]:
print("total number of foodborne violations = ", df['violations'].sum())
total number of foodborne violations =  65
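The same pattern works for the second violation category on each card. A sketch that also pulls the "Lack of Good Retail Practices" counts, using the .inspectionUnitCountRetail class visible in the row HTML above:

# Keep track of the restaurant names and retail-practice violations
retail_names = []
retail_violations = []

for row in rows:
    # The name of the restaurant
    name = row.select_one(".inspectionUnitName").text

    # The number of retail-practice violations (None if not listed)
    count = row.select_one(".inspectionUnitCountRetail .inspectionCountNumber")

    if count is not None:
        retail_names.append(name)
        retail_violations.append(int(count.text))

retail_df = pd.DataFrame({"name": retail_names, "violations": retail_violations})
retail_df.sort_values("violations", ascending=False)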

Part 2: What about dynamic content?¶

How do you scrape data that only appears after user interaction?

Selenium¶

Note: web browser needed¶

You'll need a web browser installed to use selenium, e.g., Firefox, Google Chrome, Microsoft Edge, etc.

Selenium¶

  • Designed as a framework for testing webpages during development
  • Provides an interface to interact with webpages just as a user would
  • Becoming increasingly popular for web scraping dynamic content from pages

Best by example: Scraping the Philadelphia Municipal Courts portal¶

  • URL: https://ujsportal.pacourts.us/CaseSearch
  • Given a Police incident number, we'll see if there is an associated court case with the incident

Selenium will open a web browser and load the page; the browser will then respond to the commands issued by selenium

In [77]:
# Import the webdriver from selenium
from selenium import webdriver

Initialize the driver¶

The initialization steps will depend on which browser you want to use!

Important: Working on Binder¶

If you are working on Binder, you'll need to use Firefox in "headless" mode, which prevents a browser window from opening.

If you are working locally, it's better to run with the default options — you'll be able to see the browser window open and change as we perform the web scraping.

Using Google Chrome¶

In [90]:
# UNCOMMENT BELOW TO USE CHROME

#from webdriver_manager.chrome import ChromeDriverManager
#from selenium.webdriver.chrome.service import Service


#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
[WDM] - Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 7.52M/7.52M [00:00<00:00, 24.0MB/s]
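If you are working locally with Chrome but don't want a browser window to open, Chrome also supports a "headless" mode. A sketch (kept commented out like the cells above; it assumes Chrome and webdriver_manager are installed):

# from webdriver_manager.chrome import ChromeDriverManager
# from selenium.webdriver.chrome.service import Service

# options = webdriver.ChromeOptions()
# options.add_argument("--headless")  # no browser window is opened

# driver = webdriver.Chrome(
#     service=Service(ChromeDriverManager().install()), options=options
# )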

Using Firefox¶

If you are working on Binder, use the below code!

In [53]:
# UNCOMMENT BELOW IF ON BINDER

# from webdriver_manager.firefox import GeckoDriverManager
# from selenium.webdriver.firefox.service import Service

# options = webdriver.FirefoxOptions()

# IF ON BINDER, RUN IN "HEADLESS" MODE (NO BROWSER WINDOW IS OPENED)
# COMMENT THIS LINE IF WORKING LOCALLY
# options.add_argument("--headless")

# Initialize
# driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)

Using Microsoft Edge¶

In [51]:
# UNCOMMENT BELOW TO USE MICROSOFT EDGE

# from webdriver_manager.microsoft import EdgeChromiumDriverManager
# from selenium.webdriver.edge.service import Service

# driver = webdriver.Edge(service=Service(EdgeChromiumDriverManager().install()))

Run the scraping analysis¶

Strategy:

  • Rely on the Web Inspector to identify specific elements of the webpage
  • Use Selenium to interact with the webpage
    • Change dropdown elements
    • Click buttons

1. Open the URL¶

In [91]:
# Open the URL
url = "https://ujsportal.pacourts.us/CaseSearch"
driver.get(url)

2. Create a dropdown "Select" element¶

We'll need to:

  • Select the dropdown element on the main page by its ID
  • Initialize a selenium Select() object
In [92]:
# Use the Web Inspector to get the css selector of the dropdown select element
dropdown_selector = "#SearchBy-Control > select"
In [93]:
from selenium.webdriver.common.by import By

# Select the dropdown by the element's CSS selector
dropdown = driver.find_element(By.CSS_SELECTOR, dropdown_selector)
In [94]:
from selenium.webdriver.support.ui import Select

# Initialize a Select object
dropdown_select = Select(dropdown)

3. Change the selected text in the dropdown¶

Change the selected element: "Police Incident/Complaint Number"

In [95]:
# Set the selected text in the dropdown element
dropdown_select.select_by_visible_text("Incident Number")

4. Set the incident number¶

In [96]:
# Get the input element for the DC number
incident_input_selector = "#IncidentNumber-Control > input"
incident_input = driver.find_element(By.CSS_SELECTOR, incident_input_selector)
In [97]:
# Clear any existing entry
incident_input.clear()

# Input our example incident number
incident_input.send_keys("1725088232")

5. Click the search button!¶

In [98]:
# Submit the search
search_button_id = "btnSearch"
driver.find_element(By.ID, search_button_id).click()
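Because the results are loaded dynamically, the table can take a moment to appear after the click. If you run into timing issues, you can tell Selenium to wait explicitly for the results element before reading the page (a sketch; it assumes the results table keeps the "caseSearchResultGrid" ID we use below):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the results table to be present on the page
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "caseSearchResultGrid"))
)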

6. Use BeautifulSoup to parse the results¶

  • Use the page_source attribute to get the current HTML displayed on the page
  • Initialize a "soup" object with the HTML
In [101]:
courtsSoup = BeautifulSoup(driver.page_source, "html.parser")
  • Identify the element holding all of the results
  • Within this container, find the <table> element and each <tr> element within the table
In [102]:
# Select the results container by its ID 
results_table = courtsSoup.select_one("#caseSearchResultGrid")
In [103]:
# Get all of the <tr> rows inside the tbody element 
# NOTE: we are using nested selections here!
results_rows = results_table.select("tbody > tr")

Example: The number of court cases

In [104]:
# Number of court cases
number_of_cases = len(results_rows)
print(f"Number of courts cases: {number_of_cases}")
Number of courts cases: 2

Example: Extract the text elements from the first row of the results

In [105]:
first_row = results_rows[0]
In [106]:
print(first_row.prettify())
<tr class="slide-active">
 <td class="display-none">
  1
 </td>
 <td class="display-none">
  0
 </td>
 <td>
  MC-51-CR-0030672-2017
 </td>
 <td>
  Common Pleas
 </td>
 <td>
  Comm. v. Velquez, Victor
 </td>
 <td>
  Closed
 </td>
 <td>
  10/13/2017
 </td>
 <td>
  Velquez, Victor
 </td>
 <td>
  09/05/1974
 </td>
 <td>
  Philadelphia
 </td>
 <td>
  MC-01-51-Crim
 </td>
 <td>
  U0981035
 </td>
 <td>
  1725088232-0030672
 </td>
 <td>
  1725088232
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td>
  <div class="grid inline-block">
   <div>
    <div class="inline-block">
     <a class="icon-wrapper" href="/Report/CpDocketSheet?docketNumber=MC-51-CR-0030672-2017&amp;dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
      <img alt="Docket Sheet" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=3-Me4WMBYQPCgs0IdgGyzeTEx_qd5uveL0qyDZoiHPM#icon-document-letter-D" title="Docket Sheet"/>
      <label class="link-text">
       Docket Sheet
      </label>
     </a>
    </div>
   </div>
  </div>
  <div class="grid inline-block">
   <div>
    <div class="inline-block">
     <a class="icon-wrapper" href="/Report/CpCourtSummary?docketNumber=MC-51-CR-0030672-2017&amp;dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
      <img alt="Court Summary" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=3-Me4WMBYQPCgs0IdgGyzeTEx_qd5uveL0qyDZoiHPM#icon-court-summary" title="Court Summary"/>
      <label class="link-text">
       Court Summary
      </label>
     </a>
    </div>
   </div>
  </div>
 </td>
</tr>

In [107]:
# Extract out all of the "<td>" cells from the first row
td_cells = first_row.select("td")

# Loop over each <td> cell
for cell in td_cells:
    
    # Extract out the text from the <td> element
    text = cell.text
    
    # Print out text
    if text != "":
        print(text)
1
0
MC-51-CR-0030672-2017
Common Pleas
Comm. v. Velquez, Victor
Closed
10/13/2017
Velquez, Victor
09/05/1974
Philadelphia
MC-01-51-Crim
U0981035
1725088232-0030672
1725088232
Docket SheetCourt Summary
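To work with all of the results at once, you can collect the cell text from every row into a DataFrame (a sketch; the columns are left unnamed here because the header cells live in the table's separate <thead> element):

# Collect the text of every <td> cell in every result row
data = []
for row in results_rows:
    data.append([td.text for td in row.select("td")])

cases = pd.DataFrame(data)
cases.head()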

7. Close the driver!¶

In [108]:
driver.close()
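Note: driver.close() closes the current browser window; if you want to shut down the entire browser session (all windows plus the driver process), use driver.quit() instead.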

Part 3: Automated "git scraping"¶


Coined by Simon Willison in this blog post

Example: @PHLHomicides¶

The current YTD homicide total is updated daily on the Police Department's website

[Screenshot: the current YTD homicide total on the Police Department's website]

Data is scraped daily, saved to a CSV file, and added to a Github repository

[Screenshot: the scraped CSV file committed to the Github repository]

Data is then tweeted daily, providing an easily accessible record of homicides over time

[Screenshot: an example daily tweet from @PHLHomicides]

Source code is available on Github at nickhand/phl-homicide-bot

Example: Building a Twitter bot for COVID-19 stats 🤖¶

Key features:

  • Web scraping
  • Twitter API
  • Automation

Example repo available at: https://github.com/MUSA-550-Fall-2022/covid-stats-bot

What it does¶

  1. Scrape the COVID case count from phila.gov using the same code as the first example today
  2. Check if data is newer than the latest saved data
  3. If it is, send a tweet with the info and update the saved CSV file

Use Github Actions to run this workflow once a day
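A minimal sketch of steps 2 and 3, assuming a hypothetical data.csv with "date" and "cases" columns, a hypothetical scrape_covid_stats() helper that wraps the scraping code from the first example, and a hypothetical send_tweet() helper that wraps the Twitter API (the bot's actual code is in the repo linked above):

# Scrape the latest values (hypothetical helper wrapping the scraping code above)
new_date, new_cases = scrape_covid_stats()

# Load the previously saved data (hypothetical CSV with "date" and "cases" columns)
saved = pd.read_csv("data.csv", parse_dates=["date"])

# Only save and tweet if the scraped date is newer than the latest saved date
if new_date > saved["date"].max():
    new_row = pd.DataFrame({"date": [new_date], "cases": [new_cases]})
    pd.concat([saved, new_row]).to_csv("data.csv", index=False)

    # Hypothetical helper that posts the update via the Twitter API
    send_tweet(f"Average new cases per day: {new_cases}")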

Github Actions¶

  • Lots of good documentation to get you up and running quickly: https://docs.github.com/en/actions
  • Allows you to run a pre-defined set of steps (including a Python script) on a set schedule (daily, weekly, etc.)
  • Generous time/CPU limits as long as your repo is public

Github Actions¶

  • Scheduled tasks are run via a workflow '.yml' file — these are the instructions!
  • See the example file in the repo's .github/workflows folder

[Screenshot: the workflow .yml file in the .github/workflows folder]

Github secrets¶

  • If you have API credentials (such as those for Twitter) you should never commit them to Github directly
  • Instead, store them as secrets in the repository
  • Go to settings -> secrets -> new repository secret

[Screenshot: adding a new repository secret in the Github settings]

This will allow you to pass your Twitter API credentials to tweepy without compromising security or storing them in plaintext on Github!
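In the workflow file, each secret is typically exposed to the Python script as an environment variable, which the script then reads with os.environ. A sketch using tweepy's Client (the variable names are hypothetical; use whatever you named your secrets):

import os
import tweepy

# Read the credentials from environment variables set by the Github Actions workflow
# (the names below are hypothetical)
client = tweepy.Client(
    consumer_key=os.environ["TWITTER_API_KEY"],
    consumer_secret=os.environ["TWITTER_API_SECRET"],
    access_token=os.environ["TWITTER_ACCESS_TOKEN"],
    access_token_secret=os.environ["TWITTER_ACCESS_TOKEN_SECRET"],
)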

The final bot¶

Data is tracked and updated over time in data.csv

[Screenshot: the data.csv file tracked in the repository]

Info is also tweeted each time it is updated!

[Screenshot: an example tweet from the COVID stats bot]

That's it!¶

  • Next week: working with "big" data
  • See you on Monday!