Week 1B: Exploratory Data Science in Python¶

9/7/2022

Today¶

  • Introduction to Pandas
  • Key data analysis concepts
  • Example: Zillow + Census housing data

Reminder: The weekly workflow¶

  • You'll set up your local Python environment as part of first homework assignment (posted today!)
  • Each week, you will have two options to follow along with lectures:
    1. Using Binder in the cloud, launching via the button on the week's repository
    2. Download the week's repository to your laptop and launch the notebook locally
  • Work on homeworks locally on your laptop — Binder is only a temporary environment (no save features)

To follow along today, go to https://github.com/MUSA-550-Fall-2022/week-1

Screen%20Shot%202020-09-02%20at%2010.31.40%20PM.png

Python Data Analysis: the pandas package¶

Documentation is available at https://pandas.pydata.org

The new, less fun logo¶

The following line imports the pandas package:

In [1]:
import pandas as pd

Basic pandas concepts¶

The primary objects in pandas are the:

  • DataFrame, which is very similar to an Excel spreadsheet, and has rows and named columns
  • Series, which represents a single column of data. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in R.

You can think Series objects as fancier versions of Python's built-in list data type

To create a Series object:

In [2]:
# city names and population
city_names = pd.Series(["San Francisco", "San Jose", "Sacramento"])
population = pd.Series([852469, 1015785, 485199])
In [3]:
city_names
Out[3]:
0    San Francisco
1         San Jose
2       Sacramento
dtype: object

DataFrame objects can be created by passing a dict mapping string column names to their respective Series.

In [4]:
cities_df = pd.DataFrame({"City Name": city_names, "Population": population})
cities_df
Out[4]:
City Name Population
0 San Francisco 852469
1 San Jose 1015785
2 Sacramento 485199

Note: always try to give your variables meaningful names. It will help immensely when you are trying to debug problems or when you're trying to understand the code you've written months later.

Accessing data¶

In [5]:
# access columns with a dict-like syntax
cities_df["Population"]
Out[5]:
0     852469
1    1015785
2     485199
Name: Population, dtype: int64

Rows can be accessed using Python's syntax for slicing¶

In [6]:
# list slicing: get the elements with indices 1 and 2 (but NOT 3)
cities_list = ["San Francisco", "San Jose", "Sacramento"]
cities_list[:2]
Out[6]:
['San Francisco', 'San Jose']

Unfortunately the functionality for slicing lists is not that powerful...but pandas will have many more features!

In [7]:
# slicing data frame rows is very similar!
cities_df[:2]
Out[7]:
City Name Population
0 San Francisco 852469
1 San Jose 1015785

pandas includes functionality for many different ways of selecting data. See the documentation for many more examples.

Manipulating data¶

In [8]:
cities_df["Population"].median()
Out[8]:
852469.0

NumPy is a popular toolkit for scientific computing.

pandas Series can be used as arguments to most NumPy functions:

In [9]:
import numpy as np

# calculate the median population value
np.median(cities_df["Population"])
Out[9]:
852469.0

For more complex single-column transformations, you can use Series.apply. It accepts a function that is applied to each value in the Series.

For example, we can find which cities have a population greater than a million:

In [10]:
# define our function
def get_large_cities(population):
    return population > 1e6


large_cities_sel = cities_df["Population"].apply(get_large_cities)
In [11]:
large_cities_sel
Out[11]:
0    False
1     True
2    False
Name: Population, dtype: bool
In [12]:
# add the new computed column to our original data frame
cities_df["Large Cities"] = large_cities_sel

small_cities_func = lambda population: population < 1e6

# we can also use lambda (unnamed, inline) functions
cities_df["Small Cities"] = cities_df["Population"].apply(small_cities_func)

# print out
cities_df
Out[12]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
1 San Jose 1015785 True False
2 Sacramento 485199 False True

Data selection¶

We can select the "large" cities by passing the boolean values to the .loc() function of a DataFrame:

In [13]:
cities_df["Large Cities"]
Out[13]:
0    False
1     True
2    False
Name: Large Cities, dtype: bool
In [14]:
cities_df.loc[cities_df["Large Cities"]]
Out[14]:
City Name Population Large Cities Small Cities
1 San Jose 1015785 True False

There is a faster, more concise, to achieve this!¶

In [15]:
cities_df["Population"] > 1e6
Out[15]:
0    False
1     True
2    False
Name: Population, dtype: bool
In [16]:
# this is equivalent to doing
cities_df.loc[cities_df["Population"] < 1e6]
Out[16]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
2 Sacramento 485199 False True

What about getting rows where "Large Cities" is False?¶

Use the Python tilde operator to do a logicial not operation:

In [17]:
# reverse the large cities boolean selection
cities_df.loc[~large_cities_sel]
Out[17]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
2 Sacramento 485199 False True
In [18]:
# or equivalently:
# NOTE: you need to put the whole expression in () and then apply the tilde!
cities_df.loc[~(cities_df["Population"] > 1e6)]
Out[18]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
2 Sacramento 485199 False True

An even faster way!¶

We can use the pandas query function.

The query() function will return a subset of your dataframe based on a string version of the boolean expression.

In [19]:
cities_df.query("Population < 1e6")
Out[19]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
2 Sacramento 485199 False True

How to remember the specifics of all of these functions?¶

The documentation is your best friend! Use the question mark operator!

In [20]:
# use the question mark
pd.DataFrame.loc?

Now onto a more interesting example...¶

Analyzing trends in rentals rates over time¶

Let's load census data on rentals rates from the 2019 Amercian Community Survey.

We've stored the data locally in the comma-separated value (CSV) format. Later in the course, you'll learn how to download them directly using Python.

Use the read_csv() function from pandas. The first argument to read_csv() is the file path to load.

In [21]:
census_df = pd.read_csv("./data/census/ACS_19_1YR_B25003.csv")
In [22]:
census_df.head(n=5)
Out[22]:
GEO.id GEO.display-label HD01_VD01 HD02_VD01 HD01_VD02 HD02_VD02 HD01_VD03 HD02_VD03
0 1600000US0103076 Auburn city, Alabama 26156 2687 13095 1829 13061 2108
1 1600000US0107000 Birmingham city, Alabama 93300 3618 37788 2635 55512 2788
2 1600000US0121184 Dothan city, Alabama 27103 1044 15891 824 11212 817
3 1600000US0135896 Hoover city, Alabama 32461 1606 22391 1515 10070 1496
4 1600000US0137000 Huntsville city, Alabama 88930 2823 48728 2758 40202 2550

Select certain columns and rename them.

  • Total: Total occupied housing units
  • Rentals: Renter-occupied housing units
In [23]:
## FIRST STEP: let's trim to the columns we want

# columns we want
columns = ["GEO.display-label", "HD01_VD01", "HD01_VD03"]
census_df = census_df[columns]
In [24]:
# Peak at the dataframe with trimmed columns
census_df.head()
Out[24]:
GEO.display-label HD01_VD01 HD01_VD03
0 Auburn city, Alabama 26156 13061
1 Birmingham city, Alabama 93300 55512
2 Dothan city, Alabama 27103 11212
3 Hoover city, Alabama 32461 10070
4 Huntsville city, Alabama 88930 40202
In [25]:
## STEP 2: Let's rename the columns!

# rename columns
census_df.columns = ["City", "Total", "Rentals"]

# Return the head (first 5 rows) from the cell
census_df.head(n=5)
Out[25]:
City Total Rentals
0 Auburn city, Alabama 26156 13061
1 Birmingham city, Alabama 93300 55512
2 Dothan city, Alabama 27103 11212
3 Hoover city, Alabama 32461 10070
4 Huntsville city, Alabama 88930 40202
In [26]:
# Step 3: Add the rental rate as a new column
census_df['Rental Rate'] = census_df['Rentals'] / census_df['Total'] * 100

census_df.head()
Out[26]:
City Total Rentals Rental Rate
0 Auburn city, Alabama 26156 13061 49.935005
1 Birmingham city, Alabama 93300 55512 59.498392
2 Dothan city, Alabama 27103 11212 41.368114
3 Hoover city, Alabama 32461 10070 31.021842
4 Huntsville city, Alabama 88930 40202 45.206342
In [27]:
# How many rows are in the dataframe? --> use the len() operator

len(census_df)
Out[27]:
634

We can select the largest cities by population using the Series.isin() function, which checks if each value in the Series is in the specified list.

In [28]:
# Define the list of city names we want
top_cities_list = [
    "Philadelphia city, Pennsylvania",
    "New York city, New York",
    "Los Angeles city, California",
    "Houston city, Texas",
    "Chicago city, Illinois",
    "Phoenix city, Arizona",
]
In [29]:
# Use the isin() to test whether a row value is in a city
census_df["City"].isin(top_cities_list)
Out[29]:
0      False
1      False
2      False
3      False
4      False
       ...  
629    False
630    False
631    False
632    False
633    False
Name: City, Length: 634, dtype: bool
In [30]:
# Now do the selection!
census_df.loc[census_df["City"].isin(top_cities_list)]  # select the valid rows
Out[30]:
City Total Rentals Rental Rate
19 Phoenix city, Arizona 586878 267225 45.533314
94 Los Angeles city, California 1398900 887785 63.463078
284 Chicago city, Illinois 1080345 603352 55.848086
439 New York city, New York 3211033 2188214 68.146730
487 Philadelphia city, Pennsylvania 619505 295783 47.745055
538 Houston city, Texas 876504 522942 59.662249

Only 6 rows as expected!

Alternatively, we could use the query() function. The syntax is a bit different (and I always forget and have to look at the documentation).

In the query() string, you can reference existing variables by prefacing the variable name with an "@" symbol.

For example:

In [31]:
# Get the rows where the City column is in "top_cities_list"
census_df.query("City in @top_cities_list")
Out[31]:
City Total Rentals Rental Rate
19 Phoenix city, Arizona 586878 267225 45.533314
94 Los Angeles city, California 1398900 887785 63.463078
284 Chicago city, Illinois 1080345 603352 55.848086
439 New York city, New York 3211033 2188214 68.146730
487 Philadelphia city, Pennsylvania 619505 295783 47.745055
538 Houston city, Texas 876504 522942 59.662249

Python f-strings¶

Python version 3.6 (we are using python 3.9) introduced a new and improved string formatting syntax, where the variable that is being inserted can be referenced directly using the variable name.

The variable name should be enclosed in curly braces inside the string:

In [32]:
year_str = "19"
path = f"./data/census/ACS_{year_str}_1YR_B25003.csv"
print(path) 
./data/census/ACS_19_1YR_B25003.csv

For more info: see this guide

Now let's load data from 2005 to 2019¶

Let's use a for loop to loop over each year value from 2005 to 2019

In [33]:
all_rental_data = []
start_year = 2005
stop_year = 2019
for year in range(start_year, stop_year + 1):
    
    #---------------------------------------------
    # Step 0: Print out year 
    # The variable year changes for each iteration of the loop
    #----------------------------------------------
    print(f"Loading data for year = {year}...")
    
    #---------------------------------
    # Step 1: Read data for this year
    #---------------------------------
    year_str = str(year) # convert integer value of "year" to a string
    year_str = year_str[2:] # extract the last two digits of the year string
    rental_df = pd.read_csv(f"./data/census/ACS_{year_str}_1YR_B25003.csv")
     
    #---------------------------------
    # Step 2: Columns we want
    #---------------------------------
    columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
    rental_df = rental_df[columns]

    #---------------------------------
    # Step 3: Rename columns
    #---------------------------------
    rental_df.columns = ['City', 'Total', 'Rentals']
    
    #----------------------------------
    # Step 4: Calculate the rental rate
    #----------------------------------
    rental_df['Rental Rate'] = rental_df['Rentals'] / rental_df['Total'] * 100
    
    #----------------------------------
    # Step 5: Select the cities we want
    #----------------------------------
    selection = rental_df['City'].isin(top_cities_list)
    top_cities_df = rental_df.loc[selection].copy()
    
    #----------------------------------
    # Step 6: Add a column for the year
    #----------------------------------
    top_cities_df['Year'] = year
    
    #-----------------------------------
    # Step 7: Save this dataframe in the list
    #-----------------------------------
    all_rental_data.append(top_cities_df)

#---------------------------------------------------
# Step 8: After the for loop, combine all of the 
# data frames into one along the row axis
#---------------------------------------------------
all_rental_data = pd.concat(all_rental_data, axis=0)
Loading data for year = 2005...
Loading data for year = 2006...
Loading data for year = 2007...
Loading data for year = 2008...
Loading data for year = 2009...
Loading data for year = 2010...
Loading data for year = 2011...
Loading data for year = 2012...
Loading data for year = 2013...
Loading data for year = 2014...
Loading data for year = 2015...
Loading data for year = 2016...
Loading data for year = 2017...
Loading data for year = 2018...
Loading data for year = 2019...
In [34]:
# Note: the pd.concat() took a list of dataframes and converted them 
# into a single dataframe!
type(all_rental_data)
Out[34]:
pandas.core.frame.DataFrame
In [35]:
all_rental_data.head()
Out[35]:
City Total Rentals Rental Rate Year
13 Phoenix city, Arizona 503753 204914 40.677475 2005
75 Los Angeles city, California 1284124 771325 60.066240 2005
218 Chicago city, Illinois 1020605 525620 51.500825 2005
349 New York city, New York 3026196 2023924 66.880136 2005
393 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
In [36]:
# What about the tail?
all_rental_data.tail()
Out[36]:
City Total Rentals Rental Rate Year
94 Los Angeles city, California 1398900 887785 63.463078 2019
284 Chicago city, Illinois 1080345 603352 55.848086 2019
439 New York city, New York 3211033 2188214 68.146730 2019
487 Philadelphia city, Pennsylvania 619505 295783 47.745055 2019
538 Houston city, Texas 876504 522942 59.662249 2019
In [37]:
# How many rows? Hopefully (6 cities x 15 years = 90!)
len(all_rental_data)
Out[37]:
90

Basic plotting: matplotlib¶

Much more to come next week. For now, we'll use some very simple plotting commands with Python's main plotting libary matplotlib.

Import the library we'll use from matplotlib.

In [38]:
from matplotlib import pyplot as plt

We'll use the plot() function which plots a simple x vs y line.

The strategy¶

  1. Create a figure axes to plot to
  2. Loop over each unique city
  3. Select the subset of the data for each city
  4. Plot the 'Year' vs. 'Rental Rate' columns using the plot() function for the sub-selection data frame

Note¶

To find the unique city names in the 'City' column, we can use the unique() function:

In [39]:
all_rental_data['City'].unique()
Out[39]:
array(['Phoenix city, Arizona', 'Los Angeles city, California',
       'Chicago city, Illinois', 'New York city, New York',
       'Philadelphia city, Pennsylvania', 'Houston city, Texas'],
      dtype=object)
In [40]:
with plt.style.context('ggplot'):
    
    # Step 1: Create a figure and axes to plot on
    fig, ax = plt.subplots(figsize=(10,6))
    
    # Step 2: Loop over each city and plot
    for city in all_rental_data['City'].unique():
        
        # Print out the city in each loop iteration
        print(city)
        
        # Select data for this city
        this_city = all_rental_data['City'] == city
        city_df = all_rental_data.loc[this_city]
        
        # Plot Year vs Rental Rate
        ax.plot(city_df['Year'], city_df['Rental Rate'], label=city, linewidth=4)

    # Add a legend
    ax.legend(loc=0, ncol=3, fontsize=12)

    # Set the y limit
    ax.set_ylim(30, 72)

    # Add a y label
    ax.set_ylabel("Rental Rate in Percent")
    
    # Show
    plt.show()
Phoenix city, Arizona
Los Angeles city, California
Chicago city, Illinois
New York city, New York
Philadelphia city, Pennsylvania
Houston city, Texas

Note: we used the "ggplot" theme to style our plots. Matplotlib comes with a number of built in styles: see all of the examples here.

From 2005 to 2019, Philadelphia had the third largest percent increase¶

Exercise: Can you calculate what the increase was for Philadelphia?¶

You will need to:

  1. Trim the full data set to just Philadelphia
  2. Select the rental rate for 2005
  3. Select the rental rate for 2019
  4. Calculate the percent change

1. Trim the full data set to just Philadelphia¶

In [87]:
# trim to just Philadelphia first, using this boolean selection index
all_rental_data["City"] == "Philadelphia city, Pennsylvania"
Out[87]:
13     False
75     False
218    False
349    False
393     True
       ...  
94     False
284    False
439    False
487     True
538    False
Name: City, Length: 90, dtype: bool
In [42]:
# select Philadelphia only
philly = all_rental_data.loc[
    all_rental_data["City"] == "Philadelphia city, Pennsylvania"
]

philly
Out[42]:
City Total Rentals Rental Rate Year
393 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
400 Philadelphia city, Pennsylvania 554048 231323 41.751437 2006
408 Philadelphia city, Pennsylvania 562384 239363 42.562199 2007
415 Philadelphia city, Pennsylvania 578263 253084 43.766245 2008
419 Philadelphia city, Pennsylvania 569835 258459 45.356814 2009
427 Philadelphia city, Pennsylvania 575413 264129 45.902508 2010
434 Philadelphia city, Pennsylvania 576429 265126 45.994563 2011
437 Philadelphia city, Pennsylvania 579874 277323 47.824700 2012
449 Philadelphia city, Pennsylvania 582528 285430 48.998503 2013
454 Philadelphia city, Pennsylvania 577862 277301 47.987409 2014
458 Philadelphia city, Pennsylvania 581604 275397 47.351291 2015
465 Philadelphia city, Pennsylvania 580205 277707 47.863600 2016
474 Philadelphia city, Pennsylvania 606142 309279 51.024182 2017
486 Philadelphia city, Pennsylvania 608233 287543 47.275140 2018
487 Philadelphia city, Pennsylvania 619505 295783 47.745055 2019

An aside about labels and the index in Pandas¶

Note how the first row of the above dataframe has an index label of 393. This means we can access it using the .loc[] function like:

In [90]:
philly.loc[393]
Out[90]:
City           Philadelphia city, Pennsylvania
Total                                   565433
Rentals                                 244792
Rental Rate                          43.292839
Year                                      2005
Name: 393, dtype: object

This is different than the .iloc[] function, which also indexes based on the integer value of the row, regardless of what the labels in the index are. For example, the first row of a dataframe can always be accessed using:

For more details on the iloc() function, see the documentation on indexing by position.

In [89]:
philly.iloc[0]
Out[89]:
City           Philadelphia city, Pennsylvania
Total                                   565433
Rentals                                 244792
Rental Rate                          43.292839
Year                                      2005
Name: 393, dtype: object
In [92]:
all_rental_data.loc[393]
Out[92]:
City           Philadelphia city, Pennsylvania
Total                                   565433
Rentals                                 244792
Rental Rate                          43.292839
Year                                      2005
Name: 393, dtype: object

We can reset the index labels so they range from 0 to the length of the dataframe, using the reset_index() function. For example

In [45]:
philly.reset_index(drop=True)
Out[45]:
City Total Rentals Rental Rate Year
0 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
1 Philadelphia city, Pennsylvania 554048 231323 41.751437 2006
2 Philadelphia city, Pennsylvania 562384 239363 42.562199 2007
3 Philadelphia city, Pennsylvania 578263 253084 43.766245 2008
4 Philadelphia city, Pennsylvania 569835 258459 45.356814 2009
5 Philadelphia city, Pennsylvania 575413 264129 45.902508 2010
6 Philadelphia city, Pennsylvania 576429 265126 45.994563 2011
7 Philadelphia city, Pennsylvania 579874 277323 47.824700 2012
8 Philadelphia city, Pennsylvania 582528 285430 48.998503 2013
9 Philadelphia city, Pennsylvania 577862 277301 47.987409 2014
10 Philadelphia city, Pennsylvania 581604 275397 47.351291 2015
11 Philadelphia city, Pennsylvania 580205 277707 47.863600 2016
12 Philadelphia city, Pennsylvania 606142 309279 51.024182 2017
13 Philadelphia city, Pennsylvania 608233 287543 47.275140 2018
14 Philadelphia city, Pennsylvania 619505 295783 47.745055 2019

2. Select the rental rate for 2005¶

In [101]:
# select 20
philly_2005 = philly.loc[philly['Year']==2005]


philly_2005
Out[101]:
City Total Rentals Rental Rate Year
393 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
In [102]:
# Notice b/c we have a dataframe with length of 1, we don't get back just a number
philly_2005["Rental Rate"]
Out[102]:
393    43.292839
Name: Rental Rate, dtype: float64
In [106]:
philly_2005['Rental Rate'].iloc[0]
Out[106]:
43.292839293072745
In [48]:
len(philly_2005["Rental Rate"])
Out[48]:
1

An aside: the squeeze() function¶

Notice that philly_2005 has only a single row. It's more useful if we had this a scalar (just a single number)

We can use the .squeeze() function to do this. It does just one it sounds like: if you have a DataFrame with only one row, it will "squeeze" the row dimension by removing it, returning just a Series object:

In [97]:
philly_2005 = philly.loc[philly['Year']==2005].squeeze()

philly_2005
Out[97]:
City           Philadelphia city, Pennsylvania
Total                                   565433
Rentals                                 244792
Rental Rate                          43.292839
Year                                      2005
Name: 393, dtype: object
In [98]:
value_2005 = philly_2005["Rental Rate"]

value_2005
Out[98]:
43.292839293072745

3. Select the rental rate for 2019¶

In [107]:
# Do the same thing for 2019 — make sure you squeeze!
philly_2019 = philly.loc[philly['Year']==2019].squeeze()

# Get the value
value_2019 = philly_2019['Rental Rate']

value_2019
Out[107]:
47.74505451933398

4. Calculate the percent change¶

In [108]:
# calculate percent change
percent_change = (value_2019 - value_2005)/ value_2005 * 100
In [53]:
print(f"change in rental rate in Philadelphia from 2005 to 2019 = {percent_change}")
change in rental rate in Philadelphia from 2005 to 2019 = 10.283952956103837

To be continued next time!¶

First Homework Assignment¶

Available on GitHub:

https://github.com/MUSA-550-Fall-2022/assignment-1

Two parts:

  1. Download and install Python locally on your computer. Instructions in the assignment README!
  2. Analyze Zillow ZHVI data by ZIP code in Philadelphia, submitting your Jupyter notebook.

Due date: Monday 9/19 by the start of class (7pm)

That's it!¶

  • Next week: Data Visualization Fundamentals
  • Office hours:
    • Nick: Saturdays, 10AM-12PM
    • Kristin: Tuesday/Thursday 11AM-12PM, remote via Zoom
    • Sign up for 15-minute time slots on Canvas (Zoom info in calendar invite)
  • Post questions on Piazza!
  • Email questions/concerns to nhand@design.upenn.edu
In [ ]: