Week 2: Data Visualization Fundamentals¶

Sep 12, 2022

Housekeeping¶

  • Piazza website: https://piazza.com/upenn/fall2022/musa550
  • HW #1 due one week from today (9/19)
  • Office hours:
    • Nick: 10am-12pm on Saturday
    • Kristin: Tues/Thurs from 11am-12pm
    • Sign up for time slots on the Canvas calendar

Questions / concerns?

  • Email: nhand@design.upenn.edu
  • Post questions on Piazza

Guides¶

Guides to installing Python, using conda to manage packages, and working with Jupyter notebooks are available on the course website:

  • Zero to Python
  • Conda User Guide
  • Common Conda Issues
  • Jupyter User Guide

Git & GitHub Resources¶

  • Setting up git
  • Managing files on GitHub
  • Managing files via the command line

File paths and working directories¶

A Piazza post walking through some tips for managing the folder structure on your laptop:

https://piazza.com/class/l7ewjjtnsdw53o/post/18

Reminder: following along with lectures¶

Easiest option: Binder¶


Harder option: downloading Github repository contents¶


Today's agenda¶

Part 1

  • Wrapping up last week's pandas introduction

Part 2

  • A brief overview of data visualization
  • Practical tips on color in data visualization
  • The Python landscape:
    • matplotlib
    • seaborn
    • altair

Continuing with pandas: Zillow rental and home value data¶

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
In [2]:
%matplotlib inline

Load citywide Zillow Rent Index (ZRI) and Zillow Home Value Index (ZHVI) data.

Files were downloaded from https://www.zillow.com/research/data/

In [3]:
home_values = pd.read_csv("data/zillow/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
rent_values = pd.read_csv("data/zillow/Metro_ZORI_AllHomesPlusMultifamily_SSA.csv")

Peek at the first few rows of the ZRI data:

In [4]:
rent_values.head()
Out[4]:
RegionID RegionName SizeRank 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 ... 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07
0 102001 United States 0 1356.0 1361 1367.0 1373 1378 1384 1390 ... 1712 1721 1729 1738 1747 1757 1766 1776.0 1786 1796
1 394913 New York, NY 1 2205.0 2214 2224.0 2234 2244 2254 2264 ... 2437 2433 2428 2424 2421 2418 2415 2414.0 2413 2413
2 753899 Los Angeles-Long Beach-Anaheim, CA 2 1868.0 1879 1890.0 1902 1913 1924 1935 ... 2529 2538 2546 2554 2563 2572 2581 2591.0 2601 2611
3 394463 Chicago, IL 3 1437.0 1441 1445.0 1449 1453 1456 1460 ... 1651 1653 1655 1657 1659 1662 1664 1667.0 1670 1674
4 394514 Dallas-Fort Worth, TX 4 1179.0 1182 1186.0 1190 1194 1198 1202 ... 1519 1529 1540 1551 1562 1573 1585 1597.0 1608 1620

5 rows × 94 columns

And do the same for the ZHVI data:

In [5]:
home_values.head()
Out[5]:
RegionID SizeRank RegionName RegionType StateName 2000-01-31 2000-02-29 2000-03-31 2000-04-30 2000-05-31 ... 2021-10-31 2021-11-30 2021-12-31 2022-01-31 2022-02-28 2022-03-31 2022-04-30 2022-05-31 2022-06-30 2022-07-31
0 102001 0 United States Country NaN 128193.0 128540.0 128904.0 129647.0 130396.0 ... 315164.0 319079.0 323445.0 328180.0 333400.0 338673.0 344230.0 349265.0 353446.0 355852.0
1 394913 1 New York, NY Msa NY 225281.0 226627.0 227838.0 230222.0 232442.0 ... 571576.0 575145.0 578396.0 582369.0 587022.0 593470.0 600621.0 608159.0 614185.0 618517.0
2 753899 2 Los Angeles-Long Beach-Anaheim, CA Msa CA 228935.0 229732.0 230953.0 233274.0 235763.0 ... 842992.0 850126.0 859335.0 870012.0 882230.0 899552.0 918626.0 935074.0 936406.0 933282.0
3 394463 3 Chicago, IL Msa IL 170843.0 171246.0 171768.0 172812.0 173918.0 ... 287525.0 290233.0 292966.0 295614.0 298193.0 301504.0 304988.0 308971.0 311970.0 313828.0
4 394514 4 Dallas-Fort Worth, TX Msa TX 130922.0 131027.0 131114.0 131327.0 131550.0 ... 334386.0 339874.0 345774.0 353216.0 361748.0 369869.0 379389.0 388117.0 395694.0 397905.0

5 rows × 276 columns

Selecting the cities we want¶

In [6]:
valid_cities = [
    "New York, NY",
    "Chicago, IL",
    "Los Angeles-Long Beach-Anaheim, CA",
    "Philadelphia, PA",
    "Houston, TX",
    "Phoenix, AZ",
]
In [7]:
selection = home_values['RegionName'].isin(valid_cities)
home_values_trimmed = home_values.loc[selection]
In [8]:
selection = rent_values['RegionName'].isin(valid_cities)
rent_values_trimmed = rent_values.loc[selection]
In [9]:
rent_values_trimmed
Out[9]:
RegionID RegionName SizeRank 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 ... 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07
1 394913 New York, NY 1 2205.0 2214 2224.0 2234 2244 2254 2264 ... 2437 2433 2428 2424 2421 2418 2415 2414.0 2413 2413
2 753899 Los Angeles-Long Beach-Anaheim, CA 2 1868.0 1879 1890.0 1902 1913 1924 1935 ... 2529 2538 2546 2554 2563 2572 2581 2591.0 2601 2611
3 394463 Chicago, IL 3 1437.0 1441 1445.0 1449 1453 1456 1460 ... 1651 1653 1655 1657 1659 1662 1664 1667.0 1670 1674
5 394974 Philadelphia, PA 5 1456.0 1458 1459.0 1461 1463 1465 1467 ... 1723 1729 1735 1741 1748 1754 1761 1768.0 1774 1781
6 394692 Houston, TX 6 1135.0 1142 1149.0 1155 1161 1168 1174 ... 1319 1325 1331 1336 1342 1348 1354 1361.0 1368 1374
14 394976 Phoenix, AZ 14 997.0 1001 1005.0 1009 1013 1017 1021 ... 1530 1551 1573 1595 1617 1640 1662 1686.0 1709 1732

6 rows × 94 columns

Removing unwanted columns¶

Unwanted columns can be dropped from the data frame using the drop() function.

Note that the column axis is the second axis (axis=1), and if you wanted to remove rows, you could use the first axis (axis=0).
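Side note: recent versions of pandas also accept a columns= keyword, which avoids remembering the axis number. A quick sketch on a made-up data frame:

# A small example frame (hypothetical values, for illustration only)
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

df.drop(["B"], axis=1)   # axis-based syntax, as used below
df.drop(columns=["B"])   # equivalent keyword syntax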

In [10]:
columns = ['SizeRank', 'RegionID', 'RegionType', 'StateName']
home_values_final = home_values_trimmed.drop(columns, axis=1)
In [11]:
columns = ['SizeRank', 'RegionID']
rent_values_final = rent_values_trimmed.drop(columns, axis=1)
In [12]:
rent_values_final
Out[12]:
RegionName 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08 2014-09 ... 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07
1 New York, NY 2205.0 2214 2224.0 2234 2244 2254 2264 2273 2283 ... 2437 2433 2428 2424 2421 2418 2415 2414.0 2413 2413
2 Los Angeles-Long Beach-Anaheim, CA 1868.0 1879 1890.0 1902 1913 1924 1935 1947 1958 ... 2529 2538 2546 2554 2563 2572 2581 2591.0 2601 2611
3 Chicago, IL 1437.0 1441 1445.0 1449 1453 1456 1460 1463 1467 ... 1651 1653 1655 1657 1659 1662 1664 1667.0 1670 1674
5 Philadelphia, PA 1456.0 1458 1459.0 1461 1463 1465 1467 1469 1471 ... 1723 1729 1735 1741 1748 1754 1761 1768.0 1774 1781
6 Houston, TX 1135.0 1142 1149.0 1155 1161 1168 1174 1180 1186 ... 1319 1325 1331 1336 1342 1348 1354 1361.0 1368 1374
14 Phoenix, AZ 997.0 1001 1005.0 1009 1013 1017 1021 1025 1030 ... 1530 1551 1573 1595 1617 1640 1662 1686.0 1709 1732

6 rows × 92 columns

Wide vs long format for datasets¶

Currently, our data is in wide format $\rightarrow$ each monthly observation has its own column. This usually results in many columns but few rows.

In [13]:
home_values_final
Out[13]:
RegionName 2000-01-31 2000-02-29 2000-03-31 2000-04-30 2000-05-31 2000-06-30 2000-07-31 2000-08-31 2000-09-30 ... 2021-10-31 2021-11-30 2021-12-31 2022-01-31 2022-02-28 2022-03-31 2022-04-30 2022-05-31 2022-06-30 2022-07-31
1 New York, NY 225281.0 226627.0 227838.0 230222.0 232442.0 234805.0 237288.0 239859.0 242437.0 ... 571576.0 575145.0 578396.0 582369.0 587022.0 593470.0 600621.0 608159.0 614185.0 618517.0
2 Los Angeles-Long Beach-Anaheim, CA 228935.0 229732.0 230953.0 233274.0 235763.0 238114.0 240521.0 243047.0 245453.0 ... 842992.0 850126.0 859335.0 870012.0 882230.0 899552.0 918626.0 935074.0 936406.0 933282.0
3 Chicago, IL 170843.0 171246.0 171768.0 172812.0 173918.0 175015.0 176228.0 177589.0 178995.0 ... 287525.0 290233.0 292966.0 295614.0 298193.0 301504.0 304988.0 308971.0 311970.0 313828.0
5 Philadelphia, PA 130359.0 130796.0 131058.0 131710.0 132238.0 132835.0 133383.0 133967.0 134615.0 ... 310909.0 313401.0 315690.0 318248.0 320804.0 324066.0 327757.0 332095.0 335593.0 338231.0
6 Houston, TX 126394.0 126472.0 126353.0 126472.0 126479.0 126681.0 126811.0 127090.0 127446.0 ... 274426.0 277596.0 281019.0 284479.0 288795.0 294155.0 300160.0 305742.0 309933.0 312533.0
14 Phoenix, AZ 142839.0 143215.0 143669.0 144561.0 145451.0 146126.0 146849.0 147535.0 148362.0 ... 418635.0 424834.0 430947.0 436727.0 445371.0 454535.0 465962.0 474655.0 480985.0 478985.0

6 rows × 272 columns

Usually it's better to have data in tidy (also known as long) format.

Tidy datasets are arranged such that each variable is a column and each observation is a row.

In our case, we want to have a column called ZRI and one called ZHVI and a row for each month that the indices were measured.

pandas provides the melt() function for converting from wide formats to tidy formats.

melt() doesn’t aggregate or summarize the data. It transforms it into a different shape, but it contains the exact same information as before.

Imagine you have 6 rows of data (each row is a unique city) with 10 columns of home values (each column is a different month). That is wide data and is the format usually seen in spreadsheets or tables in a report.

If you melt() that wide data, you would get a table with 60 rows and 3 columns. Each row would contain the city name, the month, and the home value for that city and month. This tidy-formatted data contains the same info as the wide data, but in a different form.
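As a minimal sketch of the same idea (with made-up rent values, not the Zillow data):

# A tiny wide table: one row per city, one column per month (made-up values)
wide = pd.DataFrame({
    "RegionName": ["Philadelphia, PA", "Chicago, IL"],
    "2021-01": [1700, 1650],
    "2021-02": [1710, 1655],
})

# Melt to tidy format: one row per (city, month) pair
tidy = pd.melt(wide, id_vars=["RegionName"], var_name="Date", value_name="ZRI")

# tidy now has 4 rows and 3 columns: RegionName, Date, ZRI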

(Animation: converting between wide and long formats with gather() and spread())

This animation shows the transformation from wide to long / long to wide. You can ignore gather() and spread() - those are the R versions of the pandas functions.

In [14]:
pd.melt?

Now, let's melt our datasets:

In [15]:
ZHVI = pd.melt(
    home_values_final, 
    id_vars=["RegionName"], 
    value_name="ZHVI", 
    var_name="Date"
)
ZRI = pd.melt(
    rent_values_final, 
    id_vars=["RegionName"], 
    value_name="ZRI", 
    var_name="Date"
)

and take a look:

In [16]:
ZRI.tail()
Out[16]:
RegionName Date ZRI
541 Los Angeles-Long Beach-Anaheim, CA 2021-07 2611.0
542 Chicago, IL 2021-07 1674.0
543 Philadelphia, PA 2021-07 1781.0
544 Houston, TX 2021-07 1374.0
545 Phoenix, AZ 2021-07 1732.0
In [17]:
ZHVI.head()
Out[17]:
RegionName Date ZHVI
0 New York, NY 2000-01-31 225281.0
1 Los Angeles-Long Beach-Anaheim, CA 2000-01-31 228935.0
2 Chicago, IL 2000-01-31 170843.0
3 Philadelphia, PA 2000-01-31 130359.0
4 Houston, TX 2000-01-31 126394.0

Merging data frames¶

Another common operation is merging, also known as joining, two datasets.

We can use the merge() function to merge observations that have the same Date and RegionName values.

But first! Our date string formats don't match!¶

  • ZHVI has the Date column in the format of YYYY-MM-DD
  • ZRI has the Date column in the format of YYYY-MM

We need to put them into the same format before merging the data!

We can fix this by creating Datetime objects and formatting the dates into the same format.

Datetime objects¶

Currently our Date column is stored as a string.

pandas includes additional functionality for dates, but first we must convert the strings using the to_datetime() function.

In [18]:
# Convert the Date column to Datetime objects
ZHVI["Date"] = pd.to_datetime(ZHVI["Date"])

The strftime function¶

We can use the ".dt" property of the Date column to access datetime functions of the new Datetime column.

For converting to strings in a certain format, we can use the "strftime" function (docs). This uses a special syntax to convert the date object to a string with a specific format.

Important reference: Use this strftime guide to look up the syntax!
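For example, a few common directives applied to a single timestamp (a quick sketch, separate from the lecture code):

# Hypothetical timestamp, for illustration only
ts = pd.Timestamp("2022-09-12")

ts.strftime("%Y-%m")      # '2022-09'
ts.strftime("%Y-%m-%d")   # '2022-09-12'
ts.strftime("%B %d, %Y")  # 'September 12, 2022'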

In [19]:
# Extract YYYY-MM string
date_strings = ZHVI["Date"].dt.strftime("%Y-%m")
In [20]:
# First entry is a string!
date_strings.iloc[0]
Out[20]:
'2000-01'
In [21]:
# Add the strings back as a column
ZHVI["Date"] = date_strings
In [22]:
ZHVI.head()
Out[22]:
RegionName Date ZHVI
0 New York, NY 2000-01 225281.0
1 Los Angeles-Long Beach-Anaheim, CA 2000-01 228935.0
2 Chicago, IL 2000-01 170843.0
3 Philadelphia, PA 2000-01 130359.0
4 Houston, TX 2000-01 126394.0

Now we can merge!¶

In [23]:
# Left dataframe is ZRI
# Right dataframe is ZHVI

zillow_data = pd.merge(ZRI, ZHVI, on=['Date', 'RegionName'], how='outer')
In [24]:
# Let's sort the data by Date
zillow_data = zillow_data.sort_values("Date", ascending=True)

zillow_data
Out[24]:
RegionName Date ZRI ZHVI
547 Los Angeles-Long Beach-Anaheim, CA 2000-01 NaN 228935.0
548 Chicago, IL 2000-01 NaN 170843.0
549 Philadelphia, PA 2000-01 NaN 130359.0
550 Houston, TX 2000-01 NaN 126394.0
551 Phoenix, AZ 2000-01 NaN 142839.0
... ... ... ... ...
1622 Chicago, IL 2022-07 NaN 313828.0
1624 Houston, TX 2022-07 NaN 312533.0
1620 New York, NY 2022-07 NaN 618517.0
1621 Los Angeles-Long Beach-Anaheim, CA 2022-07 NaN 933282.0
1625 Phoenix, AZ 2022-07 NaN 478985.0

1626 rows × 4 columns

Merging is very powerful and can be done in a number of ways. In this case, we did an outer merge in order to keep all parts of each dataframe. By contrast, an inner merge only keeps the overlapping intersection of the merge.

See the infographic on joining in this repository.
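A minimal sketch of the difference between the two, using two tiny made-up frames:

# Toy example: the frames share only the Date "2014-02"
left = pd.DataFrame({"Date": ["2014-01", "2014-02"], "ZRI": [1456, 1458]})
right = pd.DataFrame({"Date": ["2014-02", "2014-03"], "ZHVI": [200000, 201000]})

# Inner merge: only Dates present in BOTH frames (1 row)
pd.merge(left, right, on="Date", how="inner")

# Outer merge: all Dates from either frame, with NaN filling the gaps (3 rows)
pd.merge(left, right, on="Date", how="outer")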

In [25]:
# Convert the Date column back to a Datetime
zillow_data["Date"] = pd.to_datetime(zillow_data["Date"])

Quick trick: Series that hold Datetime objects have a dt attribute that lets you grab parts of the date easily.

For example, we can easily add new columns for the month and year using:

In [26]:
# Note that the dtype is now datetime64[ns]
zillow_data['Date'].head()
Out[26]:
547   2000-01-01
548   2000-01-01
549   2000-01-01
550   2000-01-01
551   2000-01-01
Name: Date, dtype: datetime64[ns]
In [27]:
# Extract out the month and year of each date
# Add them to the data frame as new columns!
zillow_data['Month'] = zillow_data['Date'].dt.month
zillow_data['Year'] = zillow_data['Date'].dt.year 
In [28]:
zillow_data.head()
Out[28]:
RegionName Date ZRI ZHVI Month Year
547 Los Angeles-Long Beach-Anaheim, CA 2000-01-01 NaN 228935.0 1 2000
548 Chicago, IL 2000-01-01 NaN 170843.0 1 2000
549 Philadelphia, PA 2000-01-01 NaN 130359.0 1 2000
550 Houston, TX 2000-01-01 NaN 126394.0 1 2000
551 Phoenix, AZ 2000-01-01 NaN 142839.0 1 2000

Annual trends: grouping by Year¶

pandas is especially useful for grouping and aggregating data via the groupby() function.

From the pandas documentation, groupby means:

  • Splitting the data into groups based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.

The documentation is available here.

We can calculate annual averages for each year by grouping by the RegionName and Year columns and taking the mean of our desired column. For example:

In [29]:
# calculate mean values for each Year and City (RegionName)
annual_ZHVI = zillow_data.groupby(['RegionName', 'Year'])['ZHVI'].mean() 
annual_ZRI = zillow_data.groupby(['RegionName', 'Year'])['ZRI'].mean()
In [30]:
print(type(annual_ZHVI))
<class 'pandas.core.series.Series'>
In [31]:
annual_ZHVI.head()
Out[31]:
RegionName   Year
Chicago, IL  2000    176107.833333
             2001    190745.250000
             2002    205350.500000
             2003    220379.916667
             2004    237042.833333
Name: ZHVI, dtype: float64

Important: The result of the groupby operation is always indexed by the group keys!

In this case, the result is indexed by the columns we grouped by (RegionName and Year).

We can reset the index so that the index values are listed as columns in the data frame again.

In [32]:
annual_ZHVI = annual_ZHVI.reset_index()
annual_ZRI = annual_ZRI.reset_index()
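Side note: you can skip the reset_index() step entirely by passing as_index=False to groupby(). A sketch of the equivalent call for the ZHVI averages:

# Keep the group keys as regular columns from the start
annual_ZHVI = (
    zillow_data
    .groupby(["RegionName", "Year"], as_index=False)["ZHVI"]
    .mean()
)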
In [33]:
annual_ZHVI.head(n=50)
Out[33]:
RegionName Year ZHVI
0 Chicago, IL 2000 176107.833333
1 Chicago, IL 2001 190745.250000
2 Chicago, IL 2002 205350.500000
3 Chicago, IL 2003 220379.916667
4 Chicago, IL 2004 237042.833333
5 Chicago, IL 2005 257666.000000
6 Chicago, IL 2006 274795.083333
7 Chicago, IL 2007 276928.083333
8 Chicago, IL 2008 257053.444444
9 Chicago, IL 2009 227865.000000
10 Chicago, IL 2010 211839.916667
11 Chicago, IL 2011 192180.166667
12 Chicago, IL 2012 179569.166667
13 Chicago, IL 2013 187374.583333
14 Chicago, IL 2014 201120.666667
15 Chicago, IL 2015 207530.833333
16 Chicago, IL 2016 215485.250000
17 Chicago, IL 2017 227410.000000
18 Chicago, IL 2018 237385.750000
19 Chicago, IL 2019 242427.250000
20 Chicago, IL 2020 247735.833333
21 Chicago, IL 2021 275805.416667
22 Chicago, IL 2022 305009.714286
23 Houston, TX 2000 127062.916667
24 Houston, TX 2001 128605.916667
25 Houston, TX 2002 131883.500000
26 Houston, TX 2003 136163.416667
27 Houston, TX 2004 144030.636364
28 Houston, TX 2005 150617.750000
29 Houston, TX 2006 153528.666667
30 Houston, TX 2007 158566.000000
31 Houston, TX 2008 157503.916667
32 Houston, TX 2009 153664.083333
33 Houston, TX 2010 153315.583333
34 Houston, TX 2011 146821.000000
35 Houston, TX 2012 146639.583333
36 Houston, TX 2013 157032.166667
37 Houston, TX 2014 172792.250000
38 Houston, TX 2015 188334.416667
39 Houston, TX 2016 197249.750000
40 Houston, TX 2017 204009.333333
41 Houston, TX 2018 211769.333333
42 Houston, TX 2019 219758.083333
43 Houston, TX 2020 227084.250000
44 Houston, TX 2021 257929.500000
45 Houston, TX 2022 299399.571429
46 Los Angeles-Long Beach-Anaheim, CA 2000 239580.833333
47 Los Angeles-Long Beach-Anaheim, CA 2001 266247.333333
48 Los Angeles-Long Beach-Anaheim, CA 2002 303152.750000
49 Los Angeles-Long Beach-Anaheim, CA 2003 361282.083333

Plotting our results: ZHVI¶

In [34]:
with plt.style.context("ggplot"):

    # Create figure and axes
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # Plot for each unique city
    for city in annual_ZHVI["RegionName"].unique():
        
        # select the data for this city
        selection = annual_ZHVI["RegionName"] == city
        df = annual_ZHVI.loc[selection]

        # plot
        ax.plot(df["Year"], df["ZHVI"] / 1e3, label=city, linewidth=4)

    
    # Format the axes
    ax.set_ylim(50, 800)
    ax.legend(loc=0, ncol=2, fontsize=12)
    ax.set_ylabel("Zillow Home Value Index\n(in thousands of dollars)")

Home values in Philadelphia have only recently recovered to pre-2008 levels¶

Plotting the results: Zillow Rent Index¶

In [35]:
with plt.style.context('ggplot'):
    
    # Create the figure and axes
    fig, ax = plt.subplots(figsize=(10,6))
    
    # Loop over the cities to plot each one
    for city in annual_ZRI['RegionName'].unique():
        
        # Select the city data
        selection = annual_ZRI['RegionName'] == city
        df = annual_ZRI.loc[selection]
        
        # Plot
        ax.plot(df['Year'], df['ZRI'], label=city, linewidth=4)
    
    # Format
    ax.set_ylim(1000, 3300)
    ax.legend(loc=0, ncol=2)
    ax.set_ylabel('Zillow Rent Index (in dollars)')

Rent prices in Philadelphia have remained relatively flat, relative to other large cities.¶

Week #2: Data Visualization Fundamentals¶

  • Week #2 repository: https://github.com/MUSA-550-Fall-2022/week-2
  • Recommended readings for the week listed here

A brief history¶

Starting with two of my favorite historical examples, and their modern renditions...

Example 1: the pioneering work of W. E. B. Du Bois¶


Re-making the Du Bois Spiral with census data¶

The demographics of whites in seven states¶

Green is urban, blue suburban, yellow small town, red rural. Source

Additional references¶

  • A 6-part blog series on W. E. B. Du Bois' Data Visualization
  • An article on historical data visualization
  • W. E. B. Du Bois’s Data Portraits: Visualizing Black America

Example 2: the Statistical Atlas of the United States¶

  • First census: 1790
  • First map for the census: 1850
  • First Statistical Atlas: 1870
  • Largely discontinued after 1890, except for the 2000 Census Atlas

Using modern data¶

See http://projects.flowingdata.com/atlas, by Nathan Yau

Industry and Earnings by Sex¶

Source: American Community Survey, 5-Year, 2009-2013

Median Household Income¶

Many more examples...¶

More recently...¶

Two main movements:¶

  • 1st wave: clarity
  • 2nd wave: the grammar of visualization

Wave 1: Clarity¶

  • Pioneered by Edward Tufte and his release of The Visual Display of Quantitative Information in 1983
  • Focuses on clarity, simplicity, and plain color schemes
  • Charts should be immediately accessible and readable

The idea of "Chartjunk"¶

  • Coined by Tufte in Visual Display
  • Any unnecessary information on a chart

An extreme example¶

Wave 2: the grammar of visualization¶

  • Influenced by The Grammar of Graphics by Leland Wilkinson in 1999
  • Focuses on encoding data via channels onto geometry
  • Mapping data attributes onto graphical channels, e.g., length, angle, color, or position (or any other graphical channel)
  • Less focus on clarity, more on the encoding system
  • Leads to many, many (perhaps confusing) ways of visualizing data
  • ggplot2 provides an R implementation of The Grammar of Graphics
  • A few different Python libraries available

Where are we now?¶

  • Both movements converging together
  • More visualization libraries available now than ever

A survey of common tools¶

  • From a 2017 survey by Elijah Meeks
  • Meeks is a data visualization engineer (Apple, Netflix) and an excellent data viz resource
  • Find him on Twitter or Medium: @Elijah_Meeks
  • Executive director of the Data Visualization Society:
    • A community-based data viz organization
    • Great resources for beginners
    • Check out Nightingale: the Data Visualization Society's blog

The 7 kinds of data viz people¶

  • From this blog post
  • Illustrations by Susie Lu

See, e.g. Data Sketches

Data visualization as communication¶

  • Data visualization is primarily a communication and design problem, not a technical one
  • Two main modes:
    • Fast: quickly understood or quickly made (or both!)
    • Slow: more advanced, focus on design, takes longer to understand and/or longer to make

Fast visualization¶

  • Classic trope: a report for busy executives created by subject experts $\rightarrow$ as clear and simplified as possible
  • Leads readers to think that if the chart is not immediately understood then it must be a failure
  • The dominant method of data visualization
    • E.g., Create a Tableau Dashboard in Just 10 Minutes

Moving beyond fast visualizations¶

  • Thinking about what charts say, beyond what is immediately clear
  • Focusing on colors, design choices

Example: Fatalities in the Iraq War¶

by Simon Scarr in 2011

What design choices drive home the implicit message?¶

Data Visualization as Storytelling¶

The same data, but different design choices...

A negative portrayal¶

A positive portrayal¶

Design choices matter & data viz has never been more important¶

Some recent examples...


  • Data Viz's Breakthrough Moment in the COVID-19 Crisis
  • Interview with John Burn-Murdoch About his COVID Data Viz
  • John Burn-Murdoch's Twitter
  • COVID-19 Data Viz from the Financial Times
    • Coronavirus Tracker
    • Global Economic Recovery Tracker

Data Viz Style Guides¶

Lots of companies, cities, institutions, etc. have started publishing design guidelines to improve and standardize their data visualizations.

One I particularly like: City of London Data Design Guidelines

First few pages are listed in the "Recommended Reading" portion of this week's README.

London's style guide covers some basic data viz principles that everyone should know, including the following example:

(Example pages from the City of London Data Design Guidelines)

Good rules¶

  • Less is more — minimize "chartjunk"
  • Don't use legends if you can label directly
  • Use color / line weight to focus the reader on the data you want to emphasize
  • Don't make the viewer tilt their head

  • Use titles/subtitles to explain what is being plotted

Now onto colors...¶

Choose your colors carefully (a short matplotlib sketch follows this list):

  • Sequential schemes: for continuous data that progresses from low to high
  • Diverging schemes: for continuous data that emphasizes positive or negative deviations from a central value
  • Qualitative schemes: for data that has no inherent ordering, where color is used only to distinguish categories
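matplotlib ships with built-in colormaps for each of these scheme types. A quick sketch (the colormap names are real; the scatter call is just a hypothetical usage):

sequential = plt.get_cmap("Blues")   # low -> high
diverging = plt.get_cmap("RdBu")     # deviations from a midpoint
qualitative = plt.get_cmap("tab10")  # unordered categories

# Typically you just pass the name when plotting, e.g.:
# df.plot.scatter(x="Year", y="ZRI", c="ZHVI", cmap="viridis")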

ColorBrewer 2.0¶

  • The classic tool for color selection
  • Handles all three types of color schemes and provides a map-based visualization
  • Provides explanations from Cynthia Brewer's published works on color theory
  • Tests whether colors are colorblind safe, printer friendly, and photocopy safe
  • ColorBrewer palettes are included by default in matplotlib

See: http://colorbrewer2.org

Perceptually uniform color maps¶

  • Created for matplotlib and available by default
  • perceptually uniform: equal steps in data are perceived as equal steps in the color space
  • robust to color blindness
  • colorful and beautiful

For quantitative data, these color maps are very strong options

Need more colors?¶

Almost too many tools available...

Some of my favorites¶

  • Adobe Color CC: allows you to explore other people's color palettes and create new ones
  • Paletton: similar to Adobe Color, but slightly more advanced
  • Chroma.js Color Scale Helper: create color palettes by interpolating between named HTML colors
  • Colorpicker for data: automatically generate new color palettes, but they aren't always useful

Making sure your colors work: Viz Palette¶

Wrapping up: some good rules to live by¶

  • Optimize your color map for your dataset
  • Think about who your audience is
  • Avoid palettes with too many colors: ColorBrewer stops at ~9 for a reason
  • Maintain a theme and make it pretty
  • Think about how color interacts with the other parts of the visualization


Disclaimer¶

  • In my experience, the pandas plotting capabilities are good for quick and unpolished plots during the data exploration phase (see the sketch below)
  • Most of the pandas plotting functions serve as shortcuts, removing some boilerplate matplotlib code
  • If I'm trying to make a polished, clean data visualization, I'll usually opt to use matplotlib from the beginning
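For instance, a one-line exploratory chart (a minimal sketch with made-up numbers, not the Zillow data):

# Hypothetical annual values, for illustration only
df = pd.DataFrame(
    {"Year": [2019, 2020, 2021], "ZHVI": [242427, 247736, 275805]}
).set_index("Year")

# Quick and unpolished: pandas wraps the matplotlib boilerplate for you
df.plot(ylabel="ZHVI (dollars)")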

That's it!¶

  • See you on Wednesday when we continue Data Viz Fundamentals