Aug. 31, 2022
My team covers a range of policy issues in the city:
--> Check out https://controller.phila.gov/policy-analysis
Goal: exploring and extracting insight from complex datasets
Course has four websites (sorry!). They are:
Each will have its own purpose:
Exploratory Data Science: Students will be introduced to the main tools needed to get started analyzing and visualizing data using Python
Introduction to Geospatial Data Science: Building on the previous set of tools, this module will teach students how to work with geospatial datasets using a range of modern Python toolkits.
Data Ingestion & Big Data: Students will learn how to collect new data through web scraping and APIs, as well as how to work effectively with the large datasets often encountered in real-world applications.
Geospatial Data Science in the Wild: Armed with the necessary data science tools, students will be introduced to a range of advanced analytic and machine learning techniques using a number of innovative examples from modern researchers.
From Exploration to Storytelling: The final module will teach students to present their analysis results using web-based formats to transform their insights into interactive stories.
Homeworks will be assigned (roughly) every two weeks. You must complete five of the seven homework assignments. Four of the assignments are required, and you are allowed to choose the last assignment to complete (out of the remaining three options).
The final project is to replicate the pipeline approach on a dataset (or datasets) of your choosing.
Students will be required to use several of the analysis techniques taught in the class and produce a web-based data visualization that effectively communicates the empirical results to a non-technical audience.
More info will be posted here: https://github.com/MUSA-550-Fall-2022/final-project
https://www.surveymonkey.com/r/TPTM6J3
Very versatile: good for both exploratory data analysis and polished finished products
The official documentation for the Jupyter notebook is a good intro to the basics of the notebook:
Links available on the week-1 GitHub repository as well
See https://colab.research.google.com/notebooks/welcome.ipynb
Note: as a free service, it can be a bit slow sometimes
To follow along today, go to https://github.com/MUSA-550-Fall-2022/week-1
These slides are a Jupyter notebook.
A mix of code cells and text cells in Markdown. Change the type of cell in the top menu bar.
# Comments begin with a "#" character in Python
# A simple code cell
# SHIFT-ENTER to execute
x = 10
print(x)
10
# integer
a = 10
# float
b = 10.5
# string
c = "this is a test string"
# lists
d = list(range(0, 10))
# booleans
e = True
# dictionaries
f = {"key1": 1, "key2": 2}
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
10
10.5
this is a test string
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
True
{'key1': 1, 'key2': 2}
Note: unlike R, you'll need to use quotes more often in Python, particularly around strings and keys of dictionaries
f = dict(key1 = 1, key2=2, key3=3)
f
{'key1': 1, 'key2': 2, 'key3': 3}
# access the value with key 'key1'
f['key1']
1
d
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# access the second list entry (0 is the first index)
d[1]
1
c
'this is a test string'
# the first character
c[0]
't'
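Lists and strings share the same zero-based indexing. A minimal sketch that extends the cells above with negative indices and slices; the result names (last_item, first_three, first_word) are illustrative only:

```python
d = list(range(0, 10))
c = "this is a test string"

# Index 0 is the first entry; index -1 counts back from the end
first_item = d[0]
last_item = d[-1]

# A slice [start:stop] includes start but excludes stop
first_three = d[:3]
first_word = c[:4]

print(first_item, last_item)
print(first_three)
print(first_word)
```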
# Python code
result = 0
for i in range(10):
    print(i)
    result = result + i
0 1 2 3 4 5 6 7 8 9
print(result)
45
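The loop above accumulates a running total by hand. As a small sketch, the same total can also be computed with Python's built-in sum(); the names loop_total and builtin_total are illustrative only:

```python
# Accumulate the total with an explicit loop
loop_total = 0
for i in range(10):
    loop_total = loop_total + i

# The built-in sum() adds up the items of any iterable of numbers
builtin_total = sum(range(10))

print(loop_total, builtin_total)
```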
a = range(10) # this is a lazy range object (an iterable), not a list
print(a)
range(0, 10)
# convert it to a list explicitly
a = list(range(10))
print(a)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# or use a list comprehension (the INLINE syntax); the result is the SAME
a = [i for i in range(10)]
print(a)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
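The inline syntax is usually called a list comprehension. The sketch below, using made-up variable names, shows how a comprehension maps onto an explicit loop and how an optional trailing "if" filters items:

```python
# Build a list of squares with an explicit loop
squares_loop = []
for i in range(5):
    squares_loop.append(i * i)

# The equivalent list comprehension
squares_comp = [i * i for i in range(5)]

# A trailing "if" keeps only the items that pass the test
even_numbers = [i for i in range(10) if i % 2 == 0]

print(squares_loop)
print(squares_comp)
print(even_numbers)
```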
def function_name(arg1, arg2, arg3):
    .
    .
    .
    code lines (indented)
    .
    .
    .
    return result
def compute_square(x):
    return x * x
sq = compute_square(5)
print(sq)
25
def compute_product(x, y=5):
    return x * y
# use the default value for y
print(compute_product(5))
25
# specify a y value other than the default
print(compute_product(5, 10))
50
# can also explicitly tell Python which arguments are which
print(compute_product(5, y=2))
print(compute_product(y=2, x=5))
10
10
print(compute_product(x=5, y=4))
20
# argument names must match the function signature though!
print(compute_product(x=5, z=5))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [39], in <cell line: 2>()
      1 # argument names must match the function signature though!
----> 2 print(compute_product(x=5, z=5))

TypeError: compute_product() got an unexpected keyword argument 'z'
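An uncaught error like this stops the cell. As a hedged sketch, the same mistake can be caught with try/except so the rest of the cell keeps running; the variable name message is illustrative:

```python
def compute_product(x, y=5):
    return x * y

# Keyword arguments can be given in any order
same_result = compute_product(y=2, x=5) == compute_product(5, 2)
print(same_result)

# An unknown keyword raises a TypeError, which we can catch
try:
    compute_product(x=5, z=5)
except TypeError as err:
    message = str(err)
    print("Caught:", message)
```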
Use tab auto-completion and the ? and ?? operators
this_variable_has_a_long_name = 5
# try hitting tab after typing this_
this_variable_has_a_long_name
5
# Forget how to create a range? --> use the help message
range?
Use the ?? operator
# Let's re-define compute_product() and add a docstring between """ """
def compute_product(x, y=5):
    """
    This computes the product of x and y

    This is all part of the docstring.
    """
    return x * y
compute_product??
The question mark operator gives you access to the help message for any variable or function. I use it frequently, and it is the main way I figure out what a function does.
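The ? and ?? operators only work in IPython and Jupyter. Outside the notebook, roughly the same information is available from the standard library. A minimal sketch using the built-in help() and the inspect module; the variable names signature and doc are illustrative:

```python
import inspect

def compute_product(x, y=5):
    """
    This computes the product of x and y
    """
    return x * y

# The call signature, including default values
signature = str(inspect.signature(compute_product))
print(signature)

# The docstring with its indentation cleaned up
doc = inspect.getdoc(compute_product)
print(doc)

# help() prints the signature and the docstring together
help(compute_product)
```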
This was a very brief introduction. Additional Python tutorials are listed on our course website under "Resources"
https://musa-550-fall-2022.github.io/resources/python/
Recommended tutorial for students with little Python background:
There are also a few good resources from the Berkeley Data Science Institute:
The Python Data Science Handbook is a free, online textbook covering the Python basics needed in this course. In particular, the first four chapters are excellent:
Note that you can click on the "Open in Colab" button for each chapter and run the examples interactively using Google Colab.
In this class, we will almost exclusively work inside Jupyter notebooks — you'll be writing Python code and doing data analysis directly in the notebook.
The more traditional method of using Python is to put your code into a .py file and execute it via the command line (known as the Anaconda Prompt on Windows or the Terminal app on macOS).
See this section of the Practical Python Programming tutorial for more info.
There is a file called hello_world.py in the repository for week 1. If we execute it, it should print out "Hello, World" to the command line.
Let's try it out.
You can run terminal commands directly in the Jupyter notebook's "code" cell by starting the line with a "!"
To list all of the files in the current folder (the "current working directory"), use the ls command:
! ls
We see the hello_world.py file listed. Now let's execute it on the command line by using the python command:
# We can run the same code right in the browser!
print("Hello World!")
! python hello_world.py
Success!
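The contents of hello_world.py are not shown in these slides. As a purely hypothetical sketch, a script like it might be organized as below; the main() function and the exact message are assumptions, not the actual file:

```python
# Hypothetical contents of a script such as hello_world.py
def main():
    message = "Hello, World"
    print(message)
    return message

# This guard calls main() only when the file is run directly
# (for example with "python hello_world.py"), not when it is imported
if __name__ == "__main__":
    main()
```

Placing the work inside a function behind this guard lets the same file be run from the command line or imported into a notebook without side effects.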
When writing software outside the notebook, it's useful to have an application known as a "code editor". A code editor provides a convenient interface for writing Python code, and some include extra features such as real-time syntax checking and syntax highlighting.
My recommended option is Visual Studio Code.