Python is a widely used, general-purpose, high-level programming language. Python programs are structured (code blocks are delimited) through indentation rather than the curly braces that are common in languages such as C++, C# and Java. This means that Python code must be properly indented, not only for readability but also for correct execution.
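As a small illustration of how indentation decides which statements belong to a block, the sketch below runs the same statement once inside and once outside a loop body. The variable names are made up for this example.

```python
# Indentation decides which statements belong to the loop body.
total_inside = 0
for n in [1, 2, 3]:
    total_inside += n
    total_inside += 10   # indented: runs on every pass of the loop

total_outside = 0
for n in [1, 2, 3]:
    total_outside += n
total_outside += 10      # not indented: runs once, after the loop ends

print(total_inside)   # 36
print(total_outside)  # 16
```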
Python can be executed both in interactive mode and in normal script mode. In interactive mode, you write Python code directly into a command-line shell, which gives immediate feedback for each statement. In normal script mode, Python script files (.py) are executed by the Python interpreter until completion or error.
Tutorials:
https://www.codecademy.com/learn/python
https://developers.google.com/edu/python/
Code Reference:
http://www.cogsci.rpi.edu/~destem/igd/python_cheat_sheet.pdf
https://gist.github.com/filipkral/740a11c827422264c757
Example: If-elif-else Conditional
temperature = 20
if temperature <= 0:
    print("water is ice")
elif 0 < temperature < 100:
    print("water is liquid")
else:
    print("water will boil")
# will print "water is liquid"
Example: For Loop (iterate over list and over range)
# create list of integers
myNumbers = [-5, -1, 5, 17, 20]
# loop over list
for num in myNumbers:
    print("the current number is " + str(num))
# loop over range from 1 to 99
for num in range(1, 100):
    print("the current number is " + str(num))
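Note that range() stops just before its second argument, which is why range(1, 100) yields 1 to 99. The short sketch below shows this on a smaller range, and also shows enumerate(), which is one way to get the position of each item while looping over a list. The variable names are illustrative only.

```python
# range(start, stop) includes start but excludes stop
first_numbers = list(range(1, 5))
print(first_numbers)  # [1, 2, 3, 4]

# enumerate() yields (position, value) pairs when the index is also needed
my_numbers = [-5, -1, 5, 17, 20]
positions = []
for i, num in enumerate(my_numbers):
    positions.append((i, num))
print(positions[0])  # (0, -5)
```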
Example: Function Definition and Function Calling
def calc(x=1, y=2, op="add"):
    if op == 'add':
        return x + y
    elif op == 'subtract':
        return x - y
    else:
        print('Valid operations: add, subtract')

calc() # uses default values, returns 3
calc(5, 3, 'add') # uses arguments, returns 8
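Arguments can also be passed by name, which lets a caller override a single default. Note also that the else branch only prints a message, so the call returns None in that case. The sketch below repeats the definition of calc so that it runs on its own; the variable names holding the results are made up.

```python
def calc(x=1, y=2, op="add"):
    if op == "add":
        return x + y
    elif op == "subtract":
        return x - y
    else:
        print("Valid operations: add, subtract")

# keyword argument: override only the operation, keep the default x and y
difference = calc(op="subtract")   # 1 - 2
# mix positional and keyword arguments
total = calc(10, op="add")         # 10 + 2
# unsupported operation: the message is printed and None is returned
result = calc(4, 2, "multiply")

print(difference, total, result)  # -1 12 None
```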
Pandas Example: Linear Regression of Life Expectancy for 185 Different Countries
In this example, we will take the GapMinder life expectancy dataset (1916-2015) and perform a linear regression for each of the 185 countries within the dataset.
We will then compare the R-squared of each model to see if a linear model is a good fit for most countries.
To facilitate data analysis, regression and plotting, we will use the ‘pandas’, ‘statsmodels’ and ‘matplotlib’ packages.
A good general overview of the ‘pandas’ package can be found at: http://pandas.pydata.org/pandas-docs/stable/10min.html
Package Imports
# show plots inline (IPython/Jupyter magic command; omit this line in a plain .py script)
%matplotlib inline
import pandas as pd
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
# set plotting style to 'ggplot'
plt.style.use('ggplot')
# to suppress a dataframe warning
pd.options.mode.chained_assignment = None
Reading from CSV File and Basic Analysis
We first retrieve the data into a ‘pandas’ dataframe from a CSV file containing the GapMinder life expectancy dataset.
# read csv file into dataframe
df = pd.read_csv('lifeexpectancy.csv')
We can see that the dataframe stores the content of the CSV file in a spreadsheet-like format. Notice that the data is currently stored in a pivoted format, with the years listed across a number of columns.
df.head()
Using the describe() function, we can compute basic summary statistics for the life expectancy dataset (by year).
df.describe()
Melting the Data (Unpivoting)
In order to better work with the data, we would like to “unpivot” the data so that each row is a separate record. This will enable us to better perform linear regression and other analysis on the dataset.
We can use the melt() function within ‘pandas’ to help us do this. We pass the column(s) that we want to use as the identifier for each record as the ‘id_vars’ argument of the melt() function. All other column(s) will be melted.
# melt the pivoted data
le = pd.melt(df, id_vars=['Country'])
# rename columns
le2 = le.rename(columns={'variable':'year','value':'life_expectancy'})
# set year variable as numeric (originally read in as object from csv)
le2['year'] = pd.to_numeric(le2['year'], errors='coerce')
# sort by country name and year
le2 = le2.sort_values(['Country','year'])
# check to see if dataframe is properly melted
le2.head(10)
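To see what melting does on a small scale, the sketch below unpivots a toy table. The country names and values are made up, and the ‘var_name’ and ‘value_name’ arguments of melt() are used here as a shortcut for the separate rename step.

```python
import pandas as pd

# toy pivoted table: one row per country, one column per year (values are made up)
wide = pd.DataFrame({
    'Country': ['A', 'B'],
    '2000': [70.0, 60.0],
    '2001': [71.0, 61.5],
})

# unpivot: 'Country' identifies each record, the year columns are melted
tidy = pd.melt(wide, id_vars=['Country'], var_name='year', value_name='life_expectancy')

# the year labels were column names (strings), so convert them to numbers
tidy['year'] = pd.to_numeric(tidy['year'])
tidy = tidy.sort_values(['Country', 'year']).reset_index(drop=True)

print(tidy)
```

Each (country, year) pair now has its own row, which is the shape a regression formula such as 'life_expectancy ~ year' expects.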
Basic Line Plot
We can then plot the data for one country, for instance Canada, to see how the data looks.
# set plot size
fig, ax = plt.subplots(figsize=(10, 6))
# filter data for Canada
countryData = le2[(le2.Country == 'Canada')]
# plot values
plt.plot(countryData['year'], countryData['life_expectancy'], '.')
plt.title("Canada - Life Expectancy")
plt.show()
Simple Linear Regression
We can do a simple linear regression on the selected country’s dataset and then plot the regression line against the actual values. We will use the ‘statsmodels’ ols function to perform the linear regression.
# fit linear regression
linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
# set plot size
fig, ax = plt.subplots(figsize=(10, 6))
# plot actual values
plt.plot(countryData['year'], countryData['life_expectancy'], '.')
# plot predicted values (from fitted regression)
plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
plt.title("Canada - Life Expectancy vs Fitted Linear Regression Line")
plt.show()
We can show the summary of the linear regression using the summary() function.
Looking at the results for Canada, it seems that the linear model is a good fit as the R-squared is very close to 1.
linearModel.summary()
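For reference, the R-squared reported in the summary is the share of the variance in the response that the fitted line explains, that is 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up data. It uses numpy.polyfit as a stand-in for the statsmodels fit, so it illustrates the formula rather than reproducing the model above.

```python
import numpy as np

# made-up data that roughly follows a straight line
year = np.array([2000, 2001, 2002, 2003, 2004], dtype=float)
life_expectancy = np.array([70.1, 70.4, 70.9, 71.1, 71.6])

# least-squares straight line (degree-1 polynomial): returns slope, then intercept
slope, intercept = np.polyfit(year, life_expectancy, 1)
fitted = intercept + slope * year

# residual sum of squares and total sum of squares
ss_res = np.sum((life_expectancy - fitted) ** 2)
ss_tot = np.sum((life_expectancy - life_expectancy.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))
```

A value near 1 means the residuals are small compared with the overall spread of the data.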
We can retrieve various values (beta coefficient values, R-squared) from the model using the following commands:
# model coefficients
linearModel.params
# model intercept coefficient
linearModel.params['Intercept']
# model slope coefficient
linearModel.params['year']
# model R-squared value
linearModel.rsquared
# model coefficient p-values
linearModel.pvalues
# model predicted values
linearModel.predict()
Split-Apply-Combine
Now that we have the linear regression model for a single country (Canada) and can see that it is a reasonable model to use, it would be nice to be able to see if this holds true for all countries.
It would be very tedious to run a separate linear regression model and plot for each of the 185 countries in the dataset.
Fortunately, we can use the ‘groupby’ function in ‘pandas’ to help us. This function will help us to:
(1) Split the data into separate groups (one for each country)
(2) Apply a function (such as a linear regression) to each of the grouped datasets
(3) Combine the results of each of the groups into a single object for straightforward analysis.
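The three steps above can be sketched on a toy dataset. The country names and values are made up, the helper name yearly_gain is hypothetical, and numpy.polyfit stands in for the statsmodels regression to keep the example short.

```python
import numpy as np
import pandas as pd

# toy long-format data for two hypothetical countries (values are made up)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2000, 2001, 2002, 2000, 2001, 2002],
    'life_expectancy': [70.0, 71.0, 72.0, 60.0, 60.5, 61.0],
})

def yearly_gain(group):
    # slope of a least-squares straight line through the group's points
    slope, intercept = np.polyfit(group['year'], group['life_expectancy'], 1)
    return slope

# split by country, apply the function to each group, combine into one Series
gains = toy.groupby('Country')[['year', 'life_expectancy']].apply(yearly_gain)
print(gains)
```

The result is a Series indexed by country, with one value per group.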
Let’s use this functionality to run a regression for all of the countries separately and then compare the R-squared for all the models in a single plot.
In order to do this, first we need to define a function that can be applied to each of the grouped datasets.
This function (to perform a linear regression and return the R-squared value of the model) is defined below:
def getLinearModelRSquared(modelData):
    # fit a linear regression of life expectancy against year for one group
    linearModel = sm.ols(formula='life_expectancy ~ year', data=modelData).fit()
    return linearModel.rsquared
We can now use the ‘groupby’ function to apply our new function to the dataset for each of the 185 countries. We will re-use the original ‘le2’ dataframe that has all the melted data for all of the countries in this step.
# group dataframe by country (split)
grouped = le2.groupby('Country')
# apply function to each dataset and combine results
rSquaredValues = grouped.apply(getLinearModelRSquared)
# show the first 5 computed R-Squared values
rSquaredValues.head()
We now have the R-squared values for all of the 185 linear regressions, one for each country’s dataset. We can now plot the values as a histogram to see how the models’ R-squared values are distributed.
fig, ax = plt.subplots(figsize=(10, 6))
rSquaredValues.plot(kind='hist', ax=ax, bins=20)
plt.title("R-Squared Values for Linear Regression Models")
plt.show()
We can now see that R-squared values for most of the countries are fairly close to 1, so it seems that a linear model is reasonable for many countries.
If we sort the R-squared values, we will be able to see which countries have the best fit and which countries have the worst fit using a linear model.
# sort in ascending order: worst fits first, best fits last
rSquaredValues = rSquaredValues.sort_values()
# the last rows are the countries with the best fit
rSquaredValues.tail()
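Reading the best and worst entries off a sorted Series can be sketched on made-up data as follows. The labels and values are hypothetical; idxmax() and idxmin() are shown as a way to get the extreme labels directly without sorting.

```python
import pandas as pd

# made-up R-squared values indexed by hypothetical country names
r2 = pd.Series({'A': 0.95, 'B': 0.40, 'C': 0.99, 'D': 0.80})

# sort_values returns a new, sorted Series (ascending by default)
r2_sorted = r2.sort_values()
print(r2_sorted.head(2))   # lowest values: worst fits
print(r2_sorted.tail(2))   # highest values: best fits

# idxmax / idxmin return the labels of the largest and smallest values
best = r2.idxmax()
worst = r2.idxmin()
print(best, worst)  # C B
```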
# filter data for Nicaragua (best fit based on R-squared)
countryData = le2[(le2.Country == 'Nicaragua')]
# fit linear regression
linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
# set plot size
fig, ax = plt.subplots(figsize=(10, 6))
plt.plot(countryData['year'], countryData['life_expectancy'], '.')
plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
plt.title("Nicaragua - Life Expectancy vs Fitted Linear Regression Line")
plt.show()
# the first rows of the sorted values are the countries with the worst fit
rSquaredValues.head()
# filter data for Lesotho (worst fit based on R-squared)
countryData = le2[(le2.Country == 'Lesotho')]
# fit linear regression
linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
# set plot size
fig, ax = plt.subplots(figsize=(10, 6))
plt.plot(countryData['year'], countryData['life_expectancy'], '.')
plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
plt.title("Lesotho - Life Expectancy vs Fitted Linear Regression Line")
plt.show()
In-Class Assignment
Please complete the following as part of your work for this class. There are 5 questions and 1 optional bonus question.
You should be able to complete the assignment using materials learnt from the presentation and from the ‘pandas example: linear regression of life expectancy for 185 different countries’ that is part of this document.
Submit the assignment answers as a well-commented Python script named: PandasClass-[YourStudentId].py
Question 1 (conditions, lists, functions and loops)
Write a Python function that will return “odd” if an input number is odd and “even” if an input number is even. One method could be to use the ‘modulo’ or ‘%’ operator as part of a comparison.
Use a ‘for’ loop and a manually created list of integers to test your function for the following 8 integers by printing the results of your function:
-517, -212, -14, 0, 3, 28, 421, 1500
Use the ‘gdppercapita.csv’ file as an input for the following questions. This CSV file contains the GDP per capita for 185 countries in tabular/pivoted form.
Be sure that the CSV file is in the same working directory as your Python script or else it will not be able to find the file when you run your script.
You will likely also need to import the ‘pandas’, ‘statsmodels’ and ‘matplotlib’ packages to complete the following questions.
Question 2 (pandas, file reading and melting)
Read the ‘gdppercapita.csv’ file into a ‘pandas’ dataframe, melt the data and rename the columns so that the dataframe columns look like this:
Country | year | gdp_per_capita
You will also need to convert the year column into a numeric type for further analysis.
Print out the summary statistics (mean, standard error, range) for the gdp_per_capita column.
Question 3 (linear regression and plotting)
Filter the dataset for a country of your choice and run a linear regression of gdp_per_capita against the year.
Print the summary for your regression model and then plot the linear regression line (predicted values) versus the actual values.
Write in a single-line Python comment whether your country’s dataset fits a linear model well.
Question 4 (split-apply-combine method)
Write a generic function that will return the ‘slope coefficient’ for a linear regression of gdp_per_capita against the year for any input dataset.
Using the split-apply-combine method, apply your generic function to create a dataframe containing the ‘slope coefficient’ values for 185 linear models (1 for each of the 185 countries in your melted dataset).
Plot the ‘slope coefficient’ values as a histogram so that you can better review the distribution.
Write in a single-line Python comment what the distribution of ‘slope coefficients’ looks like across all the models.
Question 5 (dataframe sorting)
Sort your dataframe containing the ‘slope coefficient’ values so that you can see which countries have the greatest and lowest ‘slope coefficients’.
Plot, on the same figure, the regression line for the country you chose in Question 3 and the regression line for the country that has the highest ‘slope coefficient’ in the entire GapMinder dataset. This will allow you to compare the models of the two countries easily.
(If your original selected country also has the highest ‘slope coefficient’, then plot it against the country with the lowest ‘slope coefficient’.)
Question 6 (optional bonus)
Apply your Python and pandas skills to investigate both the ‘gdppercapita.csv’ and ‘lifeexpectancy.csv’ datasets.
See if you can find or plot some interesting results from a combination of the two datasets (maybe scatterplots). You may need to use outside resources for this question if you try it.