by Christopher Pang

temperature = 20 

if temperature <= 0:
    print("water is ice")
elif temperature < 100:
    print("water is liquid")
else:
    print("water will boil")
    
# will print "water is liquid"

Example: For Loop (iterate over list and over range)

# create list of integers
myNumbers = [-5, -1, 5, 17, 20]

# loop over list
for num in myNumbers:
    print("the current number is " + str(num))

# loop over range from 1 to 99
for num in range(1, 100):
    print("the current number is " + str(num))

Example: Function Definition and Function Calling

def calc(x=1, y=2, op="add"):
    # the function body must be indented under the def line
    if op == 'add':
        return x + y
    elif op == 'subtract':
        return x - y
    else:
        print('Valid operations: add, subtract')

calc()                  # uses default value, outputs 3
calc(5, 3, 'add')       # uses arguments, outputs 8

End of Python Basics

Pandas Example: Linear Regression of Life Expectancy for 185 Different Countries

In this example, we will take the GapMinder life expectancy dataset (1916-2015) and perform a linear regression for each of the 185 countries within the dataset.

We will then compare the R-Squared of each model to see if a linear model is a good fit for most countries.

To facilitate data analysis, regression and plotting, we will use the ‘pandas’, ‘statsmodels’ and ‘matplotlib’ packages.

A good general overview of the ‘pandas’ package can be found at: http://pandas.pydata.org/pandas-docs/stable/10min.html

Package Imports

    # show plots inline
    %matplotlib inline
    
    import pandas as pd 
    import statsmodels.formula.api as sm
    import matplotlib.pyplot as plt
    
    # set plotting style to 'ggplot'
    plt.style.use('ggplot')
    
    # to suppress a dataframe warning
    pd.options.mode.chained_assignment = None

Reading from CSV File and Basic Analysis

We first retrieve the data into a ‘pandas’ dataframe from a CSV file containing the GapMinder life expectancy dataset.

    # read csv file into dataframe
    df = pd.read_csv('lifeexpectancy.csv')

We can see that the dataframe stores the content of the CSV file in a spreadsheet-like format. Notice that the data is currently stored in a pivoted format, with the years listed across a number of columns.

df.head()

Using the describe() function, we can do basic summary statistics for the life expectancy dataset (by year).

df.describe()

Melting the Data (Unpivoting)

In order to better work with the data, we would like to “unpivot” the data so that each row is a separate record. This will enable us to better perform linear regression and other analysis on the dataset.

We can use the melt() function within ‘pandas’ to help us do this. We pass the column(s) that we want to use as the identifier for each record to the ‘id_vars’ argument of the melt() function. All other column(s) will be melted.

    # melt the normalized file
    le = pd.melt(df, id_vars=['Country'])

    # rename columns and then sort by country name and year
    le2 = le.rename(columns={'variable':'year','value':'life_expectancy'})
    le2 = le2.sort_values(['Country','year'])
    
    # set year variable as numeric (originally read in as object from csv)
    le2['year'] = pd.to_numeric(le2['year'], errors='coerce')
    
    # check to see if dataframe is properly melted
    le2.head(10)
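
To make the effect of melt() easier to see, here is a minimal sketch on a tiny, made-up dataframe laid out like the life expectancy file. The countries 'A' and 'B' and their values are invented for illustration. The sketch uses pd.to_numeric() and sort_values(), which are the calls available in current pandas releases.

```python
import pandas as pd

# made-up data in the same pivoted layout: one row per country, one column per year
wide = pd.DataFrame({
    'Country': ['A', 'B'],
    '2014': [70.0, 80.0],
    '2015': [71.0, 81.0],
})

# 'Country' identifies each record; every other column is melted into rows
tall = pd.melt(wide, id_vars=['Country'])
tall = tall.rename(columns={'variable': 'year', 'value': 'life_expectancy'})

# the year labels were column names (strings), so convert them to numbers
tall['year'] = pd.to_numeric(tall['year'])
tall = tall.sort_values(['Country', 'year']).reset_index(drop=True)

print(tall)
```

Each of the two wide rows becomes two rows in the melted result, one per year column.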

Basic Line Plot

We can then plot the data for one country, for instance Canada, to see how the data looks.

    # set plot size
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # filter data for Canada and plot
    countryData = le2[(le2.Country == 'Canada')]
    
    # plot values
    plt.plot(countryData['year'], countryData['life_expectancy'], '.')
    plt.title("Canada - Life Expectancy")
    plt.show()

Simple Linear Regression

We can do a simple linear regression on the selected country’s dataset and then plot the regression line against the actual values. We will use the ‘statsmodels’ ols function to perform the linear regression.

    # fit linear regression
    linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
    
    # set plot size
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # plot actual values
    plt.plot(countryData['year'], countryData['life_expectancy'], '.')

    # plot predicted values (from fitted regression)
    plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
    plt.title("Canada - Life Expectancy vs Fitted Linear Regression Line")
    plt.show()
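
For a single predictor, the least-squares line has a simple closed form: the slope is the sum of (x - mean of x) * (y - mean of y) divided by the sum of (x - mean of x) squared, and the intercept is (mean of y) - slope * (mean of x). The sketch below works this out in plain Python on made-up numbers. The function name fit_line is hypothetical and is not part of ‘statsmodels’; it is only meant to show the arithmetic behind a one-variable fit.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = float(len(xs))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# made-up, exactly linear data: the value rises by 0.5 each year
years = [2000, 2001, 2002, 2003]
values = [70.0, 70.5, 71.0, 71.5]

intercept, slope = fit_line(years, values)
print(intercept, slope)
```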

We can show the summary of the linear regression using the summary() function.

Looking at the results for Canada, it seems that the linear model is a good fit as the R-squared is very close to 1.

linearModel.summary()
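
As a reminder of what the R-squared figure measures: for a model with an intercept it is commonly defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers. The helper name r_squared is an assumption for illustration, not a ‘statsmodels’ function.

```python
def r_squared(actual, fitted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / float(len(actual))
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]      # made-up observations
perfect = [1.0, 2.0, 3.0, 4.0]     # fitted values that match exactly
flat = [2.5, 2.5, 2.5, 2.5]        # fitted values equal to the mean

print(r_squared(actual, perfect))  # a perfect fit explains all the variation
print(r_squared(actual, flat))     # predicting the mean explains none of it
```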

We can retrieve various values (beta coefficient values, R-squared) from the model using the following commands:

# model coefficients
linearModel.params

Intercept   -468.576000
year           0.274273
dtype: float64

# model intercept coefficient
linearModel.params[0]

-468.57599994618505

# model slope coefficient
linearModel.params[1]

0.27427257049024917

# model R-squared value
linearModel.rsquared

0.95860297854064103

# model coefficient p-values
linearModel.pvalues

Intercept    6.846878e-64
year         1.398615e-69
dtype: float64

# model predicted values
linearModel.predict()

array([ 56.93024511,  57.20451768,  57.47879025,  57.75306282,
        58.0273354 ,  58.30160797,  58.57588054,  58.85015311,
        59.12442568,  59.39869825,  59.67297082,  59.94724339,
        60.22151596,  60.49578853,  60.7700611 ,  61.04433367,
        61.31860624,  61.59287881,  61.86715138,  62.14142395,
        62.41569652,  62.68996909,  62.96424166,  63.23851423,
        63.5127868 ,  63.78705938,  64.06133195,  64.33560452,
        64.60987709,  64.88414966,  65.15842223,  65.4326948 ,
        65.70696737,  65.98123994,  66.25551251,  66.52978508,
        66.80405765,  67.07833022,  67.35260279,  67.62687536,
        67.90114793,  68.1754205 ,  68.44969307,  68.72396564,
        68.99823821,  69.27251079,  69.54678336,  69.82105593,
        70.0953285 ,  70.36960107,  70.64387364,  70.91814621,
        71.19241878,  71.46669135,  71.74096392,  72.01523649,
        72.28950906,  72.56378163,  72.8380542 ,  73.11232677,
        73.38659934,  73.66087191,  73.93514448,  74.20941705,
        74.48368962,  74.75796219,  75.03223477,  75.30650734,
        75.58077991,  75.85505248,  76.12932505,  76.40359762,
        76.67787019,  76.95214276,  77.22641533,  77.5006879 ,
        77.77496047,  78.04923304,  78.32350561,  78.59777818,
        78.87205075,  79.14632332,  79.42059589,  79.69486846,
        79.96914103,  80.2434136 ,  80.51768618,  80.79195875,
        81.06623132,  81.34050389,  81.61477646,  81.88904903,
        82.1633216 ,  82.43759417,  82.71186674,  82.98613931,
        83.26041188,  83.53468445,  83.80895702,  84.08322959])
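
The predicted values are simply the fitted line evaluated at each year, that is, intercept + slope * year. As a quick check, the sketch below plugs the intercept and slope printed earlier for Canada into that formula for the first and last years of the dataset (1916 and 2015). The function name is made up for this illustration.

```python
# coefficients copied from the linearModel.params output for Canada
intercept = -468.57599994618505
slope = 0.27427257049024917

def predicted_life_expectancy(year):
    """Evaluate the fitted line at a given year."""
    return intercept + slope * year

first = predicted_life_expectancy(1916)
last = predicted_life_expectancy(2015)

# these should agree with the first and last entries of the predict() array
print(round(first, 4), round(last, 4))
```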

Split-Apply-Combine

Now that we have the linear regression model for a single country (Canada) and can see that it is a reasonable model to use, it would be nice to be able to see if this holds true for all countries.

It would be very tedious to run a separate linear regression model and plot for each of the 185 countries in the dataset.

Fortunately, we can use the ‘groupby’ function in ‘pandas’ to help us. This function will help us to:

(1) Split the data into separate groups (for each country)

(2) Apply a function (such as a linear regression) for each of the grouped datasets

(3) Combine the results of each of the groups into a single dataframe for straightforward analysis.
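
The three steps can be sketched on a tiny, made-up dataframe before applying them to the real data. The data and the function value_range are invented for illustration; the function is a stand-in for any per-group calculation that returns a single number.

```python
import pandas as pd

# made-up data: two groups with three values each
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value':   [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

def value_range(group):
    """Called once per group; returns a single number for that group."""
    return group['value'].max() - group['value'].min()

# split by 'Country', apply value_range to each piece, combine into one Series
# (selecting [['value']] hands each group only the column the function needs)
ranges = toy.groupby('Country')[['value']].apply(value_range)

print(ranges)
```

The result has one entry per group, indexed by the grouping column.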

Let’s use this functionality to run a regression for all of the countries separately and then compare the R-squared for all the models in a single plot.

In order to do this, first we need to define a function that can be applied to each of the grouped datasets.

This function (to perform a linear regression and return the R-squared value of the model) is defined below:

    def getLinearModelRSquared(modelData):
        linearModel = sm.ols(formula='life_expectancy ~ year', data=modelData).fit()
                             
        return linearModel.rsquared

We can now use the ‘groupby’ function to apply our new function to the dataset for each of the 185 countries. We will re-use the original ‘le2’ dataframe that has all the melted data for all of the countries in this step.

    # group dataframe by country (split)
    grouped = le2.groupby('Country')
        
    # apply function to each dataset and combine results
    rSquaredValues = grouped.apply(getLinearModelRSquared)

    # show the first 5 computed R-Squared values
    rSquaredValues.head()

Country
Afghanistan            0.940734
Albania                0.903918
Algeria                0.975977
Angola                 0.954742
Antigua and Barbuda    0.903550
dtype: float64

We now have the R-squared values for all of the 185 linear regressions, one for each country’s dataset. We can now plot the values in a histogram to see how the models’ R-squared values are distributed.

    fig, ax = plt.subplots(figsize=(10, 6))
    rSquaredValues.plot(kind='hist', ax=ax, bins=20)
    plt.title("R-Squared Values for Linear Regression Models")
    
    plt.show()

We can now see that R-squared values for most of the countries are fairly close to 1, so it seems that a linear model is reasonable for many countries.

If we sort our dataframe containing all the R-squared values, we will be able to see which countries have the best fit and which countries have the worst fit using a linear model.

rSquaredValues = rSquaredValues.sort_values()
rSquaredValues.tail()

Country
Bhutan         0.980740
South Korea    0.981112
Senegal        0.981843
Bolivia        0.983170
Nicaragua      0.984132
dtype: float64
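
The same ordering can be tried out on a small, made-up Series. This is only a sketch: the labels and scores are invented, and idxmin()/idxmax() are shown as one possible shortcut for finding the worst and best entries without sorting.

```python
import pandas as pd

# made-up scores standing in for per-country R-squared values
scores = pd.Series({'A': 0.95, 'B': 0.51, 'C': 0.98, 'D': 0.73})

# sort_values() is ascending by default and returns a new Series
ordered = scores.sort_values()
print(ordered.head(2))   # the two lowest scores
print(ordered.tail(2))   # the two highest scores

# labels of the lowest and highest scores, without sorting
worst = scores.idxmin()
best = scores.idxmax()
print(worst, best)
```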

    # filter data for Nicaragua (best fit based on R-squared)
    countryData = le2[(le2.Country == 'Nicaragua')]

    # fit linear regression
    linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
    
    # set plot size
    fig, ax = plt.subplots(figsize=(10, 6))
    
    plt.plot(countryData['year'], countryData['life_expectancy'], '.')
    plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
    plt.title("Nicaragua - Life Expectancy vs Fitted Linear Regression Line")
    plt.show()

rSquaredValues.head()

Country
Lesotho      0.505176
Zambia       0.533450
Belarus      0.548027
Swaziland    0.570214
Zimbabwe     0.582147
dtype: float64

    # filter data for Lesotho (worst fit based on R-squared)
    countryData = le2[(le2.Country == 'Lesotho')]

    # fit linear regression
    linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
    
    # set plot size
    fig, ax = plt.subplots(figsize=(10, 6))
    
    plt.plot(countryData['year'], countryData['life_expectancy'], '.')
    plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
    plt.title("Lesotho - Life Expectancy vs Fitted Linear Regression Line")
    plt.show()

End of ‘pandas example’

In-Class Assignment

Please complete the following as part of your work for this class. There are 5 questions and 1 optional bonus question.

You should be able to complete the assignment using materials learnt from the presentation and from the ‘pandas example: linear regression of life expectancy for 185 different countries’ that is part of this document.

Submit the assignment answers as a well-commented Python script named: PandasClass-[YourStudentId].py

Question 1 (conditions, lists, functions and loops)

Write a Python function that will return “odd” if an input number is odd and “even” if an input number is even. One method could be to use the ‘modulo’ or ‘%’ operator as part of a comparison.

Use a ‘for’ loop and a manually created list of integers to test your function for the following 8 integers by printing the results of your function:

-517, -212, -14, 0, 3, 28, 421, 1500

Use the ‘gdppercapita.csv’ file as an input for the following questions. This CSV file contains the GDP per capita for 185 countries in tabular/pivoted form.

Be sure that the CSV file is in the same working directory as your Python script or else it will not be able to find the file when you run your script.
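
If you are unsure which folder Python is looking in, the sketch below shows one way to check, and one way to build a path that sits next to a given script. The helper name path_next_to is hypothetical; os.getcwd(), os.path.abspath(), os.path.dirname() and os.path.join() are standard library calls.

```python
import os

def path_next_to(script_path, filename):
    """Build the path of `filename` in the same folder as `script_path`."""
    folder = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(folder, filename)

# the working directory that relative file names are resolved against
print(os.getcwd())

# example with a made-up script location
example = path_next_to(os.path.join('some', 'dir', 'script.py'), 'gdppercapita.csv')
print(example)
```

Inside a script, passing `__file__` as the first argument gives a path that does not depend on the current working directory.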

You will likely also need to import the ‘pandas’, ‘statsmodels’ and ‘matplotlib’ packages to complete the following questions.

Question 2 (pandas, file reading and melting)

Read the ‘gdppercapita.csv’ file into a ‘pandas’ dataframe, melt the data and rename the columns so that the dataframe columns look like this:

Country | year | gdp_per_capita

You will also need to convert the year column into a numeric type for further analysis.

Print out the summary statistics (mean, standard error, range) for the gdp_per_capita column.

Question 3 (linear regression and plotting)

Filter the dataset for a country of your choice and run a linear regression of gdp_per_capita against the year.

Print the summary for your regression model and then plot the linear regression line (predicted values) versus the actual values.

Write in a single-line Python comment whether your country’s dataset fits a linear model well.

Question 4 (split-apply-combine method)

Write a generic function that will return the ‘slope coefficient’ for a linear regression of gdp_per_capita against the year for any input dataset.

Using the split-apply-combine method, apply your generic function to create a dataframe containing the ‘slope coefficient’ values for 185 linear models (1 for each of the 185 countries in your melted dataset).

Plot the ‘slope coefficient’ values into a histogram so that you can better review the distribution.

Write in a single-line Python comment what the distribution of ‘slope coefficients’ looks like across all the models.

Question 5 (dataframe sorting)

Sort your dataframe containing the ‘slope coefficient’ values so that you can see what countries have the greatest and lowest ‘slope coefficients’.

Plot, on the same figure, the regression line for the country you chose in Question 3 and for the country that has the highest ‘slope coefficient’ in the entire GapMinder dataset. This will allow you to compare the models of the two countries easily.

(If the country you originally selected also has the highest ‘slope coefficient’, plot it against the country with the lowest ‘slope coefficient’ instead.)

Question 6 (optional bonus)

Apply your python and pandas skills to investigate both the ‘gdppercapita.csv’ and ‘lifeexpectancy.csv’ datasets.

See if you can find or plot some interesting results from a combination of the two datasets (maybe scatterplots). You may need to use outside resources for this question if you try it.
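
As a possible starting point, two melted dataframes that share ‘Country’ and ‘year’ columns can be lined up row by row with pd.merge(). The sketch below uses tiny, made-up dataframes rather than the real files; the names le_toy and gdp_toy and all values are invented for illustration.

```python
import pandas as pd

# made-up melted data in the Country | year | value layout
le_toy = pd.DataFrame({
    'Country': ['A', 'A', 'B'],
    'year': [2014, 2015, 2015],
    'life_expectancy': [70.0, 71.0, 80.0],
})
gdp_toy = pd.DataFrame({
    'Country': ['A', 'B', 'B'],
    'year': [2015, 2014, 2015],
    'gdp_per_capita': [1000.0, 5000.0, 5100.0],
})

# keep only the (Country, year) pairs present in both dataframes
combined = pd.merge(le_toy, gdp_toy, on=['Country', 'year'])

print(combined)
```

Each row of the combined dataframe then holds both measurements for one country and year, which is the shape a scatterplot needs.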

Output of df.head():

	Country	1916	1917	1918	1919	1920	1921	1922	1923	1924	…	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015
0	Afghanistan	27.022387	27.012140	7.045063	26.991647	26.9814	27.073893	27.166387	27.258880	27.351373	…	53.2	53.6	54.0	54.5	54.8	55.2	55.5	56.2	56.91	57.63
1	Albania	35.400000	35.400000	19.427932	35.400000	35.4000	35.395040	35.390080	35.385120	35.380160	…	74.5	74.7	74.9	75.0	75.2	75.5	75.7	75.8	75.90	76.00
2	Algeria	28.287837	28.287837	22.056668	28.287837	27.5000	27.624680	27.465360	29.998040	31.466720	…	74.8	75.0	75.3	75.6	75.9	76.1	76.2	76.3	76.40	76.50
3	Angola	26.980000	26.980000	10.434273	26.980000	26.9800	27.193360	27.406720	27.620080	27.833440	…	56.9	57.6	58.3	58.9	59.4	59.7	60.1	60.4	60.70	61.00
4	Antigua and Barbuda	33.536000	33.536000	21.750121	33.536000	33.5360	33.554040	34.399666	35.245292	36.090919	…	74.4	74.6	74.8	75.1	75.2	75.2	75.2	75.2	75.20	75.20

Output of df.describe():

	1916	1917	1918	1919	1920	1921	1922	1923	1924	1925	…	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015
count	185.000000	185.000000	185.000000	185.000000	185.000000	185.000000	185.00000	185.000000	185.000000	185.000000	…	185.000000	185.000000	185.000000	185.000000	185.000000	185.000000	185.000000	185.000000	185.000000	185.000000
mean	33.977578	33.811350	22.596093	33.925875	34.147988	34.684081	35.07504	35.856659	36.188317	36.395107	…	69.077297	69.428108	69.742162	70.074595	70.263784	70.649730	70.945946	71.185405	71.426486	71.669189
std	7.934454	8.027239	11.127740	8.478650	9.100745	9.653877	9.61438	9.445528	9.469711	9.618643	…	9.076710	8.901207	8.744934	8.569927	8.788056	8.319713	8.199794	8.041966	7.888871	7.741353
min	19.000000	20.000000	1.000000	12.000000	15.226000	11.956020	13.91204	23.478440	23.508920	23.539400	…	44.000000	44.400000	45.200000	46.300000	37.000000	48.300000	48.200000	48.300000	48.400000	48.500000
25%	29.800000	29.400000	12.453061	29.700000	29.200000	29.602840	29.80132	30.352100	30.609399	30.701200	…	62.500000	62.900000	63.000000	63.500000	63.800000	64.200000	64.600000	64.900000	65.200000	65.500000
50%	32.000000	32.000000	21.750121	32.000000	31.976800	32.019820	32.22680	32.619160	32.800400	32.878300	…	71.200000	71.300000	72.100000	72.300000	72.400000	72.500000	72.700000	72.700000	72.800000	73.130000
75%	35.600000	35.400000	29.000000	35.500000	35.500000	35.674900	36.67232	37.629471	38.244640	39.030800	…	75.800000	76.200000	76.500000	76.700000	76.800000	76.900000	77.300000	77.500000	77.700000	77.900000
max	58.393922	59.020000	56.250000	59.981569	60.510784	61.778400	62.89760	62.988000	62.978000	63.239000	…	82.300000	82.500000	82.600000	82.800000	83.000000	82.800000	83.200000	83.300000	83.400000	83.500000

Output of le2.head(10):

	Country	year	life_expectancy
0	Afghanistan	1916	27.022387
185	Afghanistan	1917	27.012140
370	Afghanistan	1918	7.045063
555	Afghanistan	1919	26.991647
740	Afghanistan	1920	26.981400
925	Afghanistan	1921	27.073893
1110	Afghanistan	1922	27.166387
1295	Afghanistan	1923	27.258880
1480	Afghanistan	1924	27.351373
1665	Afghanistan	1925	27.443867

Output of linearModel.summary() for Canada:

Dep. Variable:	life_expectancy	R-squared:	0.959
Model:	OLS	Adj. R-squared:	0.958
Method:	Least Squares	F-statistic:	2269.
Date:	Wed, 21 Oct 2015	Prob (F-statistic):	1.40e-69
Time:	13:42:44	Log-Likelihood:	-191.68
No. Observations:	100	AIC:	387.4
Df Residuals:	98	BIC:	392.6
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	-468.5760	11.318	-41.402	0.000	-491.035 -446.117
year	0.2743	0.006	47.637	0.000	0.263 0.286

COE Toolbox

Useful tools of the operations research trade

Python Basics

Example: If-elif-else Conditional