Python Basics

by Christopher Pang

Python is a widely used, general-purpose, high-level programming language. Python programs are structured into code blocks through indentation (whitespace) rather than the curly braces that are common in languages such as C++, C# and Java. This means that Python code must be properly indented, not only for readability but also for correct execution.
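
As a minimal sketch of how indentation defines code blocks (the function name `classify` and its labels are made up for illustration), each extra level of indentation opens a new block, and returning to a previous level closes it:

```python
def classify(n):
    # Everything indented under 'def' is the body of the function.
    if n < 0:
        # Indented one level further: runs only when n < 0.
        label = "negative"
    else:
        # Same depth as the 'if' body: runs only when n >= 0.
        label = "non-negative"
    # Back at the function's own level: runs on every call.
    return label


print(classify(-3))
print(classify(4))
```

Removing the indentation in front of the `return` line would place it outside the function, and Python would refuse to run the script.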


Python can be executed both in interactive mode and in normal script mode. In interactive mode, you write Python code directly into a command-line shell, which gives immediate feedback for each statement. In normal script mode, a Python script file (.py) is executed by the Python interpreter until completion or error.


Tutorials:

https://www.codecademy.com/learn/python

https://developers.google.com/edu/python/

Code Reference:

http://www.cogsci.rpi.edu/~destem/igd/python_cheat_sheet.pdf

https://gist.github.com/filipkral/740a11c827422264c757


Example: If-elif-else Conditional

In [ ]:
temperature = 20 

if temperature <= 0:
    print("water is ice")
elif temperature > 0 and temperature < 100:
    print("water is liquid")
else:
    print("water will boil")
    
# will print "water is liquid"

Example: For Loop (iterate over list and over range)

In [ ]:
# create list of integers
myNumbers = [-5, -1, 5, 17, 20]

# loop over list
for num in myNumbers:
    print("the current number is " + str(num))

# loop over range from 1 to 99
for num in range(1, 100):
    print("the current number is " + str(num))

Example: Function Definition and Function Calling

In [ ]:
def calc(x=1, y=2, op="add"):
    # the function body must be indented under the 'def' line
    if op == 'add':
        return x + y
    elif op == 'subtract':
        return x - y
    else:
        print('Valid operations: add, subtract')

calc()                  # uses default values, returns 3
calc(5, 3, 'add')       # uses arguments, returns 8

End of Python Basics

Pandas Example: Linear Regression of Life Expectancy for 185 Different Countries


In this example, we will take the GapMinder life expectancy dataset (1916-2015) and perform a linear regression for each of the 185 countries in the dataset.

We will then compare the R-Squared of each model to see if a linear model is a good fit for most countries.

To facilitate data analysis, regression and plotting, we will use the ‘pandas’, ‘statsmodels’ and ‘matplotlib’ packages.

A good general overview of the ‘pandas’ package can be found at: http://pandas.pydata.org/pandas-docs/stable/10min.html


Package Imports

In [1]:
    # show plots inline
    %matplotlib inline
    
    import pandas as pd 
    import statsmodels.formula.api as sm
    import matplotlib.pyplot as plt
    
    # set plotting style to 'ggplot'
    plt.style.use('ggplot')
    
    # to suppress a dataframe warning
    pd.options.mode.chained_assignment = None

Reading from CSV File and Basic Analysis

We first retrieve the data into a ‘pandas’ dataframe from a CSV file containing the GapMinder life expectancy dataset.

In [2]:
    # read csv file into dataframe
    df = pd.read_csv('lifeexpectancy.csv') 

We can see that the dataframe stores the content of the CSV file in a spreadsheet-like format. Notice that the data is currently stored in a pivoted format, with the years listed across a number of columns.

In [3]:
df.head()
Out[3]:
Country 1916 1917 1918 1919 1920 1921 1922 1923 1924 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
0 Afghanistan 27.022387 27.012140 7.045063 26.991647 26.9814 27.073893 27.166387 27.258880 27.351373 53.2 53.6 54.0 54.5 54.8 55.2 55.5 56.2 56.91 57.63
1 Albania 35.400000 35.400000 19.427932 35.400000 35.4000 35.395040 35.390080 35.385120 35.380160 74.5 74.7 74.9 75.0 75.2 75.5 75.7 75.8 75.90 76.00
2 Algeria 28.287837 28.287837 22.056668 28.287837 27.5000 27.624680 27.465360 29.998040 31.466720 74.8 75.0 75.3 75.6 75.9 76.1 76.2 76.3 76.40 76.50
3 Angola 26.980000 26.980000 10.434273 26.980000 26.9800 27.193360 27.406720 27.620080 27.833440 56.9 57.6 58.3 58.9 59.4 59.7 60.1 60.4 60.70 61.00
4 Antigua and Barbuda 33.536000 33.536000 21.750121 33.536000 33.5360 33.554040 34.399666 35.245292 36.090919 74.4 74.6 74.8 75.1 75.2 75.2 75.2 75.2 75.20 75.20

5 rows × 101 columns

Using the describe() function, we can do basic summary statistics for the life expectancy dataset (by year).

In [4]:
df.describe()
Out[4]:
1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
count 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.00000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000 185.000000
mean 33.977578 33.811350 22.596093 33.925875 34.147988 34.684081 35.07504 35.856659 36.188317 36.395107 69.077297 69.428108 69.742162 70.074595 70.263784 70.649730 70.945946 71.185405 71.426486 71.669189
std 7.934454 8.027239 11.127740 8.478650 9.100745 9.653877 9.61438 9.445528 9.469711 9.618643 9.076710 8.901207 8.744934 8.569927 8.788056 8.319713 8.199794 8.041966 7.888871 7.741353
min 19.000000 20.000000 1.000000 12.000000 15.226000 11.956020 13.91204 23.478440 23.508920 23.539400 44.000000 44.400000 45.200000 46.300000 37.000000 48.300000 48.200000 48.300000 48.400000 48.500000
25% 29.800000 29.400000 12.453061 29.700000 29.200000 29.602840 29.80132 30.352100 30.609399 30.701200 62.500000 62.900000 63.000000 63.500000 63.800000 64.200000 64.600000 64.900000 65.200000 65.500000
50% 32.000000 32.000000 21.750121 32.000000 31.976800 32.019820 32.22680 32.619160 32.800400 32.878300 71.200000 71.300000 72.100000 72.300000 72.400000 72.500000 72.700000 72.700000 72.800000 73.130000
75% 35.600000 35.400000 29.000000 35.500000 35.500000 35.674900 36.67232 37.629471 38.244640 39.030800 75.800000 76.200000 76.500000 76.700000 76.800000 76.900000 77.300000 77.500000 77.700000 77.900000
max 58.393922 59.020000 56.250000 59.981569 60.510784 61.778400 62.89760 62.988000 62.978000 63.239000 82.300000 82.500000 82.600000 82.800000 83.000000 82.800000 83.200000 83.300000 83.400000 83.500000

8 rows × 100 columns


Melting the Data (Unpivoting)

In order to better work with the data, we would like to “unpivot” the data so that each row is a separate record. This will enable us to better perform linear regression and other analysis on the dataset.

We can use the melt() function within ‘pandas’ to help us do this. We pass the column(s) that we want to use as the identifier for each record to the ‘id_vars’ argument of the melt() function. All other column(s) will be melted.

In [5]:
    # melt the pivoted dataframe
    le = pd.melt(df, id_vars=['Country'])

    # rename columns and then sort by country name and year
    le2 = le.rename(columns={'variable':'year','value':'life_expectancy'})
    le2 = le2.sort_values(['Country','year'])
    
    # set year variable as numeric (originally read in as object from csv)
    le2['year'] = pd.to_numeric(le2['year'])
    
    # check to see if dataframe is properly melted
    le2.head(10)
Out[5]:
Country year life_expectancy
0 Afghanistan 1916 27.022387
185 Afghanistan 1917 27.012140
370 Afghanistan 1918 7.045063
555 Afghanistan 1919 26.991647
740 Afghanistan 1920 26.981400
925 Afghanistan 1921 27.073893
1110 Afghanistan 1922 27.166387
1295 Afghanistan 1923 27.258880
1480 Afghanistan 1924 27.351373
1665 Afghanistan 1925 27.443867
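
To see what melt() does on a table small enough to check by eye, here is a minimal sketch. The two-country table and its values are made up for illustration and are not taken from the GapMinder file.

```python
import pandas as pd

# Hypothetical pivoted table: one row per country, one column per year.
wide = pd.DataFrame({
    "Country": ["A", "B"],
    "2000": [70.0, 60.0],
    "2001": [71.0, 61.0],
})

# 'Country' stays as the identifier. Every other column is melted into
# two columns: 'variable' (the old column name) and 'value' (the cell).
melted = pd.melt(wide, id_vars=["Country"])

print(melted)
```

The two year columns of two rows each become four rows, one per country and year combination, which is the record-per-row layout used in the rest of this example.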

Basic Line Plot

We can then plot the data for one country, for instance Canada, to see how the data looks.

In [6]:
    # set plot size
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # filter data for Canada and plot
    countryData = le2[(le2.Country == 'Canada')]
    
    # plot values
    plt.plot(countryData['year'], countryData['life_expectancy'], '.')
    plt.title("Canada - Life Expectancy")
    plt.show()

Simple Linear Regression

We can do a simple linear regression on the selected country’s dataset and then plot the regression line against the actual values. We will use the ‘statsmodels’ ols function to perform the linear regression.

In [7]:
    # fit linear regression
    linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
    
    # set plot size
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # plot actual values
    plt.plot(countryData['year'], countryData['life_expectancy'], '.')

    # plot predicted values (from fitted regression)
    plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
    plt.title("Canada - Life Expectancy vs Fitted Linear Regression Line")
    plt.show()
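
For intuition about what a simple linear regression computes, the sketch below fits a line by hand using the textbook least-squares formulas for one predictor. The helper name `fit_line` and the sample points are made up for illustration. This is not how ‘statsmodels’ is implemented internally, but for a single predictor with an intercept it should produce the same two coefficients.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# made-up points lying exactly on the line y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
intercept, slope = fit_line(xs, ys)
print(intercept)
print(slope)
```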

We can show the summary of the linear regression using the summary() function.

Looking at the results for Canada, it seems that the linear model is a good fit as the R-squared is very close to 1.

In [8]:
linearModel.summary()
Out[8]:
OLS Regression Results
Dep. Variable: life_expectancy R-squared: 0.959
Model: OLS Adj. R-squared: 0.958
Method: Least Squares F-statistic: 2269.
Date: Wed, 21 Oct 2015 Prob (F-statistic): 1.40e-69
Time: 13:42:44 Log-Likelihood: -191.68
No. Observations: 100 AIC: 387.4
Df Residuals: 98 BIC: 392.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -468.5760 11.318 -41.402 0.000 -491.035 -446.117
year 0.2743 0.006 47.637 0.000 0.263 0.286
Omnibus: 80.354 Durbin-Watson: 0.602
Prob(Omnibus): 0.000 Jarque-Bera (JB): 820.751
Skew: -2.428 Prob(JB): 5.97e-179
Kurtosis: 16.168 Cond. No. 1.34e+05
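
The R-squared in the summary can be read as the share of the variation in the actual values that the fitted line accounts for. A minimal sketch of that calculation, assuming the usual definition for a model with an intercept (1 minus residual sum of squares over total sum of squares), follows. The helper name `r_squared` and the sample values are made up for illustration.

```python
def r_squared(actual, fitted):
    """Return 1 - SS_res / SS_tot for paired actual and fitted values."""
    mean_actual = sum(actual) / float(len(actual))
    # variation left unexplained by the model
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    # total variation of the actual values around their mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
perfect_fit = [1.0, 2.0, 3.0, 4.0]   # predictions match every point
mean_only = [2.5, 2.5, 2.5, 2.5]     # predictions ignore the trend

print(r_squared(actual, perfect_fit))  # 1.0
print(r_squared(actual, mean_only))    # 0.0
```

A value near 1 therefore means the fitted line leaves little of the variation unexplained, while a value near 0 means the line does no better than predicting the mean.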

We can retrieve various values (beta coefficient values, R-squared) from the model using the following commands:

In [9]:
# model coefficients
linearModel.params
Out[9]:
Intercept   -468.576000
year           0.274273
dtype: float64
In [10]:
# model intercept coefficient (selected by its label)
linearModel.params['Intercept']
Out[10]:
-468.57599994618505
In [11]:
# model slope coefficient (selected by its label)
linearModel.params['year']
Out[11]:
0.27427257049024917
In [12]:
# model R-squared value
linearModel.rsquared
Out[12]:
0.95860297854064103
In [13]:
# model coefficient p-values
linearModel.pvalues
Out[13]:
Intercept    6.846878e-64
year         1.398615e-69
dtype: float64
In [14]:
# model predicted values
linearModel.predict()
Out[14]:
array([ 56.93024511,  57.20451768,  57.47879025,  57.75306282,
        58.0273354 ,  58.30160797,  58.57588054,  58.85015311,
        59.12442568,  59.39869825,  59.67297082,  59.94724339,
        60.22151596,  60.49578853,  60.7700611 ,  61.04433367,
        61.31860624,  61.59287881,  61.86715138,  62.14142395,
        62.41569652,  62.68996909,  62.96424166,  63.23851423,
        63.5127868 ,  63.78705938,  64.06133195,  64.33560452,
        64.60987709,  64.88414966,  65.15842223,  65.4326948 ,
        65.70696737,  65.98123994,  66.25551251,  66.52978508,
        66.80405765,  67.07833022,  67.35260279,  67.62687536,
        67.90114793,  68.1754205 ,  68.44969307,  68.72396564,
        68.99823821,  69.27251079,  69.54678336,  69.82105593,
        70.0953285 ,  70.36960107,  70.64387364,  70.91814621,
        71.19241878,  71.46669135,  71.74096392,  72.01523649,
        72.28950906,  72.56378163,  72.8380542 ,  73.11232677,
        73.38659934,  73.66087191,  73.93514448,  74.20941705,
        74.48368962,  74.75796219,  75.03223477,  75.30650734,
        75.58077991,  75.85505248,  76.12932505,  76.40359762,
        76.67787019,  76.95214276,  77.22641533,  77.5006879 ,
        77.77496047,  78.04923304,  78.32350561,  78.59777818,
        78.87205075,  79.14632332,  79.42059589,  79.69486846,
        79.96914103,  80.2434136 ,  80.51768618,  80.79195875,
        81.06623132,  81.34050389,  81.61477646,  81.88904903,
        82.1633216 ,  82.43759417,  82.71186674,  82.98613931,
        83.26041188,  83.53468445,  83.80895702,  84.08322959])

Split-Apply-Combine

Now that we have the linear regression model for a single country (Canada) and can see that it is a reasonable model to use, it would be nice to be able to see if this holds true for all countries.

It would be very tedious to run a separate linear regression model and plot for each of the 185 countries in the dataset.

Fortunately, we can use the ‘groupby’ function in ‘pandas’ to help us. This function will help us to:

(1) Split the data into separate groups (for each country)

(2) Apply a function (such as a linear regression) for each of the grouped datasets

(3) Combine the results of each of the groups into a single dataframe for straightforward analysis.
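
The three steps above can be spelled out by hand in plain Python to make clear what happens at each stage. This is only a conceptual sketch with made-up records and a made-up helper (`value_range`). It is not how ‘pandas’ implements grouping.

```python
from collections import defaultdict

# Hypothetical long-format records: (country, year, value)
records = [
    ("A", 2000, 70.0), ("A", 2001, 72.0),
    ("B", 2000, 60.0), ("B", 2001, 61.0),
]

# (1) Split: collect the rows belonging to each country
groups = defaultdict(list)
for country, year, value in records:
    groups[country].append((year, value))


# (2) Apply: a function that reduces one group's rows to a single number
def value_range(rows):
    values = [value for _, value in rows]
    return max(values) - min(values)


# (3) Combine: one result per country, gathered into a single structure
results = {country: value_range(rows) for country, rows in groups.items()}
print(results)
```

In this sketch, any function that takes one group's rows and returns a single value could be substituted for `value_range` in step (2).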


Let’s use this functionality to run a regression for all of the countries separately and then compare the R-squared for all the models in a single plot.

In order to do this, first we need to define a function that can be applied to each of the grouped datasets.

This function (to perform a linear regression and return the R-squared value of the model) is defined below:

In [15]:
    def getLinearModelRSquared(modelData):
        linearModel = sm.ols(formula='life_expectancy ~ year', data=modelData).fit()
                             
        return linearModel.rsquared          

We can now use the ‘groupby’ function to apply our new function to the dataset for each of the 185 countries. We will re-use the original ‘le2’ dataframe that has all the melted data for all of the countries in this step.

In [16]:
    # group dataframe by country (split)
    grouped = le2.groupby('Country')
        
    # apply function to each dataset and combine results
    rSquaredValues = grouped.apply(getLinearModelRSquared)

    # show the first 5 computed R-Squared values
    rSquaredValues.head()
Out[16]:
Country
Afghanistan            0.940734
Albania                0.903918
Algeria                0.975977
Angola                 0.954742
Antigua and Barbuda    0.903550
dtype: float64

We now have the R-squared values for all of the 185 linear regressions, one for each country’s dataset. We can now plot the values in a histogram to see how the models’ R-squared values are distributed.

In [17]:
    fig, ax = plt.subplots(figsize=(10, 6))
    rSquaredValues.plot(kind='hist', ax=ax, bins=20)
    plt.title("R-Squared Values for Linear Regression Models")
    
    plt.show()

We can now see that R-squared values for most of the countries are fairly close to 1, so it seems that a linear model is reasonable for many countries.

If we sort the ‘pandas’ Series containing all the R-squared values, we will be able to see which countries have the best fit and which countries have the worst fit using a linear model.

In [18]:
rSquaredValues = rSquaredValues.sort_values()
rSquaredValues.tail()
Out[18]:
Country
Bhutan         0.980740
South Korea    0.981112
Senegal        0.981843
Bolivia        0.983170
Nicaragua      0.984132
dtype: float64
In [19]:
    # filter data for Nicaragua (best fit based on R-squared)
    countryData = le2[(le2.Country == 'Nicaragua')]

    # fit linear regression
    linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
    
    # set plot size
    fig, ax = plt.subplots(figsize=(10, 6))
    
    plt.plot(countryData['year'], countryData['life_expectancy'], '.')
    plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
    plt.title("Nicaragua  - Life Expectancy vs Fitted Linear Regression Line")
    plt.show()
In [20]:
rSquaredValues.head()
Out[20]:
Country
Lesotho      0.505176
Zambia       0.533450
Belarus      0.548027
Swaziland    0.570214
Zimbabwe     0.582147
dtype: float64
In [21]:
    # filter data for Lesotho (worst fit based on R-squared)
    countryData = le2[(le2.Country == 'Lesotho')]

    # fit linear regression
    linearModel = sm.ols(formula='life_expectancy ~ year', data=countryData).fit()
    
    # set plot size
    fig, ax = plt.subplots(figsize=(10, 6))
    
    plt.plot(countryData['year'], countryData['life_expectancy'], '.')
    plt.plot(countryData['year'], linearModel.fittedvalues, 'b')
    plt.title("Lesotho - Life Expectancy vs Fitted Linear Regression Line")
    plt.show()
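
As an alternative to sorting everything and then reading the head and tail, a ‘pandas’ Series also offers nsmallest() and nlargest(), which return only the extremes. A minimal sketch with made-up scores (not the R-squared values computed above):

```python
import pandas as pd

# Hypothetical scores indexed by a made-up country label
scores = pd.Series({"A": 0.91, "B": 0.55, "C": 0.98, "D": 0.72})

# the two lowest values, lowest first
worst = scores.nsmallest(2)
# the two highest values, highest first
best = scores.nlargest(2)

print(worst)
print(best)
```

Note that nlargest() lists the highest value first, whereas the tail of an ascending sort lists it last.
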
End of ‘pandas example’

In-Class Assignment

Please complete the following as part of your work for this class. There are 5 questions and 1 optional bonus question.

You should be able to complete the assignment using materials learnt from the presentation and from the ‘pandas example: linear regression of life expectancy for 185 different countries’ that is part of this document.

Submit the assignment answers as a well-commented Python script named: PandasClass-[YourStudentId].py


Question 1 (conditions, lists, functions and loops)

Write a Python function that will return “odd” if an input number is odd and “even” if an input number is even. One method could be to use the ‘modulo’ or ‘%’ operator as part of a comparison.

Use a ‘for’ loop and a manually created list of integers to test your function for the following 8 integers by printing the results of your function:

-517, -212, -14, 0, 3, 28, 421, 1500


Use the ‘gdppercapita.csv’ file as an input for the following questions. This CSV file contains the GDP per capita for 185 countries in tabular/pivoted form.

Be sure that the CSV file is in the same working directory as your Python script or else it will not be able to find the file when you run your script.

You will likely also need to import the ‘pandas’, ‘statsmodels’ and ‘matplotlib’ packages to complete the following questions.


Question 2 (pandas, file reading and melting)

Read the ‘gdppercapita.csv’ file into a ‘pandas’ dataframe, melt the data and rename the columns so that the dataframe columns look like this:

Country | year | gdp_per_capita

You will also need to convert the year column into a numeric type for further analysis.

Print out the summary statistics (mean, standard error, range) for the gdp_per_capita column.


Question 3 (linear regression and plotting)

Filter the dataset for a country of your choice and run a linear regression of gdp_per_capita against the year.

Print the summary for your regression model and then plot the linear regression line (predicted values) versus the actual values.

Write in a single-line Python comment whether your country’s dataset fits a linear model well.


Question 4 (split-apply-combine method)

Write a generic function that will return the ‘slope coefficient’ for a linear regression of gdp_per_capita against the year for any input dataset.

Using the split-apply-combine method, apply your generic function to create a dataframe containing the ‘slope coefficient’ values for 185 linear models (1 for each of the 185 countries in your melted dataset).

Plot the ‘slope coefficient’ values into a histogram so that you can better review the distribution.

Write in a single-line Python comment what the distribution of ‘slope coefficients’ looks like across all the models.


Question 5 (dataframe sorting)

Sort your dataframe containing the ‘slope coefficient’ values so that you can see what countries have the greatest and lowest ‘slope coefficients’.

Plot on the same figure the regression line for the country you chose in Question 3 and the regression line for the country that has the highest ‘slope coefficient’ in the entire GapMinder dataset. This will allow you to compare the models of the two countries easily.

(If your originally selected country also has the highest ‘slope coefficient’, then plot it against the country with the lowest ‘slope coefficient’.)


Question 6 (optional bonus)

Apply your Python and ‘pandas’ skills to investigate both the ‘gdppercapita.csv’ and ‘lifeexpectancy.csv’ datasets.

See if you can find or plot some interesting results from a combination of the two datasets (maybe scatterplots). You may need to use outside resources for this question if you try it.