Our analysis began with the data collection, parsing, and consolidation steps explained on the previous page. On this page, we discuss the further analysis steps we took using Python and ArcGIS Pro, along with some challenges we encountered with our analysis method and how we proceeded. On the next page, we discuss our findings once the analysis was complete.
Step 1: Importing into ArcGIS Pro and joining data
First, we needed to connect all our demographic and census tract (CT) data files to perform the analysis.
- The school data presented the most significant challenge. Because the post-graduation enrollment statistics we had collected were organized by school district boundaries, not by CT, we needed to perform several conversions:
- We first joined school CSV data to school district shapefiles.
- We then clipped the school districts to the shapefile of Chicago, keeping only the districts needed for our analysis.
- Then we created a centroid point for each CT and performed a spatial join, attaching each CT whose centroid fell within a school district to that district's attribute table.
- This allowed us to connect the two datasets.
- We then performed a join back to the CT layer, so the school district data was present at, and represented by, the CT level.
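The attribute side of this join-back can be sketched in pandas. The district and tract identifiers and column names below are hypothetical, but the merge shows how every tract inherits its district's statistics:

```python
import pandas as pd

# Hypothetical district-level stats (names and values are illustrative only)
districts = pd.DataFrame({
    "district_id": ["D1", "D2"],
    "enrollment_rate": [0.62, 0.48],
})

# Each census tract was assigned its containing district via the centroid spatial join
tracts = pd.DataFrame({
    "tract_id": ["T1", "T2", "T3"],
    "district_id": ["D1", "D1", "D2"],
})

# Joining district stats back onto tracts duplicates school data across tracts
ct_level = tracts.merge(districts, on="district_id", how="left")
print(ct_level)
```

Because T1 and T2 share district D1, they end up with identical school-related values, which is exactly why the resulting scatter plots look striated.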
As a result, multiple census tracts contain the exact same school-related data. This is why some of the resulting scatter plots appear striated: the dependent variable is school-related, but the points are plotted by census tract.
Finally, we exported the data table, which now contained all the CT-level data (ethnicity statistics, income statistics, school district statistics, enrollment statistics, and attained higher education statistics) to be analyzed in Python.
Considerations
Upon joining all the desired fields of data to census tract names, many of the fields (columns) contained empty rows. Python OLS functions do not accept missing values and raise an error when they are passed in. Removing an entire CT from the dataset whenever one of its fields was missing left too few CTs to produce meaningful results. To minimize the number of rows that had to be deleted, some values were entered manually based on intuition and comparison between fields. For example, large swaths of the city lacked values for the proportion of the population that is Asian. Upon reviewing the census tract demographic data, we found that the tracts in those areas were composed of well under 1% Asian population. As such, the null values in the proportion-Asian field were replaced with zeros.
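In pandas, this kind of targeted null replacement might look like the following sketch (the column names are illustrative, not the project's actual field names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tract_id": ["T1", "T2", "T3"],
    "prop_asian": [0.004, np.nan, np.nan],  # nulls in tracts known to be well under 1% Asian
})

# Replace nulls with zero rather than dropping the whole tract,
# preserving enough rows for the OLS to produce meaningful results
df["prop_asian"] = df["prop_asian"].fillna(0.0)
print(df)
```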
Step 2: OLS Regression in Python
To determine which variables in our dataset had the most significant effect on the college enrollment rate of high school graduates, an OLS regression was run in Python. The dependent variable, “college enrollment within 16 months of graduation”, was regressed against the independent variables “Median Income /1000”, “Proportion of population black”, “Diversity”, “Higher Education”, and “Proportion of population ESL”.
OLS Function:
```python
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

def ols(df, x: str, y: str):
    independent_1 = df[x]
    dependent_1 = df[y]

    # Graph: scatter plot with a fitted regression line
    sns.regplot(x=independent_1, y=dependent_1)

    # -------------------------------------------------------------
    # OLS and full regression results
    x_arr = np.array(independent_1).reshape((-1, 1))  # the x
    y_arr = np.array(dependent_1)                     # the y
    x_arr = sm.add_constant(x_arr)                    # add the intercept term
    model = sm.OLS(y_arr, x_arr)                      # model for OLS using sm
    results = model.fit()                             # fitting the OLS model
    print(results.summary())                          # printing OLS results

    # ------------------------------------------------------------
    # Linear regression coefficient of determination results
    model1 = LinearRegression()                       # model for R^2
    model1.fit(x_arr, y_arr)                          # fitting model for R^2
    r_sq = model1.score(x_arr, y_arr)                 # finding R^2
    print('coefficient of determination:', r_sq)
```
Step 3: Visual Analysis using Plotly
After performing the OLS in Python, we also examined the data using Plotly Express, a visualization package in Python, to better understand possible connections between the different variables. Plotly allowed us to plot a scatterplot representing three variables at once, using a colour range to represent the third variable. This gave us a better understanding of how each variable may be interacting with and impacting the others.
Plotly Function:
```python
import plotly.express as px

def show_scatterplot(df, y_var, x_var, other_var):
    plot = px.scatter(df,
                      x=x_var,                  # x axis
                      y=y_var,                  # y axis
                      trendline="ols",          # trendline type
                      color=other_var,          # colour based on the third variable
                      color_continuous_scale=px.colors.sequential.Darkmint)
    plot.update_layout(title_text=(x_var + ' vs. ' + y_var),
                       title_x=0.5)             # centred title
    plot.update_traces(textposition='top center')
    results = px.get_trendline_results(plot)
    print(results.px_fit_results.iloc[0].summary())  # OLS summary for the trendline
    plot.show()
```
GIS Regression Analyses
The CT CSV file can be thought of as containing two categories of data: school-related and census-tract-related.
The first regression we performed (despite the nominal values found in Python) was a geographically weighted regression (GWR) within ArcGIS Pro, using the high school graduate post-secondary enrollment rate as the dependent variable and diversity, median income, ESL, and higher education as explanatory variables (note: median income and college enrollment were first rescaled to values between 0 and 1 because the other variables were already in this format). This GWR was conducted to find localized significance where it was not found globally. The coefficient of each explanatory variable was mapped on the same scale so the variables could be compared side by side, as seen in the findings section.
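The 0-to-1 rescaling mentioned above is a standard min-max normalization; a sketch in pandas, with illustrative column names and values:

```python
import pandas as pd

df = pd.DataFrame({"median_income": [25000, 60000, 110000]})

# Rescale to [0, 1] so GWR coefficients are comparable with the
# other variables that already lie in this range
col = df["median_income"]
df["median_income_01"] = (col - col.min()) / (col.max() - col.min())
print(df)
```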
We conducted a second GWR to explain census tract variables with census tract data, where the dependent variable was the Higher Education level of each CT. The independent variables were income, ESL, diversity, and proportion white.
***It is important to caution that the findings of this GWR do not necessarily explain the influence of these variables on college enrollment rates. Higher education represents the population with attained degrees and does not control for people moving to and from CTs. In this case, income goes from being a driver of higher education to being a result of it: better jobs make it possible to move to more expensive neighbourhoods, raising a CT's percentage of residents with diplomas.
The last GWR we conducted compared the different school-related variables, with college enrollment once again as the dependent variable and the proportion of students from low-income families as the sole explanatory variable. Interestingly, other variables, such as the proportion of white students and the student dropout rate, exhibited multicollinearity to the point where the model did not run.