Methodology

Spatial Unit

CalEnviroScreen 4.0 covers the entire state of California and uses the census tract as its spatial unit. To restrict the data to the desired geographic scope, we clipped the layers to the boundaries of the counties that make up each metropolitan area.

Multivariate Cluster Analysis

We used the Spatially Constrained Multivariate Clustering tool in ArcGIS to determine how the relative mix of air pollutants differs between communities. We did not specify cluster size constraints or a number of clusters; these parameters were left at their defaults. The tool uses unsupervised machine learning to find natural clusters, grouping features so that those within each cluster are as similar as possible (ESRI). Boxplots were generated to visualize the properties of the clusters.

Average Air Pollution Percentile

Although CalEnviroScreen provides a score for overall pollution, it does not provide one for air pollution alone. We therefore produced a metric that identifies the communities with the highest combined burden from the five air pollutants. Using Calculate Field in ArcGIS, we created a new field holding the sum of the five pollutant percentiles. A second new field divided that sum by the number of variables (five) to give the mean percentile.
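
The two Calculate Field steps above amount to a simple mean across the five percentile fields. The following is a minimal sketch in plain Python rather than the ArcGIS tool itself; the field names and the example values are hypothetical placeholders, not actual CalEnviroScreen field names or data.

```python
# Hypothetical field names standing in for the five air pollutant
# percentile fields; the real CalEnviroScreen field names may differ.
AIR_FIELDS = [
    "pollutant_1_pctl",
    "pollutant_2_pctl",
    "pollutant_3_pctl",
    "pollutant_4_pctl",
    "pollutant_5_pctl",
]


def mean_air_percentile(tract, fields=AIR_FIELDS):
    """Sum the percentile fields, then divide by the number of fields."""
    total = sum(tract[name] for name in fields)  # first new field: the sum
    return total / len(fields)                   # second new field: the mean


# Made-up values for one census tract, for illustration only.
example_tract = {
    "pollutant_1_pctl": 80.0,
    "pollutant_2_pctl": 60.0,
    "pollutant_3_pctl": 40.0,
    "pollutant_4_pctl": 70.0,
    "pollutant_5_pctl": 50.0,
}

print(mean_air_percentile(example_tract))
```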

Regression Analysis

We first ran an exploratory regression to identify the explanatory variable most strongly associated with each disease, and then ran Generalized Linear Regression, Spatial Lag Regression, and Geographically Weighted Regression to determine the best model.

Exploratory Regression and Selection of Model

We used exploratory regression to find a suitable model for each disease. The dependent variables are asthma and cardiovascular disease; because a regression takes only one dependent variable at a time, each disease had its own regression. The model containing all five explanatory variables had the highest R-squared and the lowest AICc. We initially planned to include all five variables in our model, but the GWR analysis in ArcGIS (discussed in later steps) would not run without errors when the explanatory variables were multicollinear, that is, when they carried redundant information. After testing every combination of variables, we found that GWR would run with only one variable. We therefore reframed our analysis around the top explanatory variable for each disease.

Generalized Linear Regression

GLR was run in ArcGIS for the top variable identified in the exploratory regression. For a continuous dependent variable, GLR is equivalent to ordinary least squares (OLS) regression, which is an aspatial method.
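
To illustrate what an OLS fit with a single explanatory variable computes, here is a minimal sketch in plain Python. It is not the ArcGIS implementation, the function name `ols_fit` is our own, and the input values are made up for illustration.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of squares of x and the cross-product of x and y about their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared


# Made-up pollutant percentiles (x) and disease rates (y).
x_values = [10.0, 20.0, 30.0, 40.0]
y_values = [15.0, 25.0, 35.0, 45.0]
print(ols_fit(x_values, y_values))
```

Because the fit uses no location information, swapping the positions of tracts on the map would leave the result unchanged, which is what makes the method aspatial.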

Spatial Lag Regression

Spatial lag regression is a variant of OLS that adds a variable reflecting conditions in neighboring tracts, so the model accounts for spatial effects and indicates how strongly each tract's value depends on its neighbors. This regression was conducted in GeoDa. The spatial weights file used queen contiguity with an order of 1, meaning that a tract's neighbors are all tracts that share a border or a corner with it, and neighbors of neighbors are not included.
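
The neighbor definition and the lagged variable can be sketched on a toy example. The code below assumes a regular grid in place of irregular census tracts and assumes equal (row-standardized) weights, so each cell's lag is the mean of its first-order queen neighbors; it is an illustration of the idea rather than GeoDa's implementation, and the grid values are made up.

```python
def queen_neighbors(row, col, n_rows, n_cols):
    """Return the cells sharing an edge or a corner with (row, col)."""
    neighbors = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # a cell is not its own neighbor
            r, c = row + d_row, col + d_col
            if 0 <= r < n_rows and 0 <= c < n_cols:
                neighbors.append((r, c))
    return neighbors


def spatial_lag(grid):
    """Return a grid where each cell holds the mean of its queen neighbors."""
    n_rows, n_cols = len(grid), len(grid[0])
    lag = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            nbrs = queen_neighbors(r, c, n_rows, n_cols)
            lag[r][c] = sum(grid[i][j] for i, j in nbrs) / len(nbrs)
    return lag


# Made-up values on a 3 x 3 grid standing in for census tracts.
toy_grid = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
toy_lag = spatial_lag(toy_grid)
print(toy_lag[1][1])  # lag of the center cell, which has 8 neighbors
```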

Geographically Weighted Regression

GWR is a local variant of OLS that fits a separate regression equation for every census tract in the dataset. It shows how the relationship between the explanatory and dependent variables varies across space. GWR was conducted in ArcGIS. The model type was continuous (Gaussian), and the neighborhood type was number of neighbors. The Golden Search method was used to let the software select the optimal number of neighbors.
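
The idea behind a golden-section search for the number of neighbors can be sketched as follows. This is a generic illustration, not ArcGIS's implementation: the function `golden_search_int`, the search range, and the toy score are all assumptions. In practice the score would be a model-fit criterion such as AICc computed by refitting GWR at each candidate neighbor count, and the search assumes that score has a single minimum over the range.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618


def golden_search_int(score, low, high):
    """Return the integer in [low, high] minimizing a single-minimum score."""
    a, b = float(low), float(high)
    # Narrow the bracket [a, b] by comparing two interior points.
    while b - a > 5.0:
        c = b - INV_PHI * (b - a)
        d = a + INV_PHI * (b - a)
        if score(math.floor(c + 0.5)) < score(math.floor(d + 0.5)):
            b = d  # the minimum lies to the left of d
        else:
            a = c  # the minimum lies to the right of c
    # Check the few remaining integer candidates directly.
    first = max(low, math.floor(a))
    last = min(high, math.ceil(b))
    return min(range(first, last + 1), key=score)


def toy_score(k):
    """Hypothetical stand-in for AICc, smallest at 47 neighbors."""
    return (k - 47) ** 2


print(golden_search_int(toy_score, 30, 1000))
```

The benefit of this approach is that the model is refit at only a handful of candidate neighbor counts instead of at every value in the range.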