Statistics Part II — 28 January 2015

Simple linear regression relates one dependent variable to a single independent variable.  The relationship is expressed as the linear equation  y = a + bx.  The method creates a best-fit line by minimizing the sum of squared residuals.
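As a minimal sketch of that least-squares fit, the intercept a and slope b have closed-form solutions (the data here is hypothetical):

```python
def fit_line(x, y):
    """Return (a, b) minimizing the sum of squared residuals for y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / \
        sum((xi - x_mean) ** 2 for xi in x)
    # Intercept: the fitted line passes through the point of means
    a = y_mean - b * x_mean
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
# The data lie exactly on y = 0 + 2x, so a = 0.0 and b = 2.0
```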

Multiple regression is useful when there are several factors that affect the dependent variable.  It is modeled with an equation in which each X has its own coefficient, plus a random error term:  y = a + b1x1 + b2x2 + ... + e.
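A short sketch of fitting such a model, using NumPy's least-squares solver on hypothetical, noise-free data (so the known coefficients are recovered exactly):

```python
import numpy as np

x1 = np.array([0., 1., 2., 3.])
x2 = np.array([1., 0., 2., 1.])
y = 1 + 2 * x1 + 3 * x2                     # exact relationship, no error term

# Design matrix: a column of ones for the intercept, then one column per X
X = np.column_stack([np.ones(len(y)), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef recovers the intercept and slopes: [1, 2, 3]
```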

R squared is the coefficient of determination: the proportion of variance in the dependent variable that the model explains.  It indicates the strength of the relationship, although a high R squared on its own does not establish the validity of the model.
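The definition can be written directly: one minus the ratio of the residual sum of squares to the total sum of squares (hypothetical values below):

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

r_squared([1, 2, 3], [1, 2, 3])   # perfect fit: 1.0
r_squared([1, 2, 3], [1, 2, 4])   # one unit of residual error: 0.5
```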

There are several ways to examine the utility of a model.  P values indicate whether each individual variable contributes significantly, while the F statistic tests the significance of the model as a whole.  AICc scores balance goodness of fit against model complexity, rewarding parsimony.  You should also check for multicollinearity, in which independent variables are correlated with one another and effectively tell the same story.  A model may also suffer from endogeneity if there are circular or reversed causal relationships between the variables.  Finally, it is important to consider model specification, since omitting a relevant variable can bias the remaining coefficients.
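One of these diagnostics is easy to illustrate: the variance inflation factor (VIF) for multicollinearity, which regresses each independent variable on the others and reports 1 / (1 - R squared).  A common rule of thumb treats VIF above about 10 as problematic.  This is a sketch with hypothetical data in which x2 is nearly twice x1:

```python
import numpy as np

def vifs(X):
    """Variance inflation factor for each column of X (predictors only)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of column j on the remaining columns
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1 / (1 - r2) if r2 < 1 else float("inf"))
    return out

x1 = np.array([1., 2., 3., 4., 5.])
x2 = 2 * x1 + np.array([0.1, -0.1, 0.0, 0.1, -0.1])   # nearly collinear with x1
v = vifs(np.column_stack([x1, x2]))
# Both VIFs are far above 10, flagging the redundancy
```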

Geography plays a very large role in the statistical analysis of data.  As previously mentioned, spatial autocorrelation and the MAUP are issues that the researcher should take into account.  Ordinary Least Squares, probably the best-known type of regression analysis, produces a single global model and is therefore better suited to non-spatial data.  Geographically Weighted Regression, by contrast, creates local models that take geography into account, and manipulating the bandwidth, kernel type, and other parameters gives more control over how those local models are built.
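The core idea of the local fit can be sketched as weighted least squares, where a kernel downweights observations far from the target location.  This is a simplified illustration with hypothetical data and a Gaussian kernel; real GWR implementations also handle bandwidth selection and local diagnostics:

```python
import numpy as np

def gwr_at(point, coords, x, y, bandwidth):
    """Fit y = a + b*x at one location, weighting observations by a
    Gaussian kernel of their distance from the target point."""
    d = np.linalg.norm(coords - point, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)      # smaller bandwidth = more local
    X = np.column_stack([np.ones(len(y)), x])
    XtW = X.T * w                                # weight each observation
    return np.linalg.solve(XtW @ X, XtW @ y)     # weighted normal equations

coords = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 2.], [0.5, 0.5]])
x = np.array([0., 1., 2., 3., 4., 5.])
y = 1 + 2 * x                                    # spatially uniform relationship
beta = gwr_at(np.array([0.5, 0.5]), coords, x, y, bandwidth=1.0)
# Because the relationship is the same everywhere, the local fit
# recovers the global coefficients [1, 2]; with spatially varying data,
# the coefficients would differ from place to place.
```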
