Statistics

26 January 2015

Statistics are a useful tool for GIS because they provide methods to explore data, understand patterns and relationships, and make predictions.  Data can be classified as nominal, ordinal, interval, or ratio, and can also be examined as samples or as populations.

There are many ways to deal with a set of data: examinations can be performed graphically or numerically.  Data may also be derived (e.g. percentages) as opposed to raw numbers.

When looking to summarize data, there are many options, depending on the type of information you are examining.  The following are some of the ways to summarize a set of data:

  • Measures of central tendency: mean, median, mode, etc.
  • Measures of skewness: if you plot the distribution, are there uneven tails on either side?
  • Measures of kurtosis: whether the distribution has a high peak or a low peak
  • Z-score: a measure of the relationship between a score and the mean of the data set
  • Arithmetic mean: average
  • Geometric mean: for percentage data
  • Harmonic mean: average of rates; n divided by the sum of (1/x) over all values
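As a quick sketch, most of these summary measures are available directly in Python's standard-library `statistics` module (the data values here are purely illustrative):

```python
import statistics

data = [2.0, 4.0, 4.0, 5.0, 8.0]   # illustrative values

mean = statistics.mean(data)             # arithmetic mean (average)
median = statistics.median(data)         # middle value
mode = statistics.mode(data)             # most frequent value
gmean = statistics.geometric_mean(data)  # for percentage/ratio data
hmean = statistics.harmonic_mean(data)   # average of rates: n / sum(1/x)

# z-score: how many standard deviations a value lies from the mean
stdev = statistics.stdev(data)
z = (8.0 - mean) / stdev
```

Skewness and kurtosis are not in the standard library, but follow the same pattern: they are the third and fourth standardized moments of the deviations from the mean.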

As well, one can look at the relationships between values in a data set.  For these, you can use measures of association, such as various types of correlation:

  • Pearson’s R
  • Spearman’s rho (rank correlation)
  • Crosstabulation
  • Chi-square statistic
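To illustrate the difference between the first two measures, here is a pure-Python sketch (the ranking helper is a simplification that ignores ties; real implementations average tied ranks):

```python
def pearson_r(x, y):
    """Pearson's product-moment correlation: measures linear association."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_r(x, y):
    """Spearman's rank correlation: Pearson's r applied to ranks,
    so it captures any monotonic (not just linear) relationship."""
    def ranks(v):
        # simplified ranking: assumes no ties in the data
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 16, 32]   # monotonic but not linear
# Spearman sees a perfect monotonic relationship (≈ 1); Pearson does not
```

Because Spearman's coefficient is Pearson's r applied to the ranks, it is ≈ 1 for any strictly increasing relationship, while Pearson's r is 1 only for a perfectly linear one.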

Another way to look at relationships is through regression analysis. One common form of regression analysis is Ordinary Least Squares (OLS), which chooses the model that minimizes the sum of squared errors: the residuals are squared and then summed.
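A minimal sketch of simple linear OLS, using the closed-form solution that minimizes the sum of squared residuals (the data values are made up for illustration):

```python
def ols_fit(x, y):
    """Simple linear OLS: choose intercept a and slope b that
    minimize the sum of squared residuals sum((y_i - (a + b*x_i))**2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]           # illustrative observations
a, b = ols_fit(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)   # the quantity OLS minimizes
```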

There are several issues to take note of when modeling data.  Ideally, you want simplicity and parsimony: explaining the most with the fewest variables. You must watch out for multicollinearity, where two or more variables represent the same thing.  You should also check whether your data are homoskedastic (residuals are scattered evenly along a horizontal line) or heteroskedastic (the variability of the residuals changes across the range).  One other important factor, when looking at data involving geography, is autocorrelation.  In general, most geographic data are spatially autocorrelated, so in performing analyses you can use spatial declustering or other methods to reduce bias.
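One common way to quantify spatial autocorrelation is Moran's I; here is a toy pure-Python version (the four-location adjacency example is hypothetical):

```python
def morans_i(values, weights):
    """Moran's I: a global measure of spatial autocorrelation.
    weights[i][j] is the spatial weight between locations i and j
    (e.g. 1 for neighbours, 0 otherwise); the diagonal is 0."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(weights[i][j] for i in range(n) for j in range(n))
    return (n / w_sum) * (num / den)

# Hypothetical example: four locations along a line, adjacent pairs
# as neighbours.  Values rise steadily, so similar values cluster.
values = [1.0, 2.0, 3.0, 4.0]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
i = morans_i(values, W)   # positive: similar values near each other
```

Positive values indicate clustering of similar values, negative values indicate a checkerboard-like pattern, and values near the expectation under randomness indicate little spatial structure.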

When determining which model to use, look at both the R-squared value and an information criterion such as the AIC.  The AIC weighs how well the model fits against how many parameters it uses, so you can trade off the complexity of the model against its explanatory power.
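For least-squares models, the AIC can be computed (up to an additive constant) from the residual sum of squares; the numbers below are hypothetical, purely to show the trade-off:

```python
import math

def aic_ols(rss, n, k):
    """AIC for a least-squares model, up to an additive constant:
    n*ln(RSS/n) + 2k, where k counts the estimated parameters.
    Lower AIC indicates a better fit/complexity trade-off."""
    return n * math.log(rss / n) + 2 * k

# Hypothetical comparison: a 2-parameter model vs a 5-parameter model
# fitted to the same n = 50 observations.
simple = aic_ols(rss=120.0, n=50, k=2)
complex_ = aic_ols(rss=110.0, n=50, k=5)
# Extra parameters only help if the drop in RSS outweighs the 2k penalty.
```

In this made-up case the simpler model wins: its slightly worse fit costs less than the three extra parameters of the larger model.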

Here is some regression analysis terminology:

  • Simple linear: model uses just x and y
  • Multiple: multiple independent variables
  • Multivariate: multiple dependent and independent variables (canonical analysis)
  • SAR (Spatial Autoregressive model)
  • CAR (Conditional Autoregressive model)
  • Logistic: binary data
  • Poisson: count data
  • Ecological: inferring individual-level relationships from aggregate (group-level) data
  • Hedonic: determine the value of something by assigning value to attributes of that thing
  • Analysis of variance: analysis of differences between means
  • T-test: difference between two means
  • Analysis of covariance: analysis of the linear relationship between variables

One way to account for spatial autocorrelation and other issues introduced by geography is to use methods such as Geographically Weighted Regression (GWR).  These models account for local variation, as opposed to creating a single global model.  Another way to reduce the effects of spatial autocorrelation is to adjust the bandwidth: small areas generally display less correlation than the larger area as a whole.
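A toy sketch of the idea behind GWR: fit a weighted least-squares line at one target location, with each observation down-weighted by a Gaussian kernel of its distance from that location, controlled by the bandwidth (the function name and data layout are illustrative, not a real GWR library API):

```python
import math

def gwr_point(x, y, coords, target, bandwidth):
    """Toy geographically weighted regression at one target location:
    a weighted least-squares line where each observation's weight is a
    Gaussian kernel of its distance from the target.  A small bandwidth
    gives a very local fit; a large one approaches the global OLS model."""
    w = [math.exp(-0.5 * (math.dist(c, target) / bandwidth) ** 2)
         for c in coords]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / \
        sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    a = my - b * mx
    return a, b
```

Re-running `gwr_point` at many target locations yields a surface of locally varying coefficients, which is what distinguishes GWR from a single global regression.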

To test the statistical analysis you have performed, you can use sensitivity analysis to see how the results respond to changes in your analytical choices.  These can include the model, the decay function, the bandwidth, and the choice of points or centroids.
