Monthly Archives: February 2014

Effects of dimensionality on distance and probability density in climate space

Summary

This post explores the effects of the so-called “curse of dimensionality” on Mahalanobis distance metrics. In simulated and real data, I demonstrate that the meaning of Mahalanobis units changes with increasing dimensionality: observations recede from any reference coordinate due to the progressive exclusion of data space from the unit sphere. The result is that Euclidean and Mahalanobis distance units do not provide a direct measure of probability density in more than one dimension. It turns out that these dimensionality effects are the essence of the statistical theory of the chi-square distribution. A meaningful distance/dissimilarity metric can be created by dividing distance by the square root of the dimensionality of the data space, which appears to create a reasonably intuitive metric based on the probability density of multivariate normal data.

Introduction

The premise of my research so far has been that the historical range of interannual climatic variability provides an ecologically meaningful metric of climatic differences over space and time. Building on the standardized Euclidean distance approach of Williams et al. (2007), I have developed a Mahalanobis metric based on the average standard deviation of historical interannual climatic variability across representative locations in any given study area. My intent is to use this metric to measure the degree to which future climatic conditions differ from those found in the current climatic classification. In other words, I intend to use the standard deviation of the historical range of variability as a unit of distance between future conditions and present conditions.

The first indication that something was missing from this approach was when I made maps of the distance from 1971-2000 normals to the 1971-2000 normals for a single location representing a BEC variant. For example, in the climatic distance map for the CDFmm (Figure 1), a very limited area is within 1 unit of the centroid, and most of the CDFmm is outside 2 units. If 1 M unit represents a standard deviation of temporal climatic variability, I would have expected most of the climate normals in the Georgia Basin to be within 2, if not 1, standard deviations from the centroid location.

Two reasons came to mind for the apparent overestimation of climatic distance: The first potential reason is overfitting. I used 10 principal components, and I standardized all the PCs to achieve a Mahalanobis metric. Given that the 10th PC is given the same weight as the first, some amount of overfitting might be expected. However, I demonstrated in my last post that overfitting due to retained dimensions was not apparent in climate year classification results. Furthermore, adding dimensions with low spatial variability should reduce a Mahalanobis dissimilarity estimate, i.e. do the opposite to what seems to be the problem. The other possible reason is the “curse of dimensionality.” I looked into it, and it turns out that my thinking about distance was neglecting some fundamental geometric realities.

Figure 1: Mahalanobis distance from the 1971-2000 climatic normals in the Pacific Northwest to a location representing the average climate of the CDFmm.

How does the map change with different numbers of PCs?

The obvious first step is to see how the map changes at different levels of dimensionality (numbers of PCs retained) (Figure 2). One principal component is obviously too little information: false analogues are abundant in the interior rainshadow. The 2-PC climate space yields much better results, as would be expected because there is another key dimension of climatic variability to measure distance in. The 4 and 10-PC climate spaces show the core areas becoming progressively more distant, even though the outer limits of the range of analogues contract only subtly. Given that my previous posts have demonstrated that the spatial variation in climate is predominantly captured in the first three PCs, this non-proportionate “distancing” phenomenon is suspect.

Figure 2: climatic distance to the CDFmm climatic centroid at different levels of dimensionality.

Another way to see the effect of dimensionality is in histograms of the distance of individual climate years to their own centroid (Figure 3). In one dimension, the distance to centroid takes the expected half-normal distribution, in which most climate years are within one M unit (i.e. one standard deviation) of their centroid, and almost all are within two M units. However, this distribution becomes altered in climate spaces of more than one dimension. The distances to centroid increase at higher dimensions, to the extent that in 10 dimensions there are no climate years within 1 M unit of the reference condition (i.e. the origin of the climate space). This indicates that the meaning of distance and probability density changes with increasing dimensionality.

Figure 3: effect of dimensionality on the Mahalanobis distance of CDFmm 1951-2010 climate years to their own centroid.

The shift in the distribution of distances between the mean of a sample and its observations has a very simple and intuitive geometric cause. Figure 4 is a plot of the CDFmm climate years in the first two PCs. By definition, about two-thirds of climate years lie within one standard deviation of the centroid on PC1 (the area shaded gray). However, in two-dimensional space, a much smaller proportion of climate years are within a distance of one M unit of the centroid, i.e. within the unit circle. It follows that in three dimensions, even fewer observations would be within the unit sphere.

Figure 4: 2D demonstration of the reason for the exclusion of observations from the region of 1 standard deviation around the mean.
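The geometric argument can be checked with a small simulation. The following is a minimal sketch, not the code behind the figures: it assumes independent standard normal data in every dimension, and the function name and sample size are chosen here for illustration. For standard normal data, theory puts the share of observations within one unit of the origin at roughly 68% in one dimension, 39% in two, and 20% in three.

```python
import random


def fraction_within_unit_distance(dims, n=5000, seed=1):
    """Share of simulated standard normal points within 1 unit of the origin."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        # Squared Euclidean distance of one simulated observation from the origin
        squared_distance = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dims))
        if squared_distance <= 1.0:
            inside += 1
    return inside / n


for dims in (1, 2, 3, 10):
    print(dims, round(fraction_within_unit_distance(dims), 3))
```

In ten dimensions the simulated share is close to zero, which is consistent with the absence of climate years within 1 M unit in the 10-PC histogram.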

Simulations

The relationship between distance, probability density, and dimensionality can be explored through simple simulations of random data. The first simulation is simply a multivariate normal distribution with a standard deviation of one in each of i dimensions. At dimensionality higher than one, distances to centroid rapidly approach a Gaussian-like distribution with a standard deviation of 0.707 (the reciprocal of the square root of 2). This distribution is maintained at dimensionality higher than 3, even though the distance from the centroid increases. As a result, the ratio of the distance to centroid of the closest and furthest observations approaches 1 at very high dimensionality. The explicit description of this effect is generally attributed to Beyer et al. (1999).

Figure 5: distance of observations of a simulated random normal sample (N=1000, sd=1) from their own centroid.
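A sketch of how a simulation of this kind could be set up, using only the Python standard library. The sample size and standard deviation follow the Figure 5 caption; the function name, the seed, and the set of dimensionalities are assumptions made for illustration.

```python
import math
import random
import statistics


def distances_to_centroid(dims, n=1000, sd=1.0, seed=1):
    """Euclidean distance of each simulated observation to the sample centroid."""
    rng = random.Random(seed)
    sample = [[rng.gauss(0.0, sd) for _ in range(dims)] for _ in range(n)]
    # Column means give the centroid of the simulated sample
    centroid = [statistics.fmean(column) for column in zip(*sample)]
    return [math.dist(row, centroid) for row in sample]


for dims in (1, 2, 3, 10, 100):
    d = distances_to_centroid(dims)
    # Columns: dimensionality, mean distance, sd of distances, nearest/furthest ratio
    print(dims,
          round(statistics.fmean(d), 2),
          round(statistics.stdev(d), 2),
          round(min(d) / max(d), 2))
```

The third column should settle near 0.707 while the mean keeps growing, and the last column (nearest divided by furthest distance) should climb toward 1.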

Simulation 2 explores the case of two random normal samples with the same centroid but different standard deviations (Figure 6). Although the two samples overlap in low-dimensional space (<4 dimensions), they diverge from the mean (the origin) at different rates such that they are completely separate in high-dimensional space. The standard deviation of distances to centroid is the standard deviation of the univariate data distribution divided by the square root of two (Figure 7). The mean distance to centroid is approximately the standard deviation of the data multiplied by the square root of the dimensionality.

Figure 6: Simulation 2—Effect of dimensionality on distance to centroid in random multivariate normal data with the same mean but different standard deviations.

Figure 7: Simulations used to infer geometrical relationships between dimensionality and the distribution of distances to centroid.
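The separation of two samples that share a centroid can be illustrated with a short sketch. The standard deviations used here (1 and 2), the sample size, and the overlap measure are assumptions for illustration, not the settings behind Figures 6 and 7.

```python
import math
import random


def distances_to_origin(dims, sd, n=500, seed=1):
    """Distances to the origin for a zero-mean normal sample with the given sd."""
    rng = random.Random(seed)
    return [math.sqrt(sum(rng.gauss(0.0, sd) ** 2 for _ in range(dims)))
            for _ in range(n)]


def overlap_fraction(dims, sd_narrow=1.0, sd_wide=2.0):
    """Share of the wide sample that lies no further from the origin than the
    furthest observation of the narrow sample."""
    narrow = distances_to_origin(dims, sd_narrow, seed=1)
    wide = distances_to_origin(dims, sd_wide, seed=2)
    limit = max(narrow)
    return sum(d <= limit for d in wide) / len(wide)


for dims in (1, 2, 4, 10, 100):
    print(dims, round(overlap_fraction(dims), 3))
```

The overlap should be large in one dimension and vanish at high dimensionality, because the two distance distributions are centred near 1 x sqrt(d) and 2 x sqrt(d) while their spreads stay near 0.71 and 1.41.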

Simulation 3 adds an additional multivariate normal sample with a standard deviation of 1 and a mean of 3 in all dimensions (Figure 8). The distance from the origin to these observations (red distribution) is slightly greater than that of the multivariate normal sample with a mean of zero and standard deviation of 3 (grey distribution), though the mathematical relationship is not immediately clear. The standard deviation of the red sample is slightly less than one (0.98) at all levels of dimensionality. This is likely because the tail of the univariate distribution of this sample crosses the origin, resulting in a slightly folded distribution of distances to the origin. Nevertheless, this simulation indicates that dispersion (standard deviation of samples) is preserved in multidimensional distances.

Figure 8: Simulation 3—Effect of dimensionality on distance to an additional multivariate normal sample with a standard deviation of 1 and a mean of 3 in all dimensions.

The last simulation investigates the effect of a sample mean being different in only one of many dimensions (Figure 9). In this case, the mean of distances of the different sample (red) approaches that of the reference sample (blue) in higher dimensionality. In other words, the distance between samples is influenced by the number of dimensions in which the difference between the samples occurs.

Figure 9: Simulation 4—Effect of dimensionality on distance to the origin in random samples with the same standard deviation but different means. The red sample has a mean of 3 in the first dimension, and a mean of zero in all other dimensions.
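A sketch of this last case, assuming unit-variance normal data and a shift of 3 in the first dimension only, as in the Figure 9 caption. The function name, sample size, and seed are illustrative. For such data the expected squared distance to the origin is d for the reference sample and d + 9 for the shifted sample, so the ratio of typical distances should shrink toward 1 as d grows.

```python
import math
import random


def mean_distance_to_origin(dims, shift_first_dim=0.0, n=1000, seed=1):
    """Mean distance to the origin of a unit-variance normal sample whose
    mean is shifted in the first dimension only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        point = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        point[0] += shift_first_dim
        total += math.sqrt(sum(x * x for x in point))
    return total / n


for dims in (1, 2, 10, 100):
    shifted = mean_distance_to_origin(dims, shift_first_dim=3.0)
    reference = mean_distance_to_origin(dims)
    print(dims, round(shifted, 2), round(reference, 2), round(shifted / reference, 2))
```

Both calls reuse the same seed, so the two samples differ only by the shift; this keeps the comparison free of sampling noise between the samples.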

The simulations suggest the following attributes about the relationship between distance, probability density, and dimensionality:

  • The relationship between distance and probability density is non-stationary under varying dimensionality, i.e. the probability of an observation occurring within a distance of one standard deviation of the sample mean decreases as the dimensionality increases.
  • The probability distributions of multivariate normal data are hollow hyperspheres. The mean of the hyperspherical probability distribution is located at a distance from the centroid equaling the standard deviation of the multivariate normal data multiplied by the square root of the dimensionality. The standard deviation of the hyperspherical probability distribution is equal to the standard deviation of the multivariate normal data divided by the square root of two.
  • The overlap between probability density distributions of multivariate normal data with different dispersion (i.e. different standard deviations) approaches zero with increasing dimensionality. In other words, given two different samples of the same mean but different standard deviations, the probability that observations from these two samples could occur in the same location approaches zero as dimensionality increases.
  • The dispersion of probability density is preserved in high-dimensional space.
  • The influence that the data distribution in any one dimension has on the total probability density of distance to centroid decreases with increasing dimensionality.

The geometrical phenomena described above are the essence of the statistical theory of the chi-square distribution. The squared Euclidean distances of standard (i.e. unit variance) multivariate normal data to their own centroid approach a chi-square distribution with degrees of freedom equalling the dimensionality of the data space (Wilks 2006). It follows that the squared Mahalanobis distance to centroid of any multivariate normal distribution will follow a chi-square distribution.
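The link between the chi-square distribution and the simulated values (a mean distance near the square root of the dimensionality, and a standard deviation near 0.707) can be made explicit with a first-order approximation. The notation below is introduced here for illustration: z is a standard normal vector in d dimensions and D is its distance to the centroid.

```latex
% z = (z_1, ..., z_d) with independent z_i ~ N(0, 1); D is the distance to the centroid
D^2 = \sum_{i=1}^{d} z_i^2 \sim \chi^2_d, \qquad
\mathrm{E}[D^2] = d, \qquad
\mathrm{Var}[D^2] = 2d

% First-order (delta-method) approximation for D = \sqrt{D^2}
\mathrm{E}[D] \approx \sqrt{d}, \qquad
\mathrm{Var}[D] \approx \frac{\mathrm{Var}[D^2]}{4\,\mathrm{E}[D^2]}
               = \frac{2d}{4d} = \frac{1}{2}
\quad\Longrightarrow\quad
\mathrm{sd}[D] \approx \frac{1}{\sqrt{2}} \approx 0.707

% Data with standard deviation \sigma in every dimension simply rescale D
\mathrm{E}[\sigma D] \approx \sigma\sqrt{d}, \qquad
\mathrm{sd}[\sigma D] \approx \frac{\sigma}{\sqrt{2}}
```

The approximation improves as d increases, which is consistent with the simulated standard deviation approaching 0.707 only above one dimension.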

The effects of dimensionality on distance have also been studied in the data mining literature (e.g. Brin 1995, Beyer et al. 1999, Aggarwal et al. 2001), though often without recognition of the link to the chi-square distribution. In this context, they are typically referred to as an aspect of the “curse of dimensionality”, which refers to the exponential increase in data sparsity with increasing dimensionality.

Dimensionality effects are not discussed in the original formulation of the Standardized Euclidean Distance (SED) metric of novel and disappearing climates (Williams et al. 2007) nor in its regional applications (Veloz et al. 2011, Ackerly 2012, Ordonez and Williams 2013), even though dimensionality is likely an important determinant of the novelty threshold (SEDt) that is central to the metric.

Dimensionality effects in the spatiotemporal climate space of the BEC system

How do these dimensionality effects play out in the spatiotemporal data set I am using for my analysis? A distance histogram of BC climates relative to the CDFmm variant (Figure 10) is quite revealing.

The mean of the probability distribution of CDFmm climate years is approximately the square root of the dimensionality. This would be expected, since the purpose of spatiotemporal standardization is to normalize temporal variability at each BEC variant centroid to approximate multivariate normality with a standard deviation of one. However, the distance to other BEC variants is relatively stable at dimensionality greater than two. This is likely because the first two PCs contain the vast majority of the spatial variation of the data (Figure 11). The other dimensions primarily represent different modes of temporal climatic variation, and therefore all variation in these lesser dimensions has approximately unit standard deviation. The result is that the relative distance between CDFmm climate years and those of other BEC variants decreases at higher dimensionality.

Figure 10: distance to the centroid of the CDFmm 1951-2010 climate years. Coloured histograms are 1951-2010 climate years for BEC variants representing similar (CWHxm1), somewhat different (CWHvh1), and very different (ESSFwk1) climates. The grey histogram is the distribution of 1971-2000 climate normals for 168 BEC variant representative locations.

Figure 11: One-dimensional distance to the centroid of the CDFmm 1951-2010 climate years, for selected principal components. Coloured histograms are 1951-2010 climate years for BEC variants representing similar (CWHxm1), somewhat different (CWHvh1), and very different (ESSFwk1) climates. The grey histogram is the distribution of 1971-2000 climate normals for 168 BEC variant representative locations.

A solution: Dimensionality-adjusted Mahalanobis distance

My main interest in the relationship between distance, probability density, and dimensionality is in finding a good way of measuring differences between climatic conditions. My goal is to achieve a unit of distance in climate space that reflects the scale of interannual climatic variability at any given location. The problem with the dimensionality effects is that they distort the meaning of the distance metric in terms of the probability density of variability around a reference condition. One way to correct for this aspect of the problem is to standardize distances so that the probability distribution of the interannual variability in climate-year distances from the climate normal of the reference location has a mean of one at any level of dimensionality.

Spatiotemporal PCA (stPCA) creates a data space in which the reference period climate years at selected locations follow an approximately standard multivariate normal distribution, assuming normality of the raw variables. As a result, the squared distances of the climate years to their average (the reference normal) follow a chi-square distribution with degrees of freedom equalling the dimensionality of the data space. The mean of the standard chi-square distribution equals the degrees of freedom and thus the dimensionality. Hence, dividing distances by the square root of the dimensionality will center the climate years for any location at a value of approximately one. Further, the units of this dimensionality-adjusted distance metric are interpretable as the standard deviations of spherical multivariate normal distributions. Compared to unadjusted distance, this metric appears to be more easily interpreted because it is intuitively related to univariate dispersion.
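The adjustment itself is a one-line calculation. The sketch below applies it to simulated standard normal data rather than to the climate data; the function names and sample size are assumptions for illustration. It reports both the mean and the root-mean-square of the adjusted distances, because the chi-square argument fixes the mean of the squared distances exactly, while the mean of the distances themselves sits slightly below one at low dimensionality.

```python
import math
import random


def adjusted_distance(point, reference):
    """Euclidean distance divided by the square root of the dimensionality."""
    return math.dist(point, reference) / math.sqrt(len(point))


def simulate_adjusted(dims, n=2000, seed=1):
    """Adjusted distances to the origin for simulated standard normal points."""
    rng = random.Random(seed)
    origin = [0.0] * dims
    return [adjusted_distance([rng.gauss(0.0, 1.0) for _ in range(dims)], origin)
            for _ in range(n)]


for dims in (1, 2, 4, 10, 50):
    adj = simulate_adjusted(dims)
    mean = sum(adj) / len(adj)
    rms = math.sqrt(sum(a * a for a in adj) / len(adj))
    print(dims, round(mean, 2), round(rms, 2))
```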

Figure 12 illustrates the effect of standardizing distances in climate space by the square root of the dimensionality. This standardization has the desirable effect of limiting the probability distribution of the climate years for the focal geographic location (the CDFmm in this case) to a mean of one. It also has the effect of compressing the distributions of other BEC variants as dimensionality increases. This compression is a result of there being little spatial variation in dimensions above the third PC (Figure 11), i.e. all BEC variants have the same mean in these lesser eigenvectors. Hence the average dispersion of the data declines with increasing dimensionality. Adding dimensions for which all group means are the same will make the groups appear more similar, and thus reduce the distances between them. This is a much more intuitive result than the unadjusted Mahalanobis distance.

Figure 12: Effect of transforming distances to the CDFmm centroid by dividing them by the square root of the dimensionality. This transforms the Mahalanobis metric into units of standard deviations of multivariate normal data

Chi-square percentiles as an alternate distance metric

Since squared distances of climate years to their own average are expected to follow a chi-square distribution, it is logical that climatic dissimilarity could be measured in chi-square percentiles. This approach is demonstrated in Figure 13. Chi-square percentiles clearly provide a statistically precise distance metric. However, the horizon of the metric is very close to the origin: even a climate as similar to the CDFmm as the CWHdm is beyond the 99.99th percentile. This likely limits the utility of chi-square percentiles for climate change analysis.

 

Figure 13: Utility of chi-square percentiles as a dimensionality-independent distance metric.  The squared Mahalanobis distances of climate years to their own centroid follow a chi-square distribution, the percentiles of which provide a statistically precise distance metric. The horizon of this distance metric, however, is very short.
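Converting a distance to a chi-square percentile only requires the chi-square cumulative distribution function. The sketch below is a standard-library stand-in that evaluates it with the power series for the regularized lower incomplete gamma function; in practice a statistics library would normally be used instead. The function names are chosen here for illustration, and the distances in the loop are arbitrary examples rather than values from the climate data.

```python
import math


def chi_square_cdf(x, dof, tol=1e-12, max_terms=10000):
    """P(X <= x) for a chi-square variable with `dof` degrees of freedom,
    using the series expansion of the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, half_x = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= half_x / (a + n)
        total += term
        if term < tol * total:
            break
    log_prefactor = a * math.log(half_x) - half_x - math.lgamma(a)
    return min(1.0, total * math.exp(log_prefactor))


def distance_to_percentile(distance, dims):
    """Chi-square percentile of an unadjusted Mahalanobis distance."""
    return 100.0 * chi_square_cdf(distance ** 2, dims)


for distance in (1.0, math.sqrt(10.0), 5.0, 7.0):
    print(round(distance, 2), round(distance_to_percentile(distance, 10), 2))
```

In ten dimensions a distance equal to the square root of the dimensionality (about 3.16) falls near the 56th percentile, and the percentile saturates within a few more units, which illustrates how short the horizon of this metric is.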

Distinction between distances to climate year distributions vs climate normals

When using the dimensionality-adjusted Mahalanobis distance as a dissimilarity metric, it is important to be mindful of the distinction between distances to climate year distributions vs climate normals. This is the distinction between the mean distance to a set of observations vs. the distance to the mean of those observations. As the mean of a standard multivariate normal distribution approaches a reference point, the distance between these two points will obviously approach zero. However, the mean dimensionality-adjusted Mahalanobis distance from any reference point to standard multivariate normal data cannot fall below approximately one, because of the inherent dispersion of the data (Figure 14). It follows from this simple logic that a distance of one has different meanings for climate year distributions vs. climate normals.

Figure 14: distinction between the mean distance to observations in a sample, as opposed to the distance to the mean (centroid) of the sample. This relationship is independent of dimensionality.
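The two quantities can be compared directly in a simulation. This sketch assumes a standard normal sample whose mean is offset from the reference point (the origin) by the same amount in every dimension; the function name, the offsets, and the choice of ten dimensions are illustrative assumptions.

```python
import math
import random
import statistics


def compare_distances(offset, dims=10, n=2000, seed=1):
    """Return (distance to the sample mean, mean distance to the observations),
    both measured from the origin and divided by sqrt(dims)."""
    rng = random.Random(seed)
    # Unit-variance normal sample with mean `offset` in every dimension
    sample = [[rng.gauss(offset, 1.0) for _ in range(dims)] for _ in range(n)]
    origin = [0.0] * dims
    centroid = [statistics.fmean(column) for column in zip(*sample)]
    scale = math.sqrt(dims)
    to_mean = math.dist(centroid, origin) / scale
    mean_to_observations = statistics.fmean(
        math.dist(row, origin) for row in sample) / scale
    return to_mean, mean_to_observations


for offset in (0.0, 0.5, 1.0, 2.0):
    to_mean, mean_to_obs = compare_distances(offset)
    print(offset, round(to_mean, 2), round(mean_to_obs, 2))
```

With this construction the adjusted distance to the mean tracks the offset, while the mean adjusted distance to the observations stays close to the square root of (offset squared plus one), so it levels off near one as the offset goes to zero.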

References

Ackerly, D. D. 2012. Future Climate Scenarios for California: Freezing Isoclines, Novel Climates, and Climatic Resilience of California’s Protected Areas. Page 64.

Aggarwal, C. C., A. Hinneburg, and D. A. Keim. 2001. On the Surprising Behavior of Distance Metrics in High Dimensional Space. Pages 420–434 Lecture Notes in Computer Science.

Beyer, K., J. Goldstein, R. Ramakrishnan, and U. Shaft. 1999. When is “Nearest Neighbour” Meaningful? Int. Conf. on Database Theory.

Brin, S. 1995. Near Neighbor Search in Large Metric Spaces. Pages 574–584 Proceedings of the 21st VLDB Conference. Zurich, Switzerland.

Ordonez, A., and J. W. Williams. 2013. Projected climate reshuffling based on multivariate climate-availability, climate-analog, and climate-velocity analyses: implications for community disaggregation. Climatic Change 119:659–675.

Veloz, S., J. W. Williams, D. Lorenz, M. Notaro, S. Vavrus, and D. J. Vimont. 2011. Identifying climatic analogs for Wisconsin under 21st-century climate-change scenarios. Climatic Change 112:1037–1058.

Wilks, D. S. 2006. Statistical Methods in the Atmospheric Sciences, Second Edition. Page 627. Internatio. Academic Press.

Williams, J. W., S. T. Jackson, and J. E. Kutzbach. 2007. Projected distributions of novel and disappearing climates by 2100 AD. Proceedings of the National Academy of Sciences of the United States of America 104:5738–42.

 

Cross-validated comparison of LDA vs. spatiotemporally standardized PCA

Summary

In this post, I use cross-validated climate year classifications to compare the effectiveness of linear discriminant analysis (LDA) vs. nearest neighbour classification on spatiotemporally standardized principal components (ST PCA). Despite achieving lower correct classification rates on training data, ST PCA performed as well as LDA on test data. This supports theoretical reasoning that ST PCA should be less prone than LDA to overfitting due to the structure of its eigenspace. Neither method showed evidence of overfitting at high dimensions, a perplexing result that deserves further consideration. These results suggest that the benefits of ST PCA over LDA—i.e. a more logical climate space for climate change analysis—do not come at the cost of lower classification effectiveness.

 

Introduction

In previous posts, I developed the methodology of spatiotemporal standardization (STS) as a way of structuring the climate space of British Columbia around historical variability and the differences between BEC variants. Doing STS on the raw data, and again on truncated principal components, creates a climate space in which nearest-neighbour classification is similar to classification via linear discriminant analysis (LDA). Nevertheless, I found that LDA was somewhat more effective at climate year classification in training data, and thus possibly more desirable for classification of future climates. The purpose of this post is to do a first pass on cross-validated comparison of LDA and what for the moment I am calling spatiotemporal principal components analysis (ST PCA).

Classification Methods

The data for this analysis is a 1961-1990 time series of 14 annual variables at each of 168 BEC variant centroid surrogates. This is data from ClimateWNA v4.72.

Both methods use nearest-neighbour classification (k=1). Each 14-variable climate year is classified by assigning it to the nearest BEC variant 30-yr normal. The only difference between the two methods is the eigenspace in which this Euclidean distance is measured.
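The classification rule can be sketched in a few lines. The variant labels and coordinates below are hypothetical placeholders in a two-dimensional eigenspace, not values from the ClimateWNA data, and the function name is chosen here for illustration.

```python
import math


def classify_nearest_normal(climate_year, normals):
    """Assign a climate year to the label of the closest 30-yr normal (k=1)."""
    return min(normals, key=lambda label: math.dist(climate_year, normals[label]))


# Hypothetical normals in a two-dimensional eigenspace (illustrative values only)
normals = {
    "variant_a": [0.0, 0.0],
    "variant_b": [4.0, 1.0],
    "variant_c": [-3.0, 5.0],
}

print(classify_nearest_normal([3.1, 0.2], normals))
```

The same function applies unchanged in the LDA and ST PCA eigenspaces, since only the coordinates passed in differ between the two methods.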

The eigenspace for the LDA is calculated using the “lda” call in the MASS package of R. This call extracts eigenvectors based on the ratio of between-group to within-group scatter, then standardizes the eigenvectors to unit within-group variance. I have confirmed that nearest neighbour classification in this eigenspace produces identical results to the classification produced by the predict(lda) call.

The eigenspace for the ST PCA is created by the following process: (1) standardizing the 14 raw climate variables by the average standard deviation of temporal variability of each BEC variant; (2) PCA using the “prcomp” call in the stats package of R; and (3) standardizing retained eigenvectors by the average standard deviation of temporal variability of each BEC variant. The first round of standardization has the effect of supervising the PCA so that it selects eigenvectors with a high ratio of spatial (between-group) to temporal (within-group) variation. The second round of standardization removes the influence of redundant and correlated variation in the raw data, and appears to be similar to the eigenvector standardization performed by the “lda” call.
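The standardization used in steps (1) and (3) can be sketched as follows. This is one reading of the description, assumed here for illustration: the temporal standard deviation of each variable is computed within each variant, averaged across variants, and used as a divisor. Only the scaling is shown; the PCA of step (2) is omitted, and the variant labels and values are hypothetical toy data.

```python
import statistics


def average_temporal_sd(data):
    """For each variable, the mean across variants of its within-variant
    (year-to-year) standard deviation."""
    n_vars = len(next(iter(data.values()))[0])
    result = []
    for v in range(n_vars):
        group_sds = [statistics.stdev([year[v] for year in years])
                     for years in data.values()]
        result.append(statistics.fmean(group_sds))
    return result


def scale_by(data, divisors):
    """Divide every variable by its divisor, keeping the variant structure."""
    return {variant: [[x / s for x, s in zip(year, divisors)] for year in years]
            for variant, years in data.items()}


# Toy data: variant -> list of climate years -> [variable 1, variable 2]
toy = {
    "variant_a": [[5.0, 900.0], [6.0, 1100.0], [7.0, 1000.0]],
    "variant_b": [[1.0, 400.0], [5.0, 500.0], [3.0, 600.0]],
}

sds = average_temporal_sd(toy)
scaled = scale_by(toy, sds)
print(sds)
print(average_temporal_sd(scaled))
```

After scaling, the average within-variant standard deviation of every variable is one, which is the property the standardization is meant to deliver.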

Cross-validation methodology

Two measures are taken to reduce the influence of spatial and temporal autocorrelation on the cross-validation results, using the “xval.buffer” method of the CaDENCE package in R (Cannon 2012). First, spatial autocorrelation is addressed by testing against blocks of the same three contiguous years in all BEC variants. Second, serial autocorrelation is addressed by a three-year buffer on the test data: i.e. 3 years on either side of the test data are withheld from the training data. There are 10 iterations of cross-validation over the 30-yr classification period, as illustrated in Figure 1.

Figure 1: illustration of buffered cross-validation using test data composed of contiguous blocks of three years, using the “xval.buffer” method of the CaDENCE package in R (Cannon 2012).
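The fold structure described above can be written out as index sets. This is a hand-written sketch of the scheme, not a call to the CaDENCE package; the function and parameter names are assumptions, and years are indexed from 0 to 29 to stand for the 30-yr classification period.

```python
def buffered_folds(n_years=30, block=3, buffer=3):
    """Return (train, test) year-index lists for blocked cross-validation in
    which `buffer` years on either side of each test block are withheld."""
    folds = []
    for start in range(0, n_years, block):
        test = list(range(start, min(start + block, n_years)))
        # Test block plus the buffer years on both sides are kept out of training
        excluded = set(range(start - buffer, start + block + buffer))
        train = [year for year in range(n_years) if year not in excluded]
        folds.append((train, test))
    return folds


for train, test in buffered_folds():
    print(test, len(train))
```

Blocks in the interior of the period leave 21 training years, while blocks at either end leave 24 because the buffer extends beyond the record on one side. The same year indices would be applied to all BEC variants at once.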

The meaning of “correct” in climate year classification

Supervised classification methods such as LDA are based on the principle of training the classification model with data of known class, and validating the model with test data of known class. In this typical context, the skill of the model is inferred from “correct classification rate”: the proportion of the test observations that are assigned the correct known class. In the context of climate year classification, there is no true “correct” classification: the a priori class of each climate year is the spatial location of the observation (e.g. BEC variant centroid surrogate), and the assigned class is the climatic condition (e.g. 30-yr normal) that the climate year most closely resembles. “Correct classification rate” in this context depends both on the ability of the model to differentiate the groups, and on the intrinsic differentiation (degree of overlap) between the groups. For this reason, “differentiation” is a more precise term for correct classification rate in climate year classification.
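Under this definition, differentiation for a variant is simply the share of its climate years that are assigned to that variant's own normal. A minimal sketch, with hypothetical labels and a function name chosen here for illustration:

```python
from collections import Counter


def differentiation_rates(origin_labels, assigned_labels):
    """Per-variant share of climate years assigned to their own variant."""
    totals = Counter(origin_labels)
    matches = Counter(origin for origin, assigned
                      in zip(origin_labels, assigned_labels)
                      if origin == assigned)
    return {label: matches[label] / totals[label] for label in totals}


# Hypothetical example: where each climate year came from, and where it was assigned
origin = ["variant_a", "variant_a", "variant_a", "variant_b", "variant_b"]
assigned = ["variant_a", "variant_b", "variant_a", "variant_b", "variant_b"]

print(differentiation_rates(origin, assigned))
```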

 

Results

As demonstrated in previous posts, LDA achieves higher differentiation in the training data than ST PCA (Figure 2). However, both methods achieved similar differentiation in the test data, suggesting that ST PCA and LDA are equally effective as climate year classification methods. Overfitting due to dimensionality is not apparent: differentiation in the test data declined only very slightly when more than ten eigenvectors were retained for classification. Cross-validated differentiation of individual BEC variants is equivalent for both methods (Figure 3).

Figure 2: cross-validated climate-year differentiation skill of LDA and ST PCA, for increasing dimensionality of eigenspaces.

Figure 3: cross-validated climate-year differentiation skill of LDA and ST PCA, showing the distribution of differentiation rates for each of the 168 BEC variants. Both classifications used 10 eigenvectors.

Discussion

Prior to this analysis, I expected LDA to be more prone to overfitting because selecting eigenvectors based purely on the ratio of between-group to within-group variance allows eigenvectors with low variance to be assigned moderate rank. ST PCA is likely more robust to overfitting because it prioritizes between-group variation while still ranking eigenvectors based on variance. This analysis suggests that LDA is indeed more prone to overfitting than ST PCA. The extent of overfitting in LDA was such that both methods have equivalent differentiation skill on the test data, despite better differentiation by LDA on the training data. This suggests that the other advantages of ST PCA over LDA, e.g. its more logical eigenvector structure, do not come at a cost in terms of classification skill.

The lack of overfitting at high dimensions is perplexing. I expected to see a dramatic decline in cross-validated skill of both methods when more than 10 eigenvectors were retained. Both methods standardize the retained eigenvectors to unit within-group variance. The effect of this standardization is to give all retained eigenvectors the same importance in the classification. It would seem logical that inflating the variance of dimensions with no discriminating power would confound the distance metric used for classification. I would also expect that the addition of these trivial dimensions would begin to invoke the curse of dimensionality, in which all observations seem equally distant from any given point in high-dimensional space. Neither of these effects is apparent, and the failure of this logic warrants further consideration.

Independent test data are essential for reliable model validation.  This is problematic for climate year classification, since the cross-validated skill of classification models is inflated by serial autocorrelation of time series data (Shabbar and Kharin 2007). The problem of finding independent data is exacerbated by the fact that there is no a priori class of a climate year, and thus no definitive means of declaring a classification as “correct”. However, the objective of this analysis is to compare the differentiating skill of LDA and spatiotemporal PCA, rather than to assess the absolute skill of either method. Imperfectly independent test data are likely to be adequate to assess the relative effectiveness of the methods.

 

References

Cannon, A. J. 2012. Neural networks for probabilistic environmental prediction: Conditional Density Estimation Network Creation & Evaluation (CaDENCE) in R. Computers & Geosciences 41:126–135.

Shabbar, A., and V. Kharin. 2007. An assessment of cross-validation for estimating skill of empirical seasonal forecasts using a global coupled model simulation. CLIVAR Exchanges 12:10–12.