
A Mahalanobis metric for regional climate change analysis

Summary

A meaningful metric of climatic difference is helpful for climate change analysis. There is a strong statistical and ecological rationale for the use of Mahalanobis metrics in multivariate analysis of climatic novelties, since they remove the effect of correlations and trivial redundancies between climate variables. This post describes standardization of eigenvectors on the temporal component of spatiotemporal data. In other words, this method performs a second round of spatiotemporal standardization on features extracted from spatiotemporally standardized data. This second standardization appears to be functionally (and possibly mathematically) equivalent to spherical normalization of the within-groups covariance matrix of extracted features, which is commonly used within linear discriminant analysis to improve classification. In addition to its advantages in classification, this method provides a Mahalanobis metric whose units reflect the average temporal variability of the study area. This “M-metric” is a robust and ecologically meaningful metric for quantification of climatic differentiation and change.

 

The rationale for standardization of extracted features

Standardization (a.k.a. scaling) is a very common procedure in multivariate analysis. The basic rationale for standardization is that different variables have different and incomparable units of measurement (e.g. °C vs. mm). However, even for variables with comparable units, there is an additional ecological rationale for standardization of interannual variability: life strategies of organisms will generally be adapted to the background variability of any given climate element. Adaptation to historical variability is expected to be both temporal (e.g. winter vs. summer temperatures) and spatial (e.g. mean annual temperature in temperate vs. tropical regions) (Mora et al. 2013). This ecological rationale for standardization must be considered in metrics for climatic differences in both space and time.

Standardization gives all variables equal or comparable variance, and thus equal a priori influence on the outcome of a principal components analysis (PCA). However, PCA aggregates the correlated variance of the raw variables into the extracted principal components. The relative variance of the principal components thus reflects the redundant variance in the raw variables. Addition of a trivially and perfectly redundant variable to the raw data (e.g. a second set of MAT values) will have the non-trivial result of increasing the variance of one of the principal components and possibly increasing its rank. This behaviour highlights the importance of careful variable selection prior to performing PCA. Nevertheless, correlation between raw climate variables is inevitable, and so the ecological meaning of the unequal variance of principal components needs to be considered.

Extracted features (e.g. principal components) represent independent (uncorrelated) dimensions of climatic variation. Regardless of whether unequal variance in the principal components is due to trivial redundancies or meaningful correlations, the relative scales of variation in the principal components are unlikely to be ecologically meaningful, due to life strategy adaptation. What is important is the magnitude of a climatic difference or change relative to the background variation of any given principal component. In this light, the ecological rationale for standardization of raw variables also applies to principal components and to eigenvectors extracted by other methods, such as LDA.

Mahalanobis distance vs. Euclidean distance

I have demonstrated in previous posts that principal components analysis doesn’t affect the distance between observations in climate space. Therefore, Euclidean distance on non-truncated principal components is the same as Euclidean distance on raw variables. Standardization of the principal components (dividing each component by its standard deviation, in z-score units) creates a climate space in which Euclidean distance is equivalent to Mahalanobis distance in the standardized but unrotated climate space (Stephenson 1997). The units of standardized eigenvectors (e.g. standardized principal components) therefore are Mahalanobis metrics. Similar to z-scores, the precise meaning of a Mahalanobis metric depends on how the eigenvectors are derived from the raw data (e.g. PCA vs. LDA) and how the standardization population is defined (e.g. global vs. spatiotemporal).
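This equivalence is easy to verify numerically. The following R sketch uses simulated data (all names and values are arbitrary) to show that Euclidean distance on standardized principal components matches Mahalanobis distance in the standardized, unrotated space:

# Sketch: Euclidean distance on standardized PC scores equals Mahalanobis
# distance in the standardized (unrotated) climate space. Simulated data.
library(MASS)   # for mvrnorm()
set.seed(1)
X <- mvrnorm(200, mu = rep(0, 3),
             Sigma = matrix(c(1, 0.8, 0.3, 0.8, 1, 0.5, 0.3, 0.5, 1), 3))

Z  <- scale(X)     # "global" z-standardization
pc <- prcomp(Z)    # PCA of the standardized data
P  <- scale(pc$x)  # standardize the principal components

# distance between the first two observations, computed both ways
d.euclid <- sqrt(sum((P[1, ] - P[2, ])^2))
d.mahal  <- sqrt(mahalanobis(Z[1, , drop = FALSE], center = Z[2, ], cov = cov(Z)))
all.equal(d.euclid, as.numeric(d.mahal))   # TRUE, up to numerical error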

Spherization of temporal climate envelopes

As shown in the previous post, spatiotemporal standardization operates by dividing each observation by the average standard deviation of climatic anomalies across the study area. This approach creates temporal climatic envelopes that are approximately spherical, and collectively have an average standard deviation of one z-score. Climate envelopes of individual locations will be larger or smaller, and more-or-less spherical, depending on their variability in each climate element relative to the study area average. Principal components analysis of this space is desirable, since it aggregates redundant variation into uncorrelated dimensions of variation. However, the aggregation of variance into principal components creates ellipsoidal (oblong) temporal climate envelopes. As discussed above, the ecological meaning of the ellipsoidal shapes of temporal climate envelopes is ambiguous. A second round of spatiotemporal standardization of the principal components returns the temporal climate envelopes to an approximately spherical shape. This second standardization can be called “spherization of within-group variability,” or just “spherization” for simplicity. The metric of the sphered climate space can be called the M-score, because it represents a unit of Mahalanobis distance based on the average standard deviation of temporal anomalies of z-scores over all locations in the study area. The process of spherization of truncated principal components is illustrated in Figure 1.
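In R, the spherization step can be sketched as follows. The objects pcs (a matrix of principal component scores for all climate years) and loc (a factor giving each row’s location) are hypothetical and simulated here, and detrending of the anomalies is assumed to have been done upstream:

# Spherization sketch: divide each principal component by the average
# within-location (temporal) standard deviation, so that temporal climate
# envelopes have an average standard deviation of one M-score in every dimension.
set.seed(2)
loc <- factor(rep(1:5, each = 30))   # 5 locations x 30 climate years
pcs <- matrix(rnorm(150 * 3, sd = rep(c(3, 1.5, 0.5), each = 150)), ncol = 3)

spherize <- function(pcs, loc) {
  within.sd <- apply(pcs, 2, function(p) mean(tapply(p, loc, sd)))
  sweep(pcs, 2, within.sd, "/")      # M-scores
}

M <- spherize(pcs, loc)
round(colMeans(apply(M, 2, function(p) tapply(p, loc, sd))), 2)   # ~1 in every dimension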

Figure 1: effect of spatiotemporal standardization of principal components (spherization) on the variability structure of the BEC variants of the south-central interior BC.

Figure 2: the role of spherization in linear discriminant analysis of the BEC variants of the south-central interior BC.

Comparison with linear discriminant analysis

Many aspects of the above procedure (spherization of local anomalies (within-group variability) of the principal components of spatiotemporally standardized data) are conceptually similar to the procedures of linear discriminant analysis. A comparison of the methods and results of the two procedures is warranted.

Spatiotemporal standardization expresses spatial variability between locations in terms of the average temporal variability of all locations. This is similar, though not equivalent, to the ratio of the between-groups to the within-groups scatter matrix that is the basis of LDA: the within-groups scatter matrix is the sum of the individual group covariance matrices, and is therefore proportional (for equal group sizes) to the average within-group covariance matrix, which plays a role analogous to the average matrix of within-group standard deviations used in spatiotemporal standardization. Variables that have higher spatial variability will have larger variance in a spatiotemporally standardized data set. A principal components analysis of spatiotemporally standardized data is therefore functionally similar to LDA, because both give high priority (eigenvalues) to variables with a high ratio of between-group to within-group variability.

Despite this functional similarity, PCA calculates eigenvalues of the total data variance, while LDA calculates eigenvalues of the ratios of between- to within-group variance. The extracted features are different as a result. Low eigenvalues in PCA indicate residual dimensions with very low spatial and temporal variation (Figure 1). Low eigenvalues in LDA indicate dimensions with low spatial variation, but which may nevertheless have substantial temporal variability (Figure 2). PCA is advantageous because it facilitates truncation of the features with very low variance; in Figure 1 it is clear that the last four principal components can be dropped from further analysis with minimal loss of information. However, beyond the process of truncation, the advantages of PCA vs. LDA are ambiguous. PCA structures the climate space in order of decreasing total variance, while LDA structures the climate space in order of decreasing spatial differentiation. Both of these structures are climatically and ecologically meaningful in their own way, and their relative advantages will depend on the specific research question being asked.

Effect of spherization on climate year classification

Spherization is not integral to the feature extraction component of the LDA procedure, but it is typically incorporated into LDA because it has an important role in classification. For example, it is an integral component of the lda{MASS} call in R. It is logical that spherization of the principal components should also improve nearest neighbour classification. Figure 3 demonstrates the sequential effects of spatiotemporal standardization and spherization on differentiation (nearest-neighbour correct classification rates) between BEC variant temporal climate envelopes. Typical global z-standardization produces an average differentiation in the otherwise untransformed climate space of 36%. Spatiotemporal standardization improves differentiation to 41%. As demonstrated previously, PCA has no effect on differentiation. However, spherization of the principal components improves differentiation to 47%. LDA, which includes spherization as part of its classification procedure, achieves a differentiation of 53%. It is notable that LDA without spherization (not shown in Figure 3) only achieves 44% differentiation. Spherization is clearly advantageous for classification. Spatiotemporal standardization followed by spherization of principal components achieves classification results that are only marginally poorer than LDA.

Figure 3: effect of spatiotemporal (ST) standardization and spherization on climate year classification of BEC variants, compared to simple “global” z-standardization (left) and linear discriminant analysis (LDA, right). All classifications use the nearest-neighbour criterion.

Use of spherization and Mahalanobis distance in the ecological climate change literature

Standardized anomalies are frequently used in the atmospheric sciences, and are useful for univariate analysis because they quantify variability in terms of standard deviations (z-scores). Mora et al. (2013) adapted the concept of standardized anomalies to describe “climate departures”, i.e. the year at which the future temporal climate envelope of any given location ceases to overlap with its own historical temporal climate envelope. Although they performed univariate analyses on seven climate variables, they did not perform any multivariate analysis, very likely because the purpose of their paper was to communicate climate change impacts simply to a large non-expert audience. Nevertheless, the high impact of this paper suggests that an equivalent multivariate approach is in demand.

Williams et al. (2007) provided a multivariate analysis of standardized anomalies, using a standardized Euclidean distance metric. The authors may have opted against a Mahalanobis metric for the sake of simplicity and ease of communication.  Nevertheless, standardized Euclidean distance metrics are susceptible to the unjustified influence of correlations in the input variables, and so are problematic for quantitative investigation of climate change.

Hamann and Wang (2006) used linear discriminant analysis on spatial variation in climate normals to define the climate envelopes of BEC variants for the purposes of climate change analysis. They explicitly noted that this method uses a Mahalanobis distance for climatic classification. This Mahalanobis metric is very similar to the one I have proposed because it is based on spherization of within-group variability. However, the units of the Mahalanobis metric in Hamann and Wang (2006) reflect the average spatial variation of climate normals within BEC variants, and thus the scale of BEC mapping. A Mahalanobis metric based on the temporal variation at individual locations is essentially independent of BEC mapping, which is likely preferable for subsequent analysis of BEC mapping.

As an alternative to discriminant analysis for species distribution modeling, Roberts and Hamann (2012) proposed nearest-neighbour classification using Euclidean distance on heavily truncated principal components. They claimed that this is a modified Mahalanobis distance, but did not specify any standardization of the principal components in their methodology. Euclidean distance on principal components is the same as Euclidean distance on raw variables. As a result, it does not appear that their method produces a Mahalanobis metric, though this needs to be verified.

 

References

Hamann, A., and T. Wang. 2006. Potential Effects of Climate Change on Ecosystem and Tree Species Distribution in British Columbia. Ecology 87:2773–2786.

Mora, C., A. G. Frazier, R. J. Longman, R. S. Dacks, M. M. Walton, E. J. Tong, J. J. Sanchez, L. R. Kaiser, Y. O. Stender, J. M. Anderson, C. M. Ambrosino, I. Fernandez-Silva, L. M. Giuseffi, and T. W. Giambelluca. 2013. The projected timing of climate departure from recent variability. Nature 502:183–7.

Roberts, D. R., and A. Hamann. 2012. Method selection for species distribution modelling: are temporally or spatially independent evaluations necessary? Ecography 35:792–802.

Stephenson, D. B. 1997. Correlation of spatial climate/weather maps and the advantages of using the Mahalanobis metric in predictions. Tellus 49A:513–527.

Williams, J. W., S. T. Jackson, and J. E. Kutzbach. 2007. Projected distributions of novel and disappearing climates by 2100 AD. Proceedings of the National Academy of Sciences of the United States of America 104:5738–42.

 

 

Spatio-temporal standardization of climate data

Summary

This post describes a modification of anomaly standardization for spatiotemporal climate data. I have not yet found an explicit description of this approach in the literature, so for the moment I am calling it spatiotemporal standardization. This method standardizes climate variables by the average standard deviation of interannual variability of all locations in the study area. It is a useful compromise between typical “global” z-standardization across the whole data set and anomaly standardization of each location’s interannual variability. Spatiotemporal standardization serves two important purposes for climate change analysis: (1) it provides an ecologically meaningful metric of climatic difference—the spatiotemporal z-score—that directly reflects observed interannual variability in each climate element; and (2) it supports PCA feature extraction that appropriately balances spatial (between-group) and temporal (within-group) variability. Unlike local anomaly standardization, spatiotemporal standardization does not distort the climate space: it preserves the relative distances between all observations within any given variable. However, it alters Euclidean distances in the climate space by emphasizing variables with high spatial (between-group) variability, and therefore improves climate year classification results. Spatiotemporal standardization appears to be an effective foundation for analysis of bioclimate differentiation, historical variability, and degrees of climatic novelty.

Global, anomaly, and spatio-temporal standardization

Standardization is necessary for multivariate analysis of climate data because climatic elements (e.g. temperature and precipitation) have different units and scales (e.g. degrees Celsius and millimetres). Z-standardization subtracts the population mean from each observation, and then divides by the population standard deviation. This method reduces observations to z-scores, which have a mean of zero and unitless values representing standard deviations above and below the mean. Beyond the simple utility of standardization, the z-score is a useful concept for climatic analysis because it provides a metric of climatic difference that is directly related to the statistical variability of climate elements.

The z-standardization process can be customized for specific purposes by carefully defining the population. The most common approach to z-standardization is to treat all observations in the data as one population. This approach can be called “global standardization”. The major disadvantage of global standardization for spatio-temporal data sets is that it returns z-scores that arbitrarily reflect the spatial variation of climate elements over the study area (a larger study area will have a larger range of variation), and which therefore have limited ecological meaning. An alternate approach is to apply standardization locally and temporally by defining the population as the annual observations at one location over a given time period. This is called anomaly standardization in the atmospheric sciences. Standardized anomaly z-scores are a measure of climatic difference in space or time in terms of the historical range of climatic variability at an individual location. Standardized anomalies thus provide an ecologically meaningful metric for climatic differences between locations, for the magnitude of climate change, and for analog goodness-of-fit. For example, anomaly standardization is the basis for the “standardized Euclidean distance” approach described by Williams et al. (2007) for their analysis of novel and disappearing climates.
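The contrast between the two populations can be illustrated with a toy R sketch (two locations and 30 years of simulated values; names and numbers are arbitrary):

# "Global" vs. "anomaly" (local) standardization of the same spatiotemporal variable.
set.seed(3)
loc <- factor(rep(c("cool", "warm"), each = 30))
x   <- rnorm(60, mean = ifelse(loc == "warm", 10, 2), sd = 1)

z.global  <- (x - mean(x)) / sd(x)                                # population: all observations
z.anomaly <- ave(x, loc, FUN = function(v) (v - mean(v)) / sd(v)) # population: each location's years

sd(z.global)                # = 1 over the whole data set
tapply(z.anomaly, loc, sd)  # = 1 within each location; the spatial contrast is removed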

One limitation of anomaly standardization is that it transforms climate data at each location by a different scalar, and therefore distorts the spatial structure of temporal variability in the study area. This removes important information from the data and precludes comparison of the relative magnitude of historical variability and future climate change at different locations. One solution to this problem is to standardize observations by the mean of the anomaly standard deviations in all locations of the study area. This approach is simple enough that it very likely has been previously developed. However, I haven’t yet found an explicit description in the literature. For the moment, I am calling it “spatio-temporal standardization” because it provides z-scores based on temporal variability that nevertheless preserve the spatial structure of variability in the data.

Explicit description of spatiotemporal standardization

For a spatiotemporal variable with $n$ annual observations $x_{ij}$ ($j = 1, \dots, n$) at each of $m$ locations ($i = 1, \dots, m$), the standard deviation of anomalies at each location $i$ is:

$$s_i = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}\left(x_{ij}-\bar{x}_{i}\right)^{2}}$$

where $\bar{x}_{i}$ is the temporal mean at location $i$. The mean anomaly standard deviation over all $m$ locations is:

$$\bar{s} = \frac{1}{m}\sum_{i=1}^{m}s_i$$

Each observation $x_{ij}$ is spatiotemporally standardized to a unitless score $z_{ij}$ by subtracting the grand mean of the variable, $\bar{x}$, and dividing by the mean anomaly standard deviation:

$$z_{ij} = \frac{x_{ij}-\bar{x}}{\bar{s}}$$
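A minimal implementation of these three steps in R, using a simulated long-format data frame (the column names loc, year, and x are illustrative):

# Spatiotemporal standardization of one climate variable.
set.seed(4)
dat <- data.frame(loc  = factor(rep(1:4, each = 30)),
                  year = rep(1961:1990, times = 4),
                  x    = rnorm(120, mean = rep(c(2, 5, 8, 11), each = 30), sd = 1.2))

s.i   <- tapply(dat$x, dat$loc, sd)      # anomaly standard deviation at each location
s.bar <- mean(s.i)                       # mean anomaly standard deviation over all locations
dat$z <- (dat$x - mean(dat$x)) / s.bar   # spatiotemporal z-score

# one centre and one scalar for all observations, so relative distances within
# the variable (and hence the spatial structure of variability) are preserved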

Effect of spatio-temporal standardization on PCA feature extraction

In previous blog posts I found that PCA of globally standardized climate data was problematic because it allowed variables with large interannual (within-group) variability to dominate the climate space, even if there was relatively little spatial (between-group) variability. This was exemplified in the climate space of south-central BC, where variables associated with winter temperatures (TD and MCMT) dominated the second principal component. This type of high within-group and low between-group variability is unlikely to be ecologically important, because organismal life strategies are adapted to it (e.g. via dormancy in plants). Discriminant analysis is designed to address this problematic type of variability, but I found that it goes too far: LDA extracts features only on the basis of the ratio of between- to within-group variability, regardless of whether those features are important to the overall spatiotemporal variability of the study area. A compromise between PCA and LDA is required.

Spatio-temporal standardization provides an effective approach to balancing within-group and between-group variability in feature extraction. By standardizing on average temporal variability of the study area, climate elements are given equal emphasis relative to each other at any given location. This is ecologically justified, since organismal life strategies can be expected to be adapted to the general scale of variability in each climate element at any given location. A PCA of spatiotemporally standardized data will then emphasize climate elements with high spatial variability.

Figure 1: biplots of the first two principal components of globally standardized and spatiotemporally standardized climate data for the BEC variants of south-central interior British Columbia (BG, PP, IDF, ICH, SBS, SBPS, MS, and ESSF BEC zones). Red arrows indicate correlations between the two PCs and the 14 standardized climate variables.

The difference between PCA on globally standardized and spatiotemporally standardized data is illustrated in biplots of reduced climate spaces for south-central interior BC (Figure 1). Winter temperature variables TD and MCMT dominate PC2 in the globally standardized data because of their high within-group variability. Spatiotemporal standardization leads to much better differentiation of BEC variants in PC2. Variability structure plots (Figure 2) indicate that the winter temperature feature is shifted from PC2 in the globally standardized data to PC4 in the spatiotemporally standardized data.

Figure 2: structure of variability in principal components of globally standardized and spatiotemporally standardized climate data for the BEC variants of south-central interior British Columbia (BG, PP, IDF, ICH, SBS, SBPS, MS, and ESSF BEC zones).

Effect of spatio-temporal standardization on climate-year classification

Spatio-temporal standardization performs centering and scaling by coefficients that are common to all observations of any given variable. Nevertheless, the relative scaling of the variables changes, which changes their relative contributions to Euclidean distance measures. Because spatiotemporal standardization uses a denominator of temporal (within-group) variability, Euclidean metrics of spatiotemporally standardized data will emphasize variables with high spatial (between-group) variation relative to temporal variation. It is therefore logical that spatio-temporal standardization will improve nearest-neighbour classification. This effect is demonstrated in climate year classifications of BEC variants on globally vs. spatiotemporally standardized data (Figure 3). In this example, spatiotemporal standardization increases nearest-neighbour climate year differentiation from 36% to 41%. However, spatiotemporal standardization has no effect on LDA or QDA classification, since these methods include standardization procedures that override the a priori scaling of the input data.
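The nearest-neighbour classification used in these posts assigns each climate year to the nearest group centroid (see the domain discussion in the climate year classification post below). A self-contained sketch with simulated stand-in data, where grp plays the role of the BEC variant of each climate year:

# Nearest-centroid "climate year" classification sketch: each year is assigned
# to the group whose climatic centroid is closest in Euclidean distance.
set.seed(5)
grp <- factor(rep(c("A", "B", "C"), each = 30))   # stand-in for BEC variants
Z   <- matrix(rnorm(90 * 2), ncol = 2) + cbind(rep(c(0, 2, 4), each = 30), 0)

centroids <- apply(Z, 2, function(v) tapply(v, grp, mean))   # one row per group
nearest   <- apply(Z, 1, function(z) which.min(colSums((t(centroids) - z)^2)))

mean(rownames(centroids)[nearest] == grp)   # proportion of years in their "home" domain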

Figure 3: comparison of climate year classification rates for globally and spatiotemporally standardized data.

 

References

Williams, J. W., S. T. Jackson, and J. E. Kutzbach. 2007. Projected distributions of novel and disappearing climates by 2100 AD. Proceedings of the National Academy of Sciences of the United States of America 104:5738–42.

 

 

Detrending climate time series—an evaluation of Empirical Mode Decomposition

Summary

Climate change trends must be removed from the interannual variability used to determine temporal climate envelopes; otherwise, locations with rapid historical climate change would be interpreted as having greater background variability. However, the climate system is too complex to expect that there will be an objective division between trend and variability; an arbitrary distinction must be made. Empirical mode decomposition (EMD) (Huang et al. 1998) extracts intrinsic and adaptive trends from non-linear and non-stationary time series. In contrast to other intrinsic/adaptive methods such as locally weighted regression, EMD is appealing because it is the implementation of an explicit definition of trend. The results from this evaluation suggest that EMD is a viable method for removing the trend from historical climatic variability, though further refinement and testing are required to ensure that it is implemented appropriately. Detrending is shown to have a dramatic effect on climate year classification: there appears to be almost no overlap in the detrended climate envelopes of BEC variants.

Figure 1: A time series of mean annual temperature in British Columbia’s central interior plateau. The differentiation of interannual variability from the long term climate change trend is not straightforward.

Distinguishing trend from variability

The premise of the temporal climate envelope approach is that interannual climatic variability at any given location provides a scale for assessing the ecological significance of climate change in both space and time. However, this approach requires that a distinction be made between interannual variability and climate change. In order for the interannual variability to be quantified, the climate change trend must be removed. The methodology for removal of the trend can be a critical consideration for climate change analysis, because it implicitly defines what will be considered climate change as opposed to background variability.

The time series of mean annual temperature in British Columbia’s central interior plateau (Figure 1) demonstrates that detrending is essential to analysis of interannual variability. Without detrending, regions with more rapid historical climate change will be spuriously assigned a higher range of interannual climatic variability than regions with less historical climate change. This would confound an assessment of the ecological significance of climate change based on historical variability. However, the MAT time series suggests that there is no clear distinction between interannual variability and long term climate change. Oscillations are apparent, but they occur on a continuum of magnitudes and frequencies.

Wu et al. (2007) provide an excellent characterization of the problem of trend identification in climate data. They first explicitly define the term trend as an attribute specific to an arbitrary time scale: “The trend is an intrinsically fitted monotonic function or a function in which there can be at most one extremum within a given data span.” Their definition of variability follows: “the variability is the residue of the data after the removal of the trend within a given data span.”

They note that due to the complexity of the global climate system, climate time series must be expected to be non-linear and non-stationary. In light of this expectation, Wu et al. (2007) make a distinction between extrinsic/predetermined methods for trend identification and intrinsic/adaptive methods. Extrinsic methods—including linear and non-linear regression, classical decomposition, and moving averages—impose a predetermined functional form onto the data and therefore are inappropriate for nonlinear non-stationary time series. Wu et al. (2007) advocate for intrinsic methods that are adaptive to the changing attributes of the signal throughout the time series. Specifically, they suggest empirical mode decomposition (EMD) (Huang et al. 1998) as the only method able to adequately determine the trend in nonlinear non-stationary time series.

Empirical mode decomposition (EMD)

EMD decomposes time series into intrinsic mode functions (IMFs), which are oscillations of various time scales in which (1) the numbers of extrema and zero-crossings differ by at most one, and (2) the mean of the envelopes defined by the local maxima and minima is zero. The residual of this process is a function with at most one extremum, which is the trend as defined above. EMD is an appealing approach for detrending variability. It is very widely cited (3320 citations in Web of Science), and deserves evaluation.

Figure 2: A demonstration of empirical mode decomposition (EMD) on the SBSwk3 MAT time series. This process was performed using the emd(y,x) command in R, with default parameters.

Figure 2 is a demonstration of EMD on the SBSwk3 MAT time series, using the emd(y,x) call in R with default parameters. There are some problematic features of the decomposition. For example, the prominent oscillation in the raw data from 1998 onwards seems to be very poorly decomposed: the removal of the first IMF puts subsequent IMFs out of phase and out of scale with this oscillation. This is likely an example of “mode-mixing”, which is cited as a common problem with EMD. However, the residual trend does not appear to be affected. Another potential critique of the residual trend is that it is out of phase with the dominant oscillation between 1950 and 1970. However, this is consistent with the explicit definition of trend in Wu et al. (2007); i.e. that a trend is a monotonic attribute of a given time span. Any oscillations over shorter time periods are explicitly considered to be variability. This highlights the importance of choosing a time span that is meaningful to the interpretation of the variability and the trend.
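For readers who want to reproduce this kind of decomposition, here is a minimal detrending sketch. It assumes the emd(y,x) call above comes from the EMD package, whose result includes the IMFs and a residue component; the MAT series is simulated rather than the actual SBSwk3 data:

# EMD detrending sketch with default parameters, as in Figure 2.
library(EMD)
set.seed(6)
yrs <- 1951:2010
mat <- 2 + 0.02 * (yrs - 1951) + rnorm(length(yrs), sd = 0.6)   # trend + interannual variability

dec   <- emd(mat, yrs)   # decomposition into IMFs plus a residue
trend <- dec$residue     # the residual = the "trend" sensu Wu et al. (2007)
anom  <- mat - trend     # detrended anomalies for the temporal climate envelope

plot(yrs, mat, type = "l", xlab = "Year", ylab = "MAT")
lines(yrs, trend, col = "red", lwd = 2)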

EMD provides reasonable trend extraction over the 14 annual climate variables that I am currently using for my research (Figure 3). The EMD trends are similar to locally weighted linear regression (LOWESS), which is also an intrinsic/adaptive means of trend extraction. The main advantage of EMD over LOWESS is that EMD implements an explicit definition of trend that is free of local influences. However, EMD exhibits undesirable behaviour in the time series of mean summer precipitation (MSP), where it appears to exaggerate a very subtle trend. This undesirable behaviour appears to be primarily limited to the beginning and end of the time series.  For this reason, it is likely necessary to exclude some outer portion of the time span from the temporal climate envelope. Further exploratory work is required to ensure that I am implementing the method to its full potential.

Figure 3: Trends extracted by EMD from the time series of 14 annual climate variables at the SBSwk3 centroid surrogate. Another intrinsic/adaptive trend extraction, LOWESS, is shown for comparison.

Effect of detrending on climate-year classification

Detrending reduces the interannual variability at any given location, and therefore will reduce overlaps of temporal climate envelopes. As a result, it is logical that detrending will improve differentiation of BEC variants in a climate year classification. This effect is demonstrated in Figure 4. Notably, quadratic discriminant analysis achieves almost perfect differentiation on the detrended data (97% correct classification rate on average). This is a remarkable result, given that preliminary results using linear discriminant analysis on raw data suggested that there was approximately 50% overlap in BEC variant temporal climate envelopes. Since quadratic discriminant analysis classifies based on decision boundaries, the classification of detrended data indicates that there is almost no overlap in BEC variant temporal climate envelopes.

Figure 4: Effect of detrending on climate year classification results for the 1961-1990 normal period.

 

Effect of detrending on feature extraction (PCA)

Detrending applies a different adjustment to each year at each location (i.e. BEC variant centroid surrogates) in the spatiotemporal data, and is therefore an alteration of the structure of the data. However, this distortion appears to be subtle (Figure 5). Detrending increases the eigenvalues of the last four PCs. This is likely due to decorrelation of derived climate variables: the trends extracted add some noise to the relationship between the independent variables (e.g. MAT, MAP) and derived variables (e.g. DD5, Eref).

Figure 5: Effect of detrending on the variability structure of the multivariate climate space (1961-1990 data).

 

Appropriate time span for trend extraction

Consistent with their definition of trend, Wu et al. (2007) note that trends extracted by EMD are dependent on the time span of the data. This raises the issue of the appropriate time span to use for detrending of temporal climate envelope data. So far, I have used a time span of 1951-2010 (60 years) because this is the period of most reliable weather station data. However, temporal climate envelopes will very likely be developed for normal periods (30-year time span), so it is reasonable to consider using 30 years as the time span for trend extraction.

EMD trends for 30- vs. 60-year periods are compared in Figure 6. EMD exhibits some erratic behaviour at the beginning and end of the 30-year time spans, i.e. unsupported upswings or downswings. However, even in cases where it behaves appropriately on 30-year time spans (e.g. MCMT), it extracts oscillations that could reasonably be considered to be part of background (non-trend) climatic variability. This demonstration indicates that, using EMD or any other method, normal periods are likely too short for trend identification. 60 years appears to be a necessary time span for trend extraction, even though detrended temporal climate envelopes are calculated from 30-yr normal periods.

Figure 6: Comparison of EMD trend extraction over 30- vs. 60-year time spans of the time series of 14 annual climate variables at the SBSwk3 centroid surrogate.

 

References

Huang, N. E., Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N.-C. Yen, and C. Chao. 1998. The Empirical Mode Decomposition and the Hilbert Spectrum for Nonlinear and Non-Stationary Time Series Analysis. Proceedings of the Royal Society of London A 454:903–995.

Wu, Z., N. E. Huang, S. R. Long, and C.-K. Peng. 2007. On the trend, detrending, and variability of nonlinear and nonstationary time series. Proceedings of the National Academy of Sciences of the United States of America 104:14889–94.

Method Selection for Climate Year Classification of BEC variants

Summary

In this post, I do climate year classifications on BEC variants at the provincial scale.  I test the performance of variations on four distinct classification methods: nearest-neighbour classification, linear discriminant analysis, quadratic discriminant analysis, and Random Forests. Quadratic discriminant analysis is the clear winner, though the poor performance of Random Forests deserves a closer look. I do some simulations to illustrate the behaviour of these different methods. I establish that the “curse of dimensionality” is not apparent in this analysis, except for Random Forests. Finally, I provide a summary of QDA classification by BEC zone, indicating that BEC zones of the southern and central interior have significantly lower differentiation than those of the coast and northern interior.

Take-home message: QDA is much more powerful than nearest-neighbour classification on principal components at defining the climatic domains of BEC variants (79% vs. 36% correct classification rate, respectively). This indicates that Mahalanobis distance metrics (e.g. nearest neighbour on standardized variables, standard deviation of z-scores) are not sufficient to quantify climatic differentiation in a temporal climate envelope analysis. A more sophisticated distance metric is likely required for temporal climate envelope analysis.

Introduction to Climate Year Classification

A climate year classification applies a climate classification scheme to the variable climatic conditions of individual years at a single location, rather than to spatial variability of climate normals. This approach can be applied to BEC variants. The characteristic conditions of BEC variants can be identified as the centroid of their spatial variation in a set of climate normals. The individual years of the normal period at a location representing that centroid (the “centroid surrogate”) can then be classified based on their relative similarity to the conditions of the centroids.

Unsupervised classification using supervised classification procedures

Typical supervised classification procedures split data with known classes into training data and test data. The training data are used to parameterize the classification model, and the test data are used to evaluate the effectiveness of the model. In supervised classification, the proportion of test observations that are correctly classified is a definitive criterion for model effectiveness. The correct classification rate indicates the degree of confidence with which the model can be used for “predictive” classification of new data with unknown classes.

In contrast to this conventional supervised classification approach, the objective of climate year classification is description, rather than prediction.  Climate years have an a priori identity: the BEC variant of the centroid surrogate that they belong to. However, the actual class of the climate year is not known a priori. A confusion matrix in this context is a description of the characteristic conditions (in terms of BEC variants) of the climate years for each BEC variant centroid surrogate. In this context, classification “error” is a misnomer since the class of the climate year is not known a priori. Nevertheless, correct classification rate—the proportion of the climate years of a BEC variant that are classified as that BEC variant—is an ecologically meaningful measure of the climatic differentiation between BEC variants. In this way, climate year classification uses the validation tools of supervised classification—confusion matrices and correct classification rates—as the descriptive outputs of an unsupervised classification of climate years.

Climatic domains of BEC variants

Most classification algorithms assign classes to observations based on decision boundaries in the data space. Each class therefore is associated with a hypervolume (contiguous region of the data space), and all observations that fall into that region will be assigned to that class. In climate year classification, this hypervolume can be thought of as the “domain” of the BEC variant. BEC variants are map units of ecologically significant differences in climate, so the shape of this domain reflects to some extent the thresholds of ecological response to climate variation. However, the shapes of the domains also reflect to some extent the quality of the BEC mapping.

Figure 1: illustration of the BEC variant “domain” concept in climate year classification. In this simplified example of a nearest-neighbour classification in two dimensions, domain boundaries are lines of equal distance between BEC variant climatic centroids.

The climatic domain concept is illustrated in Figure 1. For simplicity, this illustration shows a nearest-neighbour climate year classification in a two-dimensional climate space. There are several attributes of this illustration that apply to climate year classification in general:

  • The BEC variant’s climatic centroid is rarely at the centroid of the BEC variant’s domain. In this illustration, the centroid of the CWHdm is very close to the domain boundary, due to the proximity of the CWHxm2 centroid. This proximity may reflect true ecological thresholds, errors in the BEC itself, or insufficient dimensionality of the climate space.
  • Domains are unbounded for BEC variants on coastal (e.g. CWHvh1), elevational (e.g. MHwh1), or political (e.g. CDFmm) boundaries. This creates the potential for misclassification of novel climatic conditions and highlights the need for additional decision boundaries to bound the domains of BEC variants on all sides.
  • The climate space is not fully occupied. For example, there is a gap between the CWHwm and the MHmm1s. This gap may indicate climatic conditions that are truly absent from the landscape, or BEC units that are defined too broadly.

Candidate methods for climate year classification

Correct classification rate depends on many factors: (1) climate year variability relative to the distance to adjacent centroids, (2) whether the domain is bounded on all sides, and (3) whether the climate space includes the necessary dimensions of differentiation. In addition, classification method is an important consideration in ensuring that BEC variant domains are appropriately defined. The following classification methods are evaluated in this analysis:

  1. nearest-neighbour classification on standardized raw variables (“raw”)
  2. nearest-neighbour classification on principal components (“pca”)
  3. linear discriminant analysis (“lda”)
  4. nearest-neighbour classification on linear discriminants (“nnlda”)
  5. linear discriminant analysis on principal components (“ldapc”)
  6. linear discriminant analysis without normalization of within-groups covariance (“diylda”)
  7. quadratic discriminant analysis (“qda”)
  8. Random Forests (“rf”)

Some of these methods are theoretically equivalent, but nevertheless were included in the analysis to verify this equivalency in practice.

The data for this analysis are the 1961-1990 climate years at the 168 centroid surrogates representing forested BEC variants in British Columbia. I used 14 annual climate elements as the raw variables: MAT, MWMT, MCMT, TD, MAP, MSP, AHM, SHM, DD_0, DD5, FFP, PAS, Eref, and CMD. I log-transformed precipitation and heat-moisture variables (MAP, MSP, AHM, SHM, PAS) to normalize their distributions. All variables were standardized to a mean of zero and variance of 1.
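The variable preparation can be sketched as follows (a stand-in data frame with a subset of the 14 variables; the simulated values are for illustration only):

# Log-transform the precipitation and heat-moisture variables, then
# z-standardize everything to mean 0 and variance 1.
set.seed(7)
clim <- data.frame(MAT = rnorm(100, 5, 2),    MAP = rlnorm(100, 6, 0.5),
                   MSP = rlnorm(100, 5, 0.4), AHM = rlnorm(100, 3, 0.3),
                   SHM = rlnorm(100, 4, 0.3), PAS = rlnorm(100, 5, 0.6))

log.vars <- c("MAP", "MSP", "AHM", "SHM", "PAS")
clim[log.vars] <- lapply(clim[log.vars], log)    # normalize skewed distributions
clim.std <- as.data.frame(scale(clim))           # mean 0, variance 1 for every variable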

Figure 2: comparison of correct classification rate (proportion of climate years in home domain) for eight classification methods: nearest-neighbour on standardized raw variables (“raw”); nearest-neighbour on principal components (“pca”); linear discriminant analysis (“lda”); nearest-neighbour on linear discriminants (“nnlda”); linear discriminant analysis on principal components (“ldapc”); linear discriminant analysis without normalization of within-groups covariance (“diylda”); quadratic discriminant analysis (“qda”); and random forests (“rf”).

Figure 2 shows correct classification rate for each of the classification methods. The following observations are notable:

  • Nearest neighbour classification returned identical results on the raw standardized variables and the principal components. This is logical since PCA does not transform the data, only rotates the axes, which preserves Euclidean distances between observations.
  • LDA outperforms PCA substantially. Since all features were retained for both classifications, the superiority of LDA is not due to better feature extraction. Instead, it is because the coefficients of the linear discriminants incorporate a transformation of the data based on the ratio of between-groups to within-groups variance.
  • The results of lda, nnlda, and ldapc classification are the same. This indicates that the predict(lda()) function in R is simply performing a nearest-neighbour classification on the linear discriminants. It also indicates that LDA on the PCs is functionally equivalent to LDA on the raw data.
  • LDA without spherical normalization of the within-groups covariance matrix (“diylda”) has intermediate performance between LDA and PCA, indicating that SW normalization is an important component of LDA classification.
  • QDA vastly outperforms all of the other methods. This indicates that the within-groups covariance matrices are unequal, and also suggests that non-linear decision boundaries are more appropriate for these data.
  • Random Forests had the poorest median performance of all the methods. This is surprising, given that Wang et al. (2012) found that Random Forests outperformed LDA on a classification of spatial variation of climate normals. This poor result may be due to the use of default parameters in R, and should be investigated further.

Simulations of PCA, LDA, QDA, and Random Forests

To illustrate the behaviour of the four distinct classification methods, they were applied to simulated two-dimensional data of three groups with unequal within-groups covariance matrices (Figure 3). In this case, LDA provided a marginal improvement over nearest-neighbour classification. QDA provided a substantial improvement, due in part to the ability of the method to discern groups with different variance, but also due to the use of a parabolic decision boundary between groups 1 and 2. Random Forests does not fit a smooth parametric decision boundary, which is evident in two bizarre misclassifications in the core domains of groups 1 and 2.
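A simulation along these lines can be reproduced with the sketch below (group means and covariances are invented; the point is only to contrast the behaviour of the methods):

# Three 2-D groups with unequal covariance matrices, classified by nearest
# centroid, LDA, QDA, and Random Forests.
library(MASS)            # mvrnorm(), lda(), qda()
library(randomForest)    # randomForest()
set.seed(8)

n   <- 100
g1  <- mvrnorm(n, c(0, 0),   diag(c(0.3, 0.3)))
g2  <- mvrnorm(n, c(3, 0),   matrix(c(2, 1.2, 1.2, 1), 2))
g3  <- mvrnorm(n, c(1.5, 4), diag(c(1, 0.4)))
dat <- data.frame(rbind(g1, g2, g3), grp = factor(rep(1:3, each = n)))

fit.lda <- lda(grp ~ X1 + X2, data = dat)
fit.qda <- qda(grp ~ X1 + X2, data = dat)
fit.rf  <- randomForest(grp ~ X1 + X2, data = dat)

# nearest-centroid classification for comparison
cen <- aggregate(cbind(X1, X2) ~ grp, dat, mean)
nn  <- apply(dat[, c("X1", "X2")], 1, function(z)
         which.min((cen$X1 - z[1])^2 + (cen$X2 - z[2])^2))

c(centroid = mean(cen$grp[nn] == dat$grp),
  lda      = mean(predict(fit.lda)$class == dat$grp),
  qda      = mean(predict(fit.qda)$class == dat$grp),
  rf       = mean(predict(fit.rf) == dat$grp))   # Random Forests: out-of-bag predictions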

Figure 3: classifications of simulated two-dimensional data, illustrating the behaviour of nearest-neighbour classification, linear discriminant analysis, quadratic discriminant analysis, and Random Forests. Point symbols are the a priori identity of the data, and colours are the classes assigned by the classification methods.

Is the curse of dimensionality at play? 

To avoid artefacts of dimension-reduction, all classifications in this trial were performed on full-dimensional data, i.e. where feature extraction is performed, all features are retained.  There are 14 raw variables, and 14 extracted features. This approach raises the spectre of the “curse of dimensionality”. This term generally refers to the prohibitive sample sizes required to infer statistical significance in high-dimensional data. However, effective data sparsity may also affect inferences of similarity. To test the effect of dimensionality on climate year classification, I performed climate year classifications on reduced climate spaces composed of 2 to 14 extracted features (Figure 4). QDA and Random Forests were performed on linear discriminants, for simplicity in the case of the former and by necessity in the case of the latter.
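The dimensionality test can be sketched as follows (simulated 14-dimensional stand-in data; the real test uses the climate years at the centroid surrogates, and LDA stands in here for the full set of methods):

# Classify on the first d extracted features, for d = 2..14, and record the
# correct classification rate.
library(MASS)
set.seed(9)
p   <- 14
grp <- factor(rep(1:6, each = 50))
X   <- matrix(rnorm(300 * p), ncol = p) + 1.5 * matrix(rnorm(6 * p), 6)[as.integer(grp), ]
pcs <- prcomp(X)$x

rate <- sapply(2:p, function(d) {
  fit <- lda(pcs[, 1:d, drop = FALSE], grouping = grp)
  mean(predict(fit, pcs[, 1:d, drop = FALSE])$class == grp)
})
plot(2:p, rate, type = "b", xlab = "features retained", ylab = "correct classification rate")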

Figure 4: Response of correct classification rate to number of features retained. This result indicates the “curse of dimensionality” is not evident in the classifications, except for a subtle effect in Random Forests.

All classification methods except for Random Forests respond positively to increasing number of features, indicating that the curse of dimensionality is not at play in these methods at this scale of dimensionality. Random forests exhibits a decline in correct classification rate after 9 dimensions, suggesting that it is more sensitive to dimensionality than the other methods. The consistent incremental response of QDA to increasing dimensions is striking, and suggests that addition of more raw variables could improve classification further. However, this result is likely due to the use of linear discriminants as the input data to QDA. This test could be refined by projecting the data onto quadratic discriminants.

QDA classification summaries for BEC zones

One of the purposes of climate year classification is to understand the relative climatic differentiation in different regions of the province. Low differentiation in a BEC zone suggests that BEC mapping is at a finer climatic scale, either because plant communities are more sensitive to subtle climatic differences, or because mapping approaches differed amongst regional ecologists responsible for the BEC mapping.

Figure 5 indicates that the southern and central interior BEC zones have lower differentiation than the coastal and northern interior zones. It is likely that this effect would be even more pronounced if the classification was controlled for unbounded domains.

Figure 5: QDA correct classification rate, summarized by BEC zone. The coastal (CWH/MH) and northern interior (SWB and BWBS) regions have higher correct classification rates than the southern and central interior regions.

Possible Next Steps

  • Develop a method for controlling for unbounded domains. This will likely involve applying a boundary to a certain distance (e.g. group-specific standard deviation) from each group centroid. This approach will allow identification of novel conditions.
  • Investigate any limitations of QDA for climate year classification.
  • Verify the poor performance of Random Forests.
  • Test the methods on a high-dimensional data set (e.g. all monthly variables).

References

Wang, T., E. M. Campbell, G. A. O’Neill, and S. N. Aitken. 2012. Projecting future distributions of ecosystem climate niches: Uncertainties and management applications. Forest Ecology and Management 279:128–140.

Experimenting with linear discriminant analysis for temporal climate space analysis

In the previous post I created a climate space using PCA on temporal climate variability at BEC variant centroid surrogates. This approach had the advantages of providing a climate space in which Mahalanobis distance can be readily measured using Euclidean methods, and a metric—the z-score—which has direct meaning about the relative variability of the input variables. However, the PCA climate space gave too much emphasis to climate elements (e.g. TD, MCMT) with high temporal (within-group) variability but low spatial (between-group) variability. For most applications, that type of variability is not very ecologically meaningful. LDA is designed to emphasize variables with high ratios of between-group to within-group variance, so LDA can potentially create a more suitable climate space. LDA has the additional advantage of providing an analysis of classification error rate, which translates nicely into a climate year classification, i.e. what proportion of years are outside the “home” domain of each BEC variant. So my next step was to test-run LDA on the data set of temporal climate variability at BEC variant centroid surrogates.

I replicated the analysis of the previous post, using LDA instead of PCA. Specifically, I used the lda() function of the MASS package in R. The most striking difference between the PCA and the LDA is that the LDA produces a climate space in which the within-groups variation is approximately equal to one in all dimensions (Figure 1). In contrast, the PCA produced a climate space in which the within-groups variation reflected the scale of variability in the raw data. The LDA result is highly problematic, because it obscures the structure of interannual climatic variability in the data. This result is due to a normalization procedure carried out in the lda() call: the R help file specifies that the scaling matrix (discriminating coefficients) is “normalized so that within groups covariance matrix is spherical.” This procedure is hard-wired into the call, and I couldn’t figure out how to reverse the normalization in the d-scores.

Figure 1: structure of total and within-group variation of discriminant scores resulting from a “canned” LDA using the “lda()” command from the {MASS} package of R. The mean within-group standard deviation is approximately equal to one due to normalization of the within-groups covariance matrix.

To produce an LDA without normalization of the within-groups covariance matrix (Sw), I did a “home-made” LDA in R based on a very clear explanation of the first principles of LDA provided online by Dr. Ricardo Gutierrez-Osuna at Texas A&M University. My R code for this procedure is provided at the end of this post. The structure of variation is much clearer in the home-made LDA (Figure 2), and gives an elegant illustration of how LDA extracts features with the highest ratio of between-groups to within-groups variation. Note, for example, how LD6 is an axis of high variation: it has within-groups variation similar to LD1 and LD3, and total variation much greater than LD3 and LD4. It is lower in the order of LDs due to the ratio of between to within. LD4, in contrast, represents an axis of very low variation, but is prioritized because its between:within ratio is relatively large. For the purpose of classification, this structure is fine. However, for the purposes of dimension reduction this result is poor: LD4 and LD5 should clearly not be prioritized over LD6 and LD7. The structure of the PCA is likely preferable for dimension reduction.

Figure 2: structure of total and within-group variation of discriminant scores resulting from a “home-made” LDA that does not perform spherical normalization of the within-groups covariance matrix. The structure of climatic variation in the LDs is much more clear.

Figure 3: ratio of standard deviations of between- and within-group variation. This graph demonstrates the equivalence of this ratio with the eigenvalues of the linear discriminants.

How do these different methods compare in terms of classification? LDA classification is a nearest-neighbour classification of the data projected onto the linear discriminants. Based on this principle, I did a nearest-neighbour classification of the data in the full-dimensional (14D) PCA, LDA, and home-made LDA climate spaces. In this analysis, “correct” classification is the assignment of an observation to its own group. In other words, a climate year is “correctly” classified if it falls into the home domain (Voronoi cell) of its own BEC variant. The correct classification rates of the PCA, home-made LDA, and canned LDA are 36%, 45%, and 53%, respectively (Figure 4). Clearly, Sw normalization is advantageous for classification. However, it is interesting that the classification error rates are so different. Given that all classifications were done on the full-dimensional data (14 extracted features from 14 raw variables), the classification effectiveness is not due to feature extraction in the sense of eigenvector selection. Instead, it appears that the differences in classification error rates are related to how the data are projected onto the extracted features. I will do some simulations to clarify this issue in the next post.

Figure 4: comparison of correct classification rates (climate years classified in their own BEC variants) for the PCA, the “canned” LDA with normalization of the within-groups covariance matrix (Sw), and the “home-made” LDA without Sw normalization.

Example R Code for “home-made” Linear Discriminant Analysis

# Linear Discriminant Analysis from first principles.
#development of method using 2-group, 2D discrim analysis
# following http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf

library(MASS) #contains the eqscplot(x,y) function, which plots axes of equal scale

x1 <- c(4,2,2,3,4,9,6,9,8,10)
x2 <- c(1,4,3,6,4,10,8,5,7,8)
X <- data.frame(x1,x2)
A <- c(1,1,1,1,1,2,2,2,2,2)
eqscplot(x1,x2,pch=A)

Ni <- table(A) #population size N for each group i
MUi <- aggregate(X,by=list(A),FUN=mean) #group mean vectors
MUi$Group.1 <- NULL

#within-groups Scatter matrices (covariance matrix for group i multiplied by group population size Ni)
Si <- list()
for(i in 1:length(levels(factor(A)))){
Si[[i]] <- var(X[which(as.numeric(A)==i),])*(Ni[i]-1)/Ni[i] #multiply by (N-1)/N because the var function calculates sample variance instead of population variance
}

#total within class scatter Sw (Matrix sum of the within-groups Scatter matrices)
SW <- Reduce('+', Si)
dim(SW)

#between-groups scatter matrix (scatter of the group means; the true between-groups scatter
#weights each group mean by its size Ni, but for equal group sizes this differs only by a
#scalar factor, which does not change the eigenvectors of solve(SW)%*%SB)
SB <- var(MUi)*length(levels(factor(A)))
dim(SB)

#calculate the matrix ratio of between-groups to within-groups scatter (the command solve() provides the inverse of a square matrix)
J <- solve(SW)%*%SB
W.vec <- eigen(J)$vectors
W.val <- eigen(J)$values

# project the raw data onto the eigenvectors of J
Z<- as.matrix(X)%*%eigen(J)$vectors

Cor <- cor(X,Z) #correlations between LDs and Variables

##coordinates of grid points and pseudocentroids
x<- Z[,1]-mean(Z[,1])
y<- Z[,2]-mean(Z[,2])
xhat<-tapply(x, INDEX = A, FUN = mean, na.rm=TRUE)
yhat<-tapply(y, INDEX = A, FUN = mean, na.rm=TRUE)

###################### LD1xLD2 plot
eqscplot(x, y, pch = A, col = A, cex=1)
text(xhat,yhat, as.character(levels(factor(A))), col="black", cex=0.7, font=1) ##labels at the centroids
#add in Biplot arrows
EF <- 1 #arrow length modifier
arrows(0,0,Cor[,1]*EF*sd(x),Cor[,2]*EF*sd(x),len=0.1, col="red")
text(1.1*Cor[,1]*EF*sd(x),1.1*Cor[,2]*EF*sd(x), rownames(Cor), col="red", xpd=T, cex=0.9, font=2)
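As a usage example, the nearest-centroid classification rule compared in Figure 4 can be appended to the script above for the toy data (this continues the objects Z and A defined earlier; it is a sketch of the rule, not the full BEC analysis):

# nearest-centroid classification in the discriminant space Z
MUz  <- as.matrix(aggregate(Z, by = list(A), FUN = mean)[, -1])  # group centroids in LD space
pred <- apply(Z, 1, function(z) which.min(rowSums(sweep(MUz, 2, z, "-")^2)))
mean(pred == A)   # proportion of observations falling in their own group's domain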