The random forest model was introduced by Breiman in 2001, and has since become a widely used classification method across disciplines. It is based on classification and regression trees (CART), and makes predictions by aggregating an ensemble of trees, each fitted to a subset of the available data (Breiman, 2001; Pawley, 2018). Random forest classifies each observation as the most common class predicted by the individual trees. As the number of trees increases, the aggregate classification of all trees converges to a limit which is more likely to reflect the actual class than the prediction of a traditional single-tree CART model (Breiman, 2001).
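The majority-vote aggregation described above can be sketched in a few lines. This is an illustrative fragment, not part of any cited implementation; the class labels and the `majority_vote` helper are hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Aggregate per-tree class predictions into a single forest prediction
    by taking the most common class among the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one observation:
votes = ["suitable", "unsuitable", "suitable", "suitable", "unsuitable"]
print(majority_vote(votes))  # -> suitable
```

As more trees vote, the aggregated prediction stabilises, which is the convergence property Breiman (2001) proves for the ensemble.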
The method of fitting with random subsets of the training data provides a built-in verification method, the out-of-bag error estimate, which tests the accuracy of subgroups of trees on the data points excluded from their training subsets (Breiman, 2001; Pawley, 2018). This allows all known points to be used in training, rather than withholding a subset for model verification. In addition, it is common to perform k-fold cross-validation to further assess model accuracy (Barnard et al., 2019; Evans et al., 2011; Pawley, 2018). Like other machine learning techniques, random forest does not rely on assumptions about the distribution of the underlying data or on any predefined relationship between input and response parameters (Garzon et al., 2006; Pawley, 2018). These properties make it easier to model complex systems with a large set of variables. Many other machine learning models are prone to overfitting (Garzon et al., 2006; Pawley, 2018), in which the predictions depend too strongly on the training points, producing highly non-linear correlations that do not reflect the underlying system. Because the random forest technique aggregates many individual models, it is less prone to overfitting than single CART models (Pawley, 2018).
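Both accuracy checks mentioned above can be illustrated with a short example. This sketch assumes scikit-learn, which the cited studies do not necessarily use, and a synthetic dataset in place of real ecological data; `oob_score_` and `cross_val_score` are scikit-learn's implementations of the out-of-bag estimate and k-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# oob_score=True scores each tree on the bootstrap-excluded ("out-of-bag")
# observations, so no separate hold-out set is required.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)

# k-fold cross-validation (here k=5) as an additional accuracy check.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
)
print("5-fold CV accuracy:", scores.mean())
```

The two estimates typically agree closely, which is why the out-of-bag score is often used as a cheap substitute for a full cross-validation run.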
Machine learning methods tend to be more accurate in ecological modelling than Bayesian approaches due to their versatility. Ecological systems are complex, and often exhibit non-linear correlations, spatial auto-correlation, non-stationarity, anisotropy, and other properties which cannot be accounted for with traditional statistical techniques (Evans et al., 2011). While the random forest method has only recently been widely applied in ecological modelling (Garzon et al., 2006), the results are very promising: a number of studies have found that random forest predicts species distributions better than both alternative machine learning and statistical methods (Barnard et al., 2019; Cutler et al., 2007; Evans et al., 2011; Garzon et al., 2006; Heikkinen, Marmion & Luoto, 2012; Mi et al., 2017).
In addition to high accuracy, the random forest method allows us to determine the most predictive variables from a large set of inputs, and is highly flexible in its applications (Cutler et al., 2007). Random forest uses a novel method for assessing parameter importance which does not rely on measures of statistical significance (Cutler et al., 2007). Because each tree is fitted to a bootstrap sample of the data, every tree has a set of excluded, or "out-of-bag", observations. To assess a variable's importance, its values are randomly permuted among the out-of-bag observations, and the resulting increase in misclassification rate is aggregated across trees (Cutler et al., 2007). Cutler et al. (2007) found that, unlike in other methods, the parameters assigned high importance by the random forest algorithm reflect the factors controlling habitat suitability reported in the literature.
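The permutation-importance idea can be demonstrated with a small sketch. This assumes scikit-learn, whose `permutation_importance` permutes features on a supplied held-out set rather than strictly on each tree's out-of-bag observations, so it is an approximation of Breiman's scheme; the synthetic dataset (two informative features among ten) is also an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Two informative features (columns 0 and 1, since shuffle=False); the rest are noise.
X, y = make_classification(n_samples=400, n_features=10, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permute each feature in turn and measure the drop in accuracy; large drops
# mark the most predictive variables.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print("Features ranked by importance:", ranking)
```

On this synthetic data the informative columns rise to the top of the ranking, mirroring Cutler et al.'s (2007) observation that the variables random forest flags as important tend to match the known drivers of the response.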