Random Forest Modelling in R: Step by Step Process

What you will need:

  • A training dataset – this is either obtained from pre-existing data or by conducting field work to collect training points. In my case, there is an existing wetland database of over 6,000 wetland polygons throughout the Okanagan, covering the classes fen, marsh, swamp, and shallow open water. Your model will only be as good as your training dataset.
  • A raster stack of the parameters you will use to predict wetland occurrences. In my case I have 32 parameters.
  • A computer with decent processing power.
  • The following R packages:
    • library(raster)
    • library(sf)
    • library(dplyr)
    • library(stringr)
    • library(tidyr)
    • library(janitor)
    • library(mlr3)
    • library(ranger)
    • library(randomForest)
    • library(tidyverse)
    • library(mlr3spatiotempcv)
    • library(mlr3learners)
    • library(mlr3tuning)
    • library(mlr3verse)

Step One: Create a stack of all your parameters

Recall from earlier posts that I have already calculated the parameters I will be using, which include DEM, TWI, TPI, optical data from Sentinel, distance from streams, and grid metrics. I have included example code below with a list of 19 parameters; however, I have since run the model using the following 32 parameters:

c('NDVI', 'NDWI', 'B2', 'B3', 'B4', 'B8', 'npol', 'DEM', 'TWI', 'TPI1', 'TPI10', 'sdist', 'zmean', 'zmax', 'zsd', 'imax', 'imean', 'imin', 'isd', 'TPI25', 'pro_curv', 'TRI', 'slope', 'aspect', 'diurnal', 'mrvb100', 'd', 'sd_ground', 'zmin', 'entropy', 'chm', 'mrvb10')

Each raster layer (all saved as .tif) must have the same resolution, CRS, and extent for the stack to build properly.

Set aside this large raster stack for now; we will come back to it.
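
As a minimal sketch of Step One (the folder name "parameters" and the object names here are illustrative assumptions, not from my actual script):

#list all .tif layers saved in one folder
library(raster)
param_files <- list.files("parameters", pattern = "\\.tif$", full.names = TRUE)

#stack() will only succeed if every layer shares the same resolution, CRS, and extent
param_32 <- raster::stack(param_files)
names(param_32)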

Step Two: Create a training database of 100,000 points

You will need a long list of training points with geographic location as well as wetland class. In my case I created 100,000 training points randomly selected from within 3,665 wetland polygons. The count of each wetland type is listed below. Marsh, swamp, and shallow open water are the most commonly occurring types; fen is the rarest. The original wetland database contained only 37 non-conflicting fen classifications within my study area. There were an additional 151 classifications that contained "fen" but were also classified as swamp, marsh, or shallow open water, for instance fen_marsh or swamp_fen_marsh.

I conducted 7 days of field work with the objective of validating the conflicting classifications containing "fen". After visiting and classifying approximately 20 wetland sites, 12 of which were fens, I became familiar with the site characteristics of this unique wetland class and was able to modify the conflicting polygons using air photo interpretation, pulling the fen areas out of these wetland complexes. This increased the number of fen polygons from 37 to 175.

#wet_sel_new$Class      n    percent
#Fen                  175    0.04774898
#Marsh               1922    0.52442019
#Shallow_water        776    0.21173261
#Swamp                761    0.20763984
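
Below is a hedged sketch of how the points could be generated with sf::st_sample(); the file name wetlands.shp and the Class column are placeholders for your own wetland database. Note that st_sample() over polygons samples the bounding box and clips, so the returned count can differ slightly from the requested size.

library(sf)
library(dplyr)

#read the wetland polygons
wetlands <- st_read("wetlands.shp")

#sample ~100,000 points across all polygons (roughly proportional to area)
pts <- st_sample(wetlands, size = 100000)

#attach the wetland class of the polygon each point falls within
train_pts <- st_join(st_sf(geometry = pts), wetlands["Class"])

#check the class balance
count(st_drop_geometry(train_pts), Class)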

Step Three: Merge the 100,000 wetland points with the stacked parameters

Now merge the shapefile of 100,000 training points from Step Two (geographic location plus wetland class) with the raster stack you created in Step One.

You will create a new data frame by extracting the values of the raster stack (param_32) at the randomly sampled wetland points.

Be sure to drop rows containing NA, as random forests cannot handle missing values in predictors. Also drop classes with too few sample points to train on (e.g. Alkaline below).

#Class                  n      percent
#Alkaline               8      8.546096e-05
#Fen                 4479      4.784745e-02
#Marsh              49340      5.270804e-01
#Shallow_water       5781      6.175622e-02
#Swamp              34002      3.632304e-01
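
Here is a condensed sketch of that extraction step; param_32 (Step One) and train_pts (Step Two) are assumed object names.

library(raster)
library(sf)
library(dplyr)

#extract the parameter values at each training point
vals <- raster::extract(param_32, st_coordinates(train_pts))
data_sf <- bind_cols(train_pts, as.data.frame(vals))

#drop rows with NA predictors (random forests cannot handle missing values)
data_sf <- data_sf[complete.cases(st_drop_geometry(data_sf)), ]

#drop classes with too few points to train on
data_sf <- filter(data_sf, Class != "Alkaline")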

Code for steps one to three above has been included here: Setting up to train RF model

Step Four: Training the model

This is where it gets complicated.

  • classif.ranger specifies the type of learner we are using.
  • #wetland Class as factor
    data_sf$Class <- as.factor(data_sf$Class)
  • #set up the classification task
    task = TaskClassifST$new("data_sf", backend = data_sf,
    target = "Class")
  • #create the ranger classification learner
    learner = mlr3::lrn("classif.ranger", oob.error = TRUE,
    importance = "impurity", predict_type = "prob")

The task holds our data; the learner is the model we will tune and train.

Note: the out-of-bag (OOB) error is calculated for every tree and then averaged across the forest. Classification accuracy is based on the OOB error; above 70% is good.

Impurity measures how mixed the classes are at each split; choosing it as the importance measure gives an importance score for each predictor variable.
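
Before tuning, a few illustrative sanity checks on the two objects (assuming the task and learner created above):

task$class_names   #the four wetland classes
task$nrow          #number of training points
learner$param_set  #hyperparameters available for tuning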

  • #hyperparameter search space
    search_space = ps(
    mtry = p_int(lower = 4, upper = 10),
    num.trees = p_int(lower = 500, upper = 2000),
    sample.fraction = p_dbl(lower = 0.5, upper = 0.8),
    max.depth = p_int(lower = 20, upper = 100),
    min.node.size = p_int(lower = 20, upper = 100)
    )

Tuning the hyperparameters with mlr3 tests different combinations of values drawn from the ranges you set.

num.trees is the number of trees in the forest. You need enough trees for stable error estimates, but beyond a point more trees only add computation time.

mtry is the number of predictor variables randomly considered at each split.

In a decision tree, the top node splits on the variable that does the most to reduce variability in the data.

It should be noted that the same variable may be used multiple times in the same tree at different node locations.
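
You can verify this after training: ranger's treeInfo() lists the split variable at every node of a given tree, and the same splitvarName can appear at several nodeIDs. (This assumes a fitted learner, so run it after the training step below.)

#inspect the node-by-node structure of the first tree
library(ranger)
head(treeInfo(learner$model, tree = 1))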

  • #create spatial resampling k-fold 10
    resampling_sp = rsmp("repeated_spcv_coords", folds = 10, repeats = 1)

K-fold means the data are split into k folds (here 10 spatially distinct groups of coordinates); each fold is held out for testing once while the remaining folds train the model.

  • #classification error
    measure = msr("classif.ce")
  • #terminate after 100 evaluations so it doesn't get hung up
    evals100 = trm("evals", n_evals = 100)
  • #create the tuning instance
    instance = TuningInstanceSingleCrit$new(
    task = task,
    learner = learner,
    resampling = resampling_sp,
    measure = measure,
    search_space = search_space,
    terminator = evals100
    )
  • #create the tuner
    tuner = tnr("grid_search", resolution = 5)
  • #optimize the hyperparameters THIS WILL TAKE A LONG TIME TO RUN
    tuner$optimize(instance)

The tuning instance wraps the task, learner, resampling, measure, search space, and terminator into a single object.

A grid-search tuner with resolution 5 tries five values per hyperparameter (5^5 = 3,125 candidate configurations here) instead of every possible combination, and the terminator caps the search at 100 evaluations.

Optimizing the hyperparameters finds the best configuration, which then becomes the model you use. The "learner" is the model.
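
Once tuning finishes, a hedged sketch of applying the winning configuration and training the final model (using the instance and learner created above):

#copy the best hyperparameters into the learner
learner$param_set$values <- instance$result_learner_param_vals

#train the tuned learner on the full task
learner$train(task)

#OOB error and per-variable impurity importance from the fitted ranger model
learner$model$prediction.error
sort(learner$model$variable.importance, decreasing = TRUE)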

Additional resources:

mlr3 manual: Introduction and Overview, mlr3 book (mlr-org.com)

Multilabel classification: https://mlr.mlr-org.com/articles/tutorial/multilabel.html 

Intro to machine learning course: https://introduction-to-machine-learning.netlify.app/

Random Forests · UC Business Analytics R Programming Guide (uc-r.github.io)

Hands-On Machine Learning with R (bradleyboehmke.github.io)

sample points on or in (sets of) spatial features — st_sample • sf (r-spatial.github.io)

Random forest and land cover classification: https://gis.stackexchange.com/questions/39021/how-to-perform-random-forest-land-cover-classification/39103

System time and cluster for parallel processing: https://www.youtube.com/watch?v=fal4Jj81uMA

Lang, Michel, Martin Binder, Jakob Richter, Patrick Schratz, Florian Pfisterer, Stefan Coors, Quay Au, Giuseppe Casalicchio, Lars Kotthoff, and Bernd Bischl. 2019. “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, December. https://doi.org/10.21105/joss.01903.

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
All code was run using R version 4.0.3 (2020-10-10) — “Bunny-Wunnies Freak Out”.
RF explanation: https://towardsdatascience.com/implement-random-forest-in-r-b00b69eb8501
