1. INTRODUCTION

This technical report outlines the Random Forest modeling workflow used to map carbon stocks in the organic soil layers (forest floor: L and FH horizons). The process is divided into six stages, each documented in separate scripts located within the /scripts folder:

  1. Exploratory Data Analysis
  2. Feature Selection Procedure
  3. Random Forest Modeling
  4. Model Validation
  5. Spatial Prediction of Carbon Stocks
  6. Ecological Interpretation and Hotspot Mapping

2. REQUIRED R PACKAGES

These packages are required for data manipulation, statistical analysis, and plotting throughout the modeling workflow.

library(agricolae)
library(akima)
library(Boruta) 
library(biscale)
library(car)
library(caret)
library(CAST)
library(coin)
library(cowplot)
library(data.table)
library(DescTools)
library(doParallel)
library(dplyr)
library(e1071)
library(emmeans)
library(extrafont)
library(foreach)
library(forcats)
library(ggplot2)
library(ggpubr)
library(ggRandomForests)
library(grid)
library(gridExtra)
library(gt)
library(here)
library(Hmisc)
library(Kendall)
library(magick)
library(ModelMap)
library(moments)
library(parallel)
library(pdp)
library(PerformanceAnalytics)
library(plot3D)
library(plotly)
library(plotmo)
library(psych)
library(purrr)
library(randomForest)
library(randomForestSRC)
library(raster)
library(RColorBrewer)
library(readr)
library(readxl)
library(rfUtilities)
library(rgdal)
library(sf)
library(sp)
library(stats)
library(tidyverse)
library(tools)
library(vivid)
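
All of the packages above are distributed through CRAN. As an optional convenience (not part of the original scripts), missing packages can be installed before loading; the pkgs vector below is abbreviated and should be extended with the full list:

# Optional convenience sketch: install any missing packages before loading.
# 'pkgs' is abbreviated here; extend it with the full list above as needed.
pkgs <- c("Boruta", "caret", "ggpubr", "Hmisc", "pdp", "psych",
          "randomForest", "randomForestSRC", "raster", "sf", "tidyverse")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
# Note: rgdal was archived on CRAN in 2023; on recent R installations the
# sf/terra ecosystem replaces rgdal/sp-based workflows.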

3. EXPLORATORY ANALYSIS (STAGE 1)

3.1. Load data

Important:

  • The file RF_data_Carbon_ForestFloor.csv is an example with a structure similar to the real data.
  • The original data is not shared publicly for confidentiality reasons.
  • The code was designed to work fully with this simulated version.

🔒 Confidentiality Notice: If you have questions or need access to the real data, you can contact the repository author:

Zaira Rosario Pérez-Vázquez

data_path <- here("data", "RF_data_Carbon_ForestFloor.csv")
datasetFF <- read_csv(data_path)
str(datasetFF)
## spc_tbl_ [918 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ YEAR            : num [1:918] 2013 2013 2013 2013 2013 ...
##  $ PSP             : num [1:918] 1 1 1 1 1 1 1 1 2 2 ...
##  $ SITE            : num [1:918] 1 1 2 2 3 3 4 4 1 1 ...
##  $ LAYER           : chr [1:918] "L" "FH" "L" "FH" ...
##  $ C_STOCKS        : num [1:918] 2.48 17.22 2.11 13.51 2.46 ...
##  $ UTMX            : num [1:918] 541296 541296 541296 541296 541335 ...
##  $ UTMY            : num [1:918] 2280448 2280448 2280498 2280498 2280421 ...
##  $ STAND_AGE       : num [1:918] 29 29 29 29 30 30 29 29 22 22 ...
##  $ BASAL_AREA      : num [1:918] 32.5 32.5 21.3 21.3 26.9 ...
##  $ DOM_HEIGHT      : num [1:918] 26.5 26.5 22.8 22.8 23.7 ...
##  $ SPECIES_RICHNESS: num [1:918] 10 10 9 9 12 12 8 8 4 4 ...
##  $ SHANNON_INDEX   : num [1:918] 1.5 1.5 1.97 1.97 2.14 ...
##  $ GAP_FRACTION    : num [1:918] 9.07 9.07 9.07 9.07 10.37 ...
##  $ CANOPY_COVER    : num [1:918] 90.9 90.9 90.9 90.9 89.6 ...
##  $ ELEVATION       : num [1:918] 2081 2081 2085 2085 2079 ...
##  $ SLOPE           : num [1:918] 4.06 4.06 5.1 5.1 4.69 ...
##  $ ASPECT          : num [1:918] 154 154 152 152 124 ...
##  $ CONDITION       : chr [1:918] "Managed" "Managed" "Managed" "Managed" ...
##  $ LAYER_NUM       : num [1:918] 1 2 1 2 1 2 1 2 1 2 ...
##  $ CON_NUM         : num [1:918] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   YEAR = col_double(),
##   ..   PSP = col_double(),
##   ..   SITE = col_double(),
##   ..   LAYER = col_character(),
##   ..   C_STOCKS = col_double(),
##   ..   UTMX = col_double(),
##   ..   UTMY = col_double(),
##   ..   STAND_AGE = col_double(),
##   ..   BASAL_AREA = col_double(),
##   ..   DOM_HEIGHT = col_double(),
##   ..   SPECIES_RICHNESS = col_double(),
##   ..   SHANNON_INDEX = col_double(),
##   ..   GAP_FRACTION = col_double(),
##   ..   CANOPY_COVER = col_double(),
##   ..   ELEVATION = col_double(),
##   ..   SLOPE = col_double(),
##   ..   ASPECT = col_double(),
##   ..   CONDITION = col_character(),
##   ..   LAYER_NUM = col_double(),
##   ..   CON_NUM = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

3.2. Data Preparation

The categorical variables are converted to factors to enable data grouping by sampling year, organic layer (L and FH), and forest management conditions.

datasetFF <- datasetFF %>%
  mutate(
    YEAR = as.factor(YEAR),
    LAYER = as.factor(LAYER),
    CONDITION = as.factor(CONDITION),
    LAYER_NUM = as.factor(LAYER_NUM),
    CON_NUM = as.factor(CON_NUM)
  )

3.3. Summary Statistics

To explore the distribution of carbon stocks in the forest floor, we computed descriptive statistics using the psych package.

  • describe() provides overall statistics for the C_STOCKS variable, including mean, standard deviation, minimum, maximum, skewness, and kurtosis.
  • describeBy() is used to calculate the same summary statistics, but grouped by sampling year and organic layer (L and FH), which helps us assess variation across both temporal and vertical gradients.
describe(datasetFF$C_STOCKS)
##    vars   n mean   sd median trimmed mad min   max range skew kurtosis   se
## X1    1 918 4.31 3.88   2.81    3.56 1.9   0 26.25 26.25  2.2     5.83 0.13
describeBy(datasetFF$C_STOCKS, list(datasetFF$YEAR, datasetFF$LAYER))
## 
##  Descriptive statistics by group 
## : 2013
## : FH
##    vars   n  mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 153 10.37 5.24   9.46   10.08 4.69   0 26.25 26.25 0.56     0.27 0.42
## ------------------------------------------------------------ 
## : 2018
## : FH
##    vars   n mean  sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 153  4.2 1.9   4.09    4.18 1.89   0 10.01 10.01  0.1     0.11 0.15
## ------------------------------------------------------------ 
## : 2023
## : FH
##    vars   n mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 153 5.13 2.31   5.18    5.07 2.16 0.94 14.32 13.38 0.58     1.26 0.19
## ------------------------------------------------------------ 
## : 2013
## : L
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 153 2.06 0.79   1.96       2 0.59 0.44 5.44  4.99 1.23     3.46 0.06
## ------------------------------------------------------------ 
## : 2018
## : L
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 153 1.89 0.58    1.8    1.84 0.52 0.8   4  3.21 0.83     0.99 0.05
## ------------------------------------------------------------ 
## : 2023
## : L
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 153 2.23 0.87   2.14    2.18 0.78 0.66 4.87   4.2 0.63     0.23 0.07

3.4. Data distribution (Histograms and Normality)

To assess the distribution of carbon stocks (C_STOCKS) across sampling years (2013, 2018, 2023) and organic layers (L and FH), we used histograms and normality diagnostics.

Histograms were generated using ggplot2, displaying the frequency distributions of C_STOCKS faceted by year and layer. This layout facilitates the identification of temporal and vertical patterns in carbon accumulation.
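
As a compact alternative to the per-year panels built below, the same year × layer layout can be produced with a single faceted plot; a minimal sketch, not part of the original scripts:

# Hypothetical faceted alternative to the per-year plots combined with ggarrange()
ggplot(datasetFF, aes(x = C_STOCKS)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 alpha = 0.3, fill = "lightgray", color = "darkgrey") +
  geom_density(color = "blue", linewidth = 0.5) +
  facet_grid(LAYER ~ YEAR, scales = "free") +
  labs(x = expression("Mg ha"^-1), y = "Density") +
  theme_bw()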

L layer

# Filter the data by year for the L layer
data2013_L <- subset(datasetFF, YEAR == "2013" & LAYER == "L")
data2018_L <- subset(datasetFF, YEAR == "2018" & LAYER == "L")
data2023_L <- subset(datasetFF, YEAR == "2023" & LAYER == "L")

FH layer

# Filter the data by year for the FH layer
data2013_FH <- subset(datasetFF, YEAR == "2013" & LAYER == "FH")
data2018_FH <- subset(datasetFF, YEAR == "2018" & LAYER == "FH")
data2023_FH <- subset(datasetFF, YEAR == "2023" & LAYER == "FH")

To evaluate the normality of carbon stock distributions in the L layer, we applied a combined visual and statistical approach. For each year, we produced histograms overlaid with both the kernel density estimate (blue line) and a theoretical normal distribution (red line), calculated using the observed mean and standard deviation. A dashed orange line indicates the mean value of each distribution.

3.4.1. Histograms - L layer

# Expression for x-axis label
Mgha_expression1 <- expression("Mg ha"^"-1")

# Plot for 2013 - L
HIST_CL13 <- ggplot(data2013_L, aes(x = C_STOCKS)) +
  geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=0.5, fill="lightgray") +
  stat_function(aes(color="theoretical normal distribution"),
                fun=dnorm, args=list(mean=mean(data2013_L$C_STOCKS), sd=sd(data2013_L$C_STOCKS)), size=0.5) +
  geom_density(aes(color="density"), size=0.5) +
  geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE), color="mean"), linetype="dashed", size=0.8) +
  scale_color_manual(name="Data distribution\n               L layer:",
                     values=c("density"="blue", "theoretical normal distribution"="red", "mean"="orange"),
                     labels=c("Observed density", "Mean value", "Theoretical normal")) +
  labs(x=Mgha_expression1, y="Density", title="2013") +
  annotate("text", x=4.85, y=0.68, label="p = 0.069", family="A") +
  theme_bw() +
  theme(text=element_text(size=12, family="A"),
        legend.position="bottom",
        legend.title=element_text(hjust=0.5, size=11, face="bold"),
        legend.text=element_text(size=11),
        plot.title=element_text(size=14, face="bold", hjust=0.5),
        axis.text=element_text(size=12))

# Plot for 2018 - L
HIST_CL18 <- ggplot(data2018_L, aes(x = C_STOCKS)) +
  geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=0.5, fill="lightgray") +
  stat_function(aes(color="theoretical normal distribution"),
                fun=dnorm, args=list(mean=mean(data2018_L$C_STOCKS), sd=sd(data2018_L$C_STOCKS)), size=0.5) +
  geom_density(aes(color="density"), size=0.5) +
  geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE), color="mean"), linetype="dashed", size=0.8) +
  scale_color_manual(name="Data distribution\n               L layer:",
                     values=c("density"="blue", "theoretical normal distribution"="red", "mean"="orange"),
                     labels=c("Observed density", "Mean value", "Theoretical normal")) +
  labs(x=Mgha_expression1, y="Density", title="2018") +
  annotate("text", x=4.85, y=0.68, label="p = 0.069", family="A") +
  theme_bw() +
  theme(text=element_text(size=12, family="A"),
        legend.position="bottom",
        legend.title=element_text(hjust=0.5, size=11, face="bold"),
        legend.text=element_text(size=11),
        plot.title=element_text(size=14, face="bold", hjust=0.5),
        axis.text=element_text(size=12))

# Plot for 2023 - L
HIST_CL23 <- ggplot(data2023_L, aes(x = C_STOCKS)) +
  geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=0.5, fill="lightgray") +
  stat_function(aes(color="theoretical normal distribution"),
                fun=dnorm, args=list(mean=mean(data2023_L$C_STOCKS), sd=sd(data2023_L$C_STOCKS)), size=0.5) +
  geom_density(aes(color="density"), size=0.5) +
  geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE), color="mean"), linetype="dashed", size=0.8) +
  scale_color_manual(name="Data distribution\n               L layer:",
                     values=c("density"="blue", "theoretical normal distribution"="red", "mean"="orange"),
                     labels=c("Observed density", "Mean value", "Theoretical normal")) +
  labs(x=Mgha_expression1, y="Density", title="2023") +
  annotate("text", x=4.85, y=0.68, label="p = 0.069", family="A") +
  theme_bw() +
  theme(text=element_text(size=12, family="A"),
        legend.position="bottom",
        legend.title=element_text(hjust=0.5, size=11, face="bold"),
        legend.text=element_text(size=11),
        plot.title=element_text(size=14, face="bold", hjust=0.5),
        axis.text=element_text(size=12))

# Combine the plots into a single panel
Supp1 <- ggarrange(HIST_CL13, HIST_CL18, HIST_CL23,
                   ncol=3, nrow=1,
                   common.legend=TRUE, legend="bottom")

# Display the panel
Supp1

3.4.2. Histograms - FH layer

# Plot for 2013 - FH
HIST_CFH13 <- ggplot(data2013_FH, aes(x = C_STOCKS)) +
    geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=4, fill="lightgray") +
    geom_density(color="blue", size=0.5) +
    labs(x=Mgha_expression1, y="Density", title="2013") +
    annotate("text", x=23.1, y=0.09, label="p = 0.394", family="A") +
    geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE)),   # ignore NA values for the mean
               color="orange", linetype="dashed", size=0.8) +
    stat_function(fun=dnorm, args=list(mean=mean(data2013_FH$C_STOCKS),
                                       sd=sd(data2013_FH$C_STOCKS)), color="red", size=0.5) +
    theme_bw() +
    theme(text=element_text(size=11, family="A"),
          plot.title=element_text(size=14, face="bold", hjust=0.5),
          axis.text=element_text(size=11))

# Plot for 2018 - FH
HIST_CFH18 <- ggplot(data2018_FH, aes(x = C_STOCKS)) +
    geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=1.5, fill="lightgray") +
    geom_density(color="blue", size=0.5) +
    labs(x=Mgha_expression1, y="Density", title="2018") +
    annotate("text", x=8.6, y=0.22, label="p = 0.825", family="A") +
    geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE)),   # ignore NA values for the mean
               color="orange", linetype="dashed", size=0.8) +
    stat_function(fun=dnorm, args=list(mean=mean(data2018_FH$C_STOCKS),
                                       sd=sd(data2018_FH$C_STOCKS)), color="red", size=0.5) +
    theme_bw() +
    theme(text=element_text(size=11, family="A"),
          plot.title=element_text(size=14, face="bold", hjust=0.5),
          axis.text=element_text(size=11))

# Plot for 2023 - FH
HIST_CFH23 <- ggplot(data2023_FH, aes(x = C_STOCKS)) +
    geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=2.5, fill="lightgray") +
    geom_density(color="blue", size=0.5) +
    labs(x=Mgha_expression1, y="Density", title="2023") +
    annotate("text", x=13, y=0.18, label="p = 0.788", family="A") +
    geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE)),   # ignore NA values for the mean
               color="orange", linetype="dashed", size=0.8) +
    stat_function(fun=dnorm, args=list(mean=mean(data2023_FH$C_STOCKS),
                                       sd=sd(data2023_FH$C_STOCKS)), color="red", size=0.5) +
    theme_bw() +
    theme(text=element_text(size=11, family="A"),
          plot.title=element_text(size=14, face="bold", hjust=0.5),
          axis.text=element_text(size=11))

# Combine the plots into a single panel
Supp2 <- ggarrange(HIST_CFH13, HIST_CFH18, HIST_CFH23,
                   ncol = 3, nrow = 1,
                   common.legend = TRUE, legend = "bottom")

# Display the panel
Supp2

3.4.3. Normality tests

Additionally, the one-sample Kolmogorov-Smirnov test was used to formally assess whether the data deviated significantly from a normal distribution. These results were visualized separately per year and integrated into a single panel for clarity.

In all cases, the tests returned p-values greater than 0.05, indicating that the null hypothesis of normality could not be rejected. These results suggest that the carbon stock data for both layers are reasonably consistent with a normal distribution across the evaluated years.

L layer

ks.test(data2013_L$C_STOCKS, 
        pnorm, 
        mean(data2013_L$C_STOCKS), 
        sd(data2013_L$C_STOCKS))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data2013_L$C_STOCKS
## D = 0.1049, p-value = 0.06897
## alternative hypothesis: two-sided
ks.test(data2018_L$C_STOCKS,pnorm,mean(data2018_L$C_STOCKS),sd(data2018_L$C_STOCKS))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data2018_L$C_STOCKS
## D = 0.092277, p-value = 0.1477
## alternative hypothesis: two-sided
ks.test(data2023_L$C_STOCKS,pnorm,mean(data2023_L$C_STOCKS),sd(data2023_L$C_STOCKS))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data2023_L$C_STOCKS
## D = 0.069741, p-value = 0.4463
## alternative hypothesis: two-sided

FH layer

ks.test(data2013_FH$C_STOCKS, 
        pnorm, 
        mean(data2013_FH$C_STOCKS), 
        sd(data2013_FH$C_STOCKS))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data2013_FH$C_STOCKS
## D = 0.072686, p-value = 0.394
## alternative hypothesis: two-sided
ks.test(data2018_FH$C_STOCKS,pnorm,mean(data2018_FH$C_STOCKS),sd(data2018_FH$C_STOCKS))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data2018_FH$C_STOCKS
## D = 0.050761, p-value = 0.8254
## alternative hypothesis: two-sided
ks.test(data2023_FH$C_STOCKS,pnorm,mean(data2023_FH$C_STOCKS),sd(data2023_FH$C_STOCKS))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data2023_FH$C_STOCKS
## D = 0.052754, p-value = 0.7881
## alternative hypothesis: two-sided
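
As a complementary check (not part of the original workflow), the Shapiro-Wilk test can be applied to the same subsets; a minimal sketch:

# Illustrative complementary normality check per year/layer subset
shapiro.test(data2013_L$C_STOCKS)
shapiro.test(data2013_FH$C_STOCKS)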

3.5. Spearman Correlations

To explore monotonic relationships between carbon stocks and a set of biotic and abiotic variables, we computed Spearman’s rank correlation coefficients for the year 2013. This non-parametric method is suitable for assessing correlations without assuming normality or linearity in the data.

The analysis included spatial coordinates (UTMX, UTMY), stand structure variables (STAND_AGE, BASAL_AREA, DOM_HEIGHT), biodiversity metrics (SPECIES_RICHNESS, SHANNON_INDEX), canopy features (GAP_FRACTION, CANOPY_COVER), and topographic attributes (ELEVATION, SLOPE, ASPECT).

The output provides:

  • Spearman’s ρ (rho) values, indicating the strength and direction of associations (from –1 to +1).
  • p-values, indicating whether correlations are statistically significant.

Values close to ±1 represent strong associations. Significant positive or negative correlations with carbon stocks (C_STOCKS) suggest variables that may influence or co-vary with organic carbon accumulation in the forest floor during that year.

Example for 2013

data2013 <- datasetFF %>% filter(YEAR == "2013")
corr_vars <- data2013 %>% select(C_STOCKS, UTMX, UTMY, STAND_AGE, BASAL_AREA, DOM_HEIGHT, SPECIES_RICHNESS, SHANNON_INDEX, GAP_FRACTION, CANOPY_COVER, ELEVATION, SLOPE, ASPECT)
rcorr(as.matrix(corr_vars), type = "spearman")
##                  C_STOCKS  UTMX  UTMY STAND_AGE BASAL_AREA DOM_HEIGHT
## C_STOCKS             1.00  0.02  0.08      0.12       0.19       0.19
## UTMX                 0.02  1.00 -0.07      0.04       0.32       0.00
## UTMY                 0.08 -0.07  1.00      0.13       0.17       0.50
## STAND_AGE            0.12  0.04  0.13      1.00       0.56       0.53
## BASAL_AREA           0.19  0.32  0.17      0.56       1.00       0.59
## DOM_HEIGHT           0.19  0.00  0.50      0.53       0.59       1.00
## SPECIES_RICHNESS     0.12  0.18 -0.03      0.49       0.53       0.42
## SHANNON_INDEX        0.11  0.13  0.08      0.54       0.43       0.47
## GAP_FRACTION        -0.12 -0.36 -0.05     -0.51      -0.72      -0.42
## CANOPY_COVER         0.12  0.36  0.05      0.51       0.72       0.42
## ELEVATION            0.07 -0.42  0.65      0.01      -0.02       0.15
## SLOPE               -0.07  0.07 -0.29     -0.13      -0.12      -0.01
## ASPECT               0.03  0.13  0.11      0.10       0.17       0.12
##                  SPECIES_RICHNESS SHANNON_INDEX GAP_FRACTION CANOPY_COVER
## C_STOCKS                     0.12          0.11        -0.12         0.12
## UTMX                         0.18          0.13        -0.36         0.36
## UTMY                        -0.03          0.08        -0.05         0.05
## STAND_AGE                    0.49          0.54        -0.51         0.51
## BASAL_AREA                   0.53          0.43        -0.72         0.72
## DOM_HEIGHT                   0.42          0.47        -0.42         0.42
## SPECIES_RICHNESS             1.00          0.87        -0.61         0.61
## SHANNON_INDEX                0.87          1.00        -0.59         0.59
## GAP_FRACTION                -0.61         -0.59         1.00        -1.00
## CANOPY_COVER                 0.61          0.59        -1.00         1.00
## ELEVATION                   -0.13         -0.03        -0.01         0.01
## SLOPE                       -0.03         -0.07         0.19        -0.19
## ASPECT                       0.16          0.14        -0.29         0.29
##                  ELEVATION SLOPE ASPECT
## C_STOCKS              0.07 -0.07   0.03
## UTMX                 -0.42  0.07   0.13
## UTMY                  0.65 -0.29   0.11
## STAND_AGE             0.01 -0.13   0.10
## BASAL_AREA           -0.02 -0.12   0.17
## DOM_HEIGHT            0.15 -0.01   0.12
## SPECIES_RICHNESS     -0.13 -0.03   0.16
## SHANNON_INDEX        -0.03 -0.07   0.14
## GAP_FRACTION         -0.01  0.19  -0.29
## CANOPY_COVER          0.01 -0.19   0.29
## ELEVATION             1.00 -0.57   0.05
## SLOPE                -0.57  1.00  -0.07
## ASPECT                0.05 -0.07   1.00
## 
## n= 306 
## 
## 
## P
##                  C_STOCKS UTMX   UTMY   STAND_AGE BASAL_AREA DOM_HEIGHT
## C_STOCKS                  0.6860 0.1840 0.0379    0.0008     0.0008    
## UTMX             0.6860          0.2476 0.5141    0.0000     0.9740    
## UTMY             0.1840   0.2476        0.0204    0.0030     0.0000    
## STAND_AGE        0.0379   0.5141 0.0204           0.0000     0.0000    
## BASAL_AREA       0.0008   0.0000 0.0030 0.0000               0.0000    
## DOM_HEIGHT       0.0008   0.9740 0.0000 0.0000    0.0000               
## SPECIES_RICHNESS 0.0389   0.0016 0.6289 0.0000    0.0000     0.0000    
## SHANNON_INDEX    0.0457   0.0259 0.1498 0.0000    0.0000     0.0000    
## GAP_FRACTION     0.0423   0.0000 0.3943 0.0000    0.0000     0.0000    
## CANOPY_COVER     0.0423   0.0000 0.3943 0.0000    0.0000     0.0000    
## ELEVATION        0.2192   0.0000 0.0000 0.9044    0.7036     0.0076    
## SLOPE            0.2284   0.2116 0.0000 0.0204    0.0423     0.8439    
## ASPECT           0.6386   0.0283 0.0460 0.0715    0.0022     0.0403    
##                  SPECIES_RICHNESS SHANNON_INDEX GAP_FRACTION CANOPY_COVER
## C_STOCKS         0.0389           0.0457        0.0423       0.0423      
## UTMX             0.0016           0.0259        0.0000       0.0000      
## UTMY             0.6289           0.1498        0.3943       0.3943      
## STAND_AGE        0.0000           0.0000        0.0000       0.0000      
## BASAL_AREA       0.0000           0.0000        0.0000       0.0000      
## DOM_HEIGHT       0.0000           0.0000        0.0000       0.0000      
## SPECIES_RICHNESS                  0.0000        0.0000       0.0000      
## SHANNON_INDEX    0.0000                         0.0000       0.0000      
## GAP_FRACTION     0.0000           0.0000                     0.0000      
## CANOPY_COVER     0.0000           0.0000        0.0000                   
## ELEVATION        0.0261           0.5621        0.8946       0.8946      
## SLOPE            0.6487           0.2398        0.0008       0.0008      
## ASPECT           0.0054           0.0129        0.0000       0.0000      
##                  ELEVATION SLOPE  ASPECT
## C_STOCKS         0.2192    0.2284 0.6386
## UTMX             0.0000    0.2116 0.0283
## UTMY             0.0000    0.0000 0.0460
## STAND_AGE        0.9044    0.0204 0.0715
## BASAL_AREA       0.7036    0.0423 0.0022
## DOM_HEIGHT       0.0076    0.8439 0.0403
## SPECIES_RICHNESS 0.0261    0.6487 0.0054
## SHANNON_INDEX    0.5621    0.2398 0.0129
## GAP_FRACTION     0.8946    0.0008 0.0000
## CANOPY_COVER     0.8946    0.0008 0.0000
## ELEVATION                  0.0000 0.4161
## SLOPE            0.0000           0.2223
## ASPECT           0.4161    0.2223
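
For a visual overview of the same relationships, the PerformanceAnalytics package (loaded in Section 2) can draw a scatterplot matrix with Spearman coefficients in one call; a minimal sketch, not part of the original output:

# Scatterplot matrix with Spearman correlations and histograms
# (illustrative alternative view of the rcorr() output above)
chart.Correlation(as.matrix(corr_vars), histogram = TRUE, method = "spearman")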

3.6. Variance Inflation Factor (VIF)

To assess multicollinearity among predictor variables, we computed the Variance Inflation Factor (VIF) using a multiple linear regression model, with the year 2013 as an example. The model included the spatial, structural, biodiversity, canopy, and topographic predictors associated with carbon stocks (C_STOCKS); CANOPY_COVER was excluded because it is perfectly (negatively) correlated with GAP_FRACTION (ρ = −1.00; Section 3.5), and the resulting aliased coefficients would make the VIF undefined.

VIF quantifies how much the variance of a regression coefficient is inflated due to linear correlations with other predictors. It is used to detect redundancy in explanatory variables.

  • A VIF > 5 suggests moderate multicollinearity.
  • A VIF > 10 is generally considered problematic and may require variable removal or transformation.

The results help identify potential multicollinearity issues that could affect the stability and interpretability of model coefficients in subsequent modeling steps.

Example for 2013

modelC_2013VIF <- lm(C_STOCKS ~ UTMX + UTMY + STAND_AGE + BASAL_AREA + DOM_HEIGHT + SPECIES_RICHNESS + SHANNON_INDEX + GAP_FRACTION + ELEVATION + SLOPE + ASPECT, data = data2013)
vif(modelC_2013VIF)
##             UTMX             UTMY        STAND_AGE       BASAL_AREA 
##         1.969658         2.562707         1.380057         4.162700 
##       DOM_HEIGHT SPECIES_RICHNESS    SHANNON_INDEX     GAP_FRACTION 
##         2.317511         5.625841         5.966160         3.798482 
##        ELEVATION            SLOPE           ASPECT 
##         3.869923         1.694583         1.105195
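
A small helper (not part of the original scripts) can flag each predictor against the thresholds listed above:

# Flag each predictor against the VIF thresholds (>5 moderate, >10 problematic)
vif_vals <- vif(modelC_2013VIF)
data.frame(VIF  = round(vif_vals, 2),
           flag = cut(vif_vals, breaks = c(0, 5, 10, Inf),
                      labels = c("ok", "moderate", "problematic")))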

3.7. Train/Test Split

Before fitting the Random Forest models, we split the dataset into training and validation subsets to evaluate model performance on unseen data. This step is critical to avoid overfitting and ensure the generalizability of the model.

We used a random sampling approach for the year 2013, allocating:

  • 85% of the data for training, used to fit the model.
  • 15% for validation, used to assess prediction accuracy.

Stratification by organic layer (LAYER) was applied to maintain the proportion of L and FH samples across both subsets. This helps preserve the distributional characteristics of each layer during model training and evaluation.

Example for 2013

#set.seed(123)
# Stratified split by LAYER (85% train / 15% test), so the L/FH
# proportions are preserved in both subsets
#train_idx <- createDataPartition(data2013$LAYER, p = 0.85, list = FALSE)
#training_2013 <- data2013[train_idx, ]
#testing_2013  <- data2013[-train_idx, ]

#write_csv(training_2013, here("data", "training_2013_85.csv"))
#write_csv(testing_2013, here("data", "testing_2013_15.csv"))
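
A quick check (commented like the split itself, and assuming the objects above exist) confirms that the L/FH proportions are preserved in both subsets:

# Verify stratification: layer proportions should be nearly identical
#prop.table(table(training_2013$LAYER))
#prop.table(table(testing_2013$LAYER))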

3.8. Conclusion

This exploratory analysis identified key variables, distributional patterns, and correlation structures to inform future Random Forest modeling stages. All outputs are saved in the /outputs folder, and this report is rendered to /docs for easy viewing.

4. FEATURE SELECTION PROCEDURE

To improve model performance and interpretability, we performed a feature selection procedure (FSP) prior to model training. This step helps to identify the most relevant predictor variables while reducing noise and minimizing redundancy among features.

The procedure was applied using the training subset only, to avoid information leakage and ensure proper model validation. Selection was based on the Boruta algorithm (Section 4.2), which compares the importance of each real predictor against that of randomized shadow features.

By filtering out less informative or redundant predictors, this step ensures that the final model remains parsimonious, robust, and interpretable.

4.1. Load train/test datasets

In this section, we load the training and testing datasets previously created during the data splitting stage (Section 3.7). The subsets were saved as .csv files to ensure reproducibility and consistency across different modeling steps.

As an example, we load the datasets corresponding to the year 2013, which were split with an 85/15 ratio for training and validation, respectively. After loading, we ensure that categorical variables (YEAR, LAYER, and CONDITION) are properly defined as factors, which is necessary for model fitting and stratified evaluation.

# Load train and test datasets for 2013
data_train2013 <- read_csv(here("data", "training_2013_85.csv"))
data_test2013  <- read_csv(here("data", "testing_2013_15.csv"))

# Convert grouping variables to factors
data_train2013$YEAR      <- as.factor(data_train2013$YEAR)
data_train2013$LAYER     <- as.factor(data_train2013$LAYER)
data_train2013$CONDITION <- as.factor(data_train2013$CONDITION)

data_test2013$YEAR       <- as.factor(data_test2013$YEAR)
data_test2013$LAYER      <- as.factor(data_test2013$LAYER)
data_test2013$CONDITION  <- as.factor(data_test2013$CONDITION)

#Select covariates
names(data_train2013)
##  [1] "YEAR"             "PSP"              "SITE"             "LAYER"           
##  [5] "C_STOCKS"         "UTMX"             "UTMY"             "STAND_AGE"       
##  [9] "BASAL_AREA"       "DOM_HEIGHT"       "SPECIES_RICHNESS" "SHANNON_INDEX"   
## [13] "GAP_FRACTION"     "CANOPY_COVER"     "ELEVATION"        "SLOPE"           
## [17] "ASPECT"           "CONDITION"        "LAYER_NUM"        "CON_NUM"
data2013_formodels <- data.frame(data_train2013[c(1, 4, 5, 6:18)])
names(data2013_formodels) <- c("YEAR", "FFLAYER", "C",
                               "UTMX", "UTMY",
                               "AGE", "BA", "DTH", "RICH",
                               "H", "GAP", "COV", "ELE", "SLO", "ASP", "MAN")

4.2. Boruta algorithm (ML tool of RF)

To identify the most relevant predictors and avoid overfitting due to the inclusion of irrelevant variables, we applied the Boruta algorithm, a wrapper method specifically designed for feature selection in machine learning models—particularly effective with Random Forest (RF).

Boruta works by comparing the importance of real features with that of randomized (shadow) features. It iteratively removes variables that consistently show lower importance than the best-performing shadow attribute. This ensures that only variables that are statistically significantly more informative than noise are retained.

We tested Boruta’s performance using different values of maxRuns (100, 500, and 1000) to evaluate the stability and robustness of the selected features. All runs used consistent model parameters (ntree = 500, mtry = 4, pValue = 0.5), and were performed using the training data for 2013 (data2013_formodels).

# MaxRuns = 100
borutaRF_traindf_C13_100runs <- Boruta(C ~ UTMX + UTMY + ELE + SLO + ASP + AGE + DTH + BA +
                                         GAP + COV + RICH + H + MAN + FFLAYER,
                                       data = data2013_formodels,
                                       doTrace = 3, pValue = 0.5, ntree = 500, mtry = 4)
print(borutaRF_traindf_C13_100runs$finalDecision)
##   FFLAYER       MAN         H      RICH       COV       GAP        BA       DTH 
## Confirmed  Rejected  Rejected  Rejected Confirmed Confirmed Confirmed Confirmed 
##       AGE       ASP       SLO       ELE      UTMY      UTMX 
##  Rejected  Rejected  Rejected  Rejected  Rejected  Rejected 
## Levels: Tentative Confirmed Rejected
BORzscores_traindf_C13_100runs <- attStats(borutaRF_traindf_C13_100runs)

# MaxRuns = 500
borutaRF_traindf_C13_500runs <- Boruta(C ~ UTMX + UTMY + ELE + SLO + ASP + AGE + DTH + BA +
                                         GAP + COV + RICH + H + MAN + FFLAYER,
                                       data = data2013_formodels,
                                       doTrace = 3, pValue = 0.5, ntree = 500, mtry = 4, maxRuns = 500)
print(borutaRF_traindf_C13_500runs$finalDecision)
##   FFLAYER       MAN         H      RICH       COV       GAP        BA       DTH 
## Confirmed  Rejected  Rejected  Rejected Confirmed Confirmed Confirmed Confirmed 
##       AGE       ASP       SLO       ELE      UTMY      UTMX 
## Tentative  Rejected  Rejected  Rejected  Rejected  Rejected 
## Levels: Tentative Confirmed Rejected
BORzscores_traindf_C13_500runs <- attStats(borutaRF_traindf_C13_500runs)
plot(borutaRF_traindf_C13_500runs)

# MaxRuns = 1000
borutaRF_traindf_C13_1000runs <- Boruta(C ~ UTMX + UTMY + ELE + SLO + ASP + AGE + DTH + BA +
                                          GAP + COV + RICH + H + MAN + FFLAYER,
                                        data = data2013_formodels,
                                        doTrace = 3, pValue = 0.5, ntree = 500, mtry = 4, maxRuns = 1000)
print(borutaRF_traindf_C13_1000runs$finalDecision)
##   FFLAYER       MAN         H      RICH       COV       GAP        BA       DTH 
## Confirmed  Rejected  Rejected Tentative Confirmed Confirmed Confirmed Confirmed 
##       AGE       ASP       SLO       ELE      UTMY      UTMX 
##  Rejected  Rejected  Rejected  Rejected  Rejected  Rejected 
## Levels: Tentative Confirmed Rejected
BORzscores_traindf_C13_1000runs <- attStats(borutaRF_traindf_C13_1000runs)
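
Where attributes remain Tentative after the maximum number of runs (RICH in the run above, AGE at maxRuns = 500), the Boruta package offers TentativeRoughFix(), which resolves them by comparing their median importance with that of the best shadow attribute. A minimal sketch, not part of the original workflow:

# Optional: resolve remaining Tentative attributes by median importance
borutaFixed_C13 <- TentativeRoughFix(borutaRF_traindf_C13_1000runs)
print(borutaFixed_C13$finalDecision)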

4.2.1. Comparison of Feature Importance Across Boruta Runs

The table below summarizes the final decision and mean importance score for each predictor across the three Boruta runs (maxRuns = 100, 500, and 1000). The graph provides a visual comparison of the average importance of each variable, helping to evaluate the consistency of Boruta’s selection process.

# Extract statistics from Boruta runs
BOR100 <- attStats(borutaRF_traindf_C13_100runs)
BOR500 <- attStats(borutaRF_traindf_C13_500runs)
BOR1000 <- attStats(borutaRF_traindf_C13_1000runs)

# Add run identifier
BOR100$Variable <- rownames(BOR100)
BOR500$Variable <- rownames(BOR500)
BOR1000$Variable <- rownames(BOR1000)
BOR100$Run <- "maxRuns = 100"
BOR500$Run <- "maxRuns = 500"
BOR1000$Run <- "maxRuns = 1000"

# Combine all runs
BOR_all <- rbind(BOR100, BOR500, BOR1000)

# Select and reshape for table
library(dplyr)
library(tidyr)
bor_table <- BOR_all %>%
  select(Variable, meanImp, decision, Run) %>%
  pivot_wider(names_from = Run, values_from = c(meanImp, decision))

bor_table
## # A tibble: 14 × 7
##    Variable meanImp_maxRuns = 10…¹ meanImp_maxRuns = 50…² meanImp_maxRuns = 10…³
##    <chr>                     <dbl>                  <dbl>                  <dbl>
##  1 FFLAYER                 65.2                    78.9                  79.3   
##  2 MAN                     -0.0175                  0.212                 0.720 
##  3 H                        0.384                   1.09                  1.87  
##  4 RICH                     0.749                   0.666                 2.26  
##  5 COV                      4.92                    6.14                  6.36  
##  6 GAP                      5.25                    6.03                  6.22  
##  7 BA                       9.75                   11.5                  11.5   
##  8 DTH                      8.22                    8.16                  8.65  
##  9 AGE                      1.80                    2.23                  1.37  
## 10 ASP                     -1.17                   -0.725                -1.64  
## 11 SLO                     -0.0349                 -0.583                -0.439 
## 12 ELE                      0.488                   0.583                 0.138 
## 13 UTMY                     0.300                  -0.629                -0.572 
## 14 UTMX                    -0.0370                 -1.61                 -0.0424
## # ℹ abbreviated names: ¹​`meanImp_maxRuns = 100`, ²​`meanImp_maxRuns = 500`,
## #   ³​`meanImp_maxRuns = 1000`
## # ℹ 3 more variables: `decision_maxRuns = 100` <fct>,
## #   `decision_maxRuns = 500` <fct>, `decision_maxRuns = 1000` <fct>
# Create importance plot
ggplot(BOR_all, aes(x = Variable, y = meanImp, color = Run, group = Run)) +
  geom_line(linewidth = 0.7) +
  geom_point(size = 2) +
  labs(title = "Comparison of Mean Importance Across Boruta Runs",
       x = "Predictor Variable",
       y = "Mean Importance Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This code:

  • Extracts the results of each run with attStats()
  • Builds a wide comparison table
  • Generates a line plot to visualize the stability of the mean importance

4.2.2. Confirmed Features: Variability Based on Importance Range

The figure below focuses on Confirmed predictors across Boruta runs. Since the Boruta output does not include the standard deviation (sdImp) by default, we represent importance variability using the range between the minimum and maximum importance values observed for each variable.

This provides a conservative estimate of how stable each variable’s importance was during the selection process.

# Reuse the combined and labeled Boruta results (BOR_all) from Section 4.2.1

# Filter confirmed variables only
library(dplyr)
library(ggplot2)

BOR_confirmed <- BOR_all %>%
  filter(decision == "Confirmed")

# Plot using minImp and maxImp as error bars
ggplot(BOR_confirmed, aes(x = Variable, y = meanImp, fill = Run)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
  geom_errorbar(aes(ymin = minImp, ymax = maxImp),
                position = position_dodge(width = 0.7), width = 0.3) +
  labs(title = "Confirmed Features: Importance Range Across Boruta Runs",
       x = "Predictor Variable", y = "Mean Importance (min–max range)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

5. RANDOM FOREST MODELING

This section presents the core modeling framework using the Random Forest (RF) algorithm. The steps below were implemented in R with the randomForest package, orchestrated through caret, and applied to the training data previously filtered and prepared (the object data2013_formodels).

The workflow proceeds in four stages: defining graphical expressions for plotting (Section 5.1), tuning hyperparameters (Section 5.2), applying the Random Forest model (Section 5.3), and the detailed modeling procedure documented in the accompanying script (Section 5.4).

Each modeling step is documented with code, diagnostics, and interpretation, following best practices in machine learning for environmental applications.

5.1. Graphical expressions

To ensure consistent and readable labeling across all plots in this section, we defined custom expressions for axis titles and figure annotations using expression() in R. These expressions allow the inclusion of mathematical notation (e.g., superscripts, bold text, and subscripts) commonly used in scientific reports.

# Axis and plot label expressions
Mgha_expression   <- expression("C stocks (Mg ha"^"-1"~")")
Mgha_expression2  <- expression(bold("a) C models (Mg ha"^"-1"~")"))
Mgha_expression3  <- expression("Mg ha"^"-1")

# Model label expressions for plot annotations or legends
initialmodel_expression <- expression(paste("RF"[0], " (Default values)"))
finalmodel_expression   <- expression(paste("RF"[1], " (Optimized values)"))

5.2. Tuning hyperparameters

To improve the performance of the Random Forest (RF) model, we conducted a hyperparameter tuning procedure using a custom modeling function in combination with the caret package. Specifically, we tuned two key hyperparameters:

  • ntree: the number of trees to grow in the forest.
  • mtry: the number of variables randomly sampled at each split.

We created a custom model structure (customRF) to enable tuning of both parameters simultaneously, as the default implementation of caret::train() does not allow tuning ntree directly.

A grid search was performed over combinations of ntree values (from 500 to 1000) and mtry values (from 1 to 12). The tuning process was evaluated using repeated 10-fold cross-validation with 5 repetitions (repeatedcv) to reduce variance in performance estimates.

This procedure was applied using the 2013 training data (data_train2013), selecting only the relevant predictors for modeling. The resulting hyperparameter configuration was used for model optimization in the following sections.

# Define custom Random Forest model structure
customRF <- list(type = "Regression",
                 library = "randomForest",
                 loop = NULL)

customRF$parameters <- data.frame(parameter = c("ntree", "mtry"),
                                  class = rep("numeric", 2),
                                  label = c("ntree", "mtry"))

customRF$grid <- function(x, y, len = NULL, search = "grid") {}

customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)
}

customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
  predict(modelFit, newdata)
}

customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
  predict(modelFit, newdata, type = "prob")
}

customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes

# Define tuning grid
set.seed(12345)
tunegrid_all <- expand.grid(.ntree = c(500, 600, 700, 800, 900, 1000),
                            .mtry  = c(1:12))

# Define training control using repeated 10-fold CV
controltenfold_5rep <- trainControl(method = "repeatedcv",
                                    number = 10,
                                    repeats = 5,
                                    search = 'grid')

# Select predictors from training and test sets
all_Predictors2013_train <- data_train2013[c(4,8:18)]
all_Predictors2013_test <- data_test2013[c(4,8:18)]

5.3. Random Forest application

After defining the custom Random Forest structure and outlining the hyperparameter grid, we applied the model to the 2013 training data using the train() function from the caret package.

The model was trained using repeated 10-fold cross-validation with 5 repetitions (repeatedcv) to ensure stable performance estimates. The optimization metric was the Root Mean Square Error (RMSE), which reflects the accuracy of continuous predictions.

The output includes:

  • The best combination of hyperparameters (ntree and mtry)
  • The final Random Forest model object
  • A detailed summary of performance metrics for all hyperparameter combinations tested

# Apply Random Forest model using caret and custom settings
best_hyperpars_C2013_5rep <- train(x = all_Predictors2013_train,
                                   y = data_train2013$C_STOCKS,
                                   method = customRF,
                                   metric = "RMSE",
                                   tuneGrid = tunegrid_all,
                                   trControl = controltenfold_5rep)

# View summary of results
best_hyperpars_C2013_5rep
best_hyperpars_C2013_5rep$finalModel
View(best_hyperpars_C2013_5rep$results)
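
Assuming the train() call above has completed, the selected hyperparameters and the shape of the tuning surface can be inspected directly from the caret object:

# Best ntree/mtry combination selected by repeated cross-validation
best_hyperpars_C2013_5rep$bestTune

# RMSE profile across the full tuning grid
plot(best_hyperpars_C2013_5rep)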

5.4. Detailed modeling procedure

Due to the extensive time required to run the full script, see the /scripts folder to follow the remaining modeling steps:

3_RandomForest_Example.R

This script performs the RF modeling stage (Stage 3 of the workflow). It includes:

  • Fitting the RF model to the C stocks data (2013)
  • Tuning hyperparameters
    • The best combination was re-evaluated as the number of iterations increased, in order to detect error stability
    • We tested 5, 10, 15, 20, 25, 50, 75, and 100 cross-validation repetitions
    • Error stability metrics were assessed with the nonparametric Friedman test
    • This script constitutes an optimization framework for RF modeling
  • Assessing model uncertainty
    • Definition of observed vs. predicted values
  • Model validation
    • Cross-validation metrics
    • Error stabilization assessment
  • Spatial predictions
    • Map creation
    • SD maps
  • Variable importance
    • Comparison of variable importance
    • Partial plots and ecological analysis (see the sketch below)
    • 3D Plotly graphs - nonlinear relationships between the dependent and predictor variables
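
As an illustration of the partial-plot step listed above, the sketch below fits a hypothetical randomForest model (rf_fit, not an object from the original scripts) on the 2013 training objects from Section 5.2 and draws a partial dependence curve with the pdp package:

# Hypothetical illustration of the partial-dependence step in
# 3_RandomForest_Example.R; 'rf_fit' is fitted here only as an example.
set.seed(12345)
rf_fit <- randomForest(x = as.data.frame(all_Predictors2013_train),
                       y = data_train2013$C_STOCKS,
                       ntree = 500)

# Partial dependence of C stocks on basal area
pdp::partial(rf_fit, pred.var = "BASAL_AREA",
             train = as.data.frame(all_Predictors2013_train), plot = TRUE)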