This technical report outlines the Random Forest modeling workflow used to map carbon stocks in the organic soil layers of the forest floor (L and FH horizons). The process is divided into six stages, each documented in a separate script in the /scripts folder.
These packages are required for data manipulation, statistical analysis, and plotting throughout the modeling workflow.
library(agricolae)
library(akima)
library(Boruta)
library(biscale)
library(car)
library(caret)
library(CAST)
library(coin)
library(cowplot)
library(data.table)
library(DescTools)
library(doParallel)
library(dplyr)
library(e1071)
library(emmeans)
library(extrafont)
library(foreach)
library(forcats)
library(ggplot2)
library(ggpubr)
library(ggRandomForests)
library(grid)
library(gridExtra)
library(gt)
library(here)
library(Hmisc)
library(Kendall)
library(magick)
library(ModelMap)
library(moments)
library(parallel)
library(pdp)
library(PerformanceAnalytics)
library(plot3D)
library(plotmo)
library(psych)
library(purrr)
library(randomForest)
library(randomForestSRC)
library(raster)
library(RColorBrewer)
library(readr)
library(readxl)
library(rfUtilities)
library(rgdal)
library(sf)
library(sp)
library(stats)
library(tidyverse)
library(tools)
library(plotly)
library(vivid)
Important: The file RF_data_Carbon_ForestFloor.csv is an example with a structure similar to the real data.
🔒 Confidentiality Notice: If you have questions or need access to the real data, contact the repository author:
Zaira Rosario Pérez-Vázquez
zairpv@gmail.com
# Load the example dataset from the /data folder and inspect its structure
data_path <- here("data", "RF_data_Carbon_ForestFloor.csv")
datasetFF <- read_csv(data_path)
str(datasetFF)
## spc_tbl_ [918 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ YEAR : num [1:918] 2013 2013 2013 2013 2013 ...
## $ PSP : num [1:918] 1 1 1 1 1 1 1 1 2 2 ...
## $ SITE : num [1:918] 1 1 2 2 3 3 4 4 1 1 ...
## $ LAYER : chr [1:918] "L" "FH" "L" "FH" ...
## $ C_STOCKS : num [1:918] 2.48 17.22 2.11 13.51 2.46 ...
## $ UTMX : num [1:918] 541296 541296 541296 541296 541335 ...
## $ UTMY : num [1:918] 2280448 2280448 2280498 2280498 2280421 ...
## $ STAND_AGE : num [1:918] 29 29 29 29 30 30 29 29 22 22 ...
## $ BASAL_AREA : num [1:918] 32.5 32.5 21.3 21.3 26.9 ...
## $ DOM_HEIGHT : num [1:918] 26.5 26.5 22.8 22.8 23.7 ...
## $ SPECIES_RICHNESS: num [1:918] 10 10 9 9 12 12 8 8 4 4 ...
## $ SHANNON_INDEX : num [1:918] 1.5 1.5 1.97 1.97 2.14 ...
## $ GAP_FRACTION : num [1:918] 9.07 9.07 9.07 9.07 10.37 ...
## $ CANOPY_COVER : num [1:918] 90.9 90.9 90.9 90.9 89.6 ...
## $ ELEVATION : num [1:918] 2081 2081 2085 2085 2079 ...
## $ SLOPE : num [1:918] 4.06 4.06 5.1 5.1 4.69 ...
## $ ASPECT : num [1:918] 154 154 152 152 124 ...
## $ CONDITION : chr [1:918] "Managed" "Managed" "Managed" "Managed" ...
## $ LAYER_NUM : num [1:918] 1 2 1 2 1 2 1 2 1 2 ...
## $ CON_NUM : num [1:918] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. YEAR = col_double(),
## .. PSP = col_double(),
## .. SITE = col_double(),
## .. LAYER = col_character(),
## .. C_STOCKS = col_double(),
## .. UTMX = col_double(),
## .. UTMY = col_double(),
## .. STAND_AGE = col_double(),
## .. BASAL_AREA = col_double(),
## .. DOM_HEIGHT = col_double(),
## .. SPECIES_RICHNESS = col_double(),
## .. SHANNON_INDEX = col_double(),
## .. GAP_FRACTION = col_double(),
## .. CANOPY_COVER = col_double(),
## .. ELEVATION = col_double(),
## .. SLOPE = col_double(),
## .. ASPECT = col_double(),
## .. CONDITION = col_character(),
## .. LAYER_NUM = col_double(),
## .. CON_NUM = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The categorical variables are converted to factors so the data can be grouped by sampling year, organic layer (L and FH), and forest management condition.
datasetFF <- datasetFF %>%
mutate(
YEAR = as.factor(YEAR),
LAYER = as.factor(LAYER),
CONDITION = as.factor(CONDITION),
LAYER_NUM = as.factor(LAYER_NUM),
CON_NUM = as.factor(CON_NUM)
)
To explore the distribution of carbon stocks in the forest floor, we computed descriptive statistics using the psych package. describe() provides overall statistics for the C_STOCKS variable, including mean, standard deviation, minimum, maximum, skewness, and kurtosis. describeBy() calculates the same summary statistics grouped by sampling year and organic layer (L and FH), which helps us assess variation across both temporal and vertical gradients.
describe(datasetFF$C_STOCKS)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 918 4.31 3.88 2.81 3.56 1.9 0 26.25 26.25 2.2 5.83 0.13
describeBy(datasetFF$C_STOCKS, list(datasetFF$YEAR, datasetFF$LAYER))
##
## Descriptive statistics by group
## : 2013
## : FH
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 153 10.37 5.24 9.46 10.08 4.69 0 26.25 26.25 0.56 0.27 0.42
## ------------------------------------------------------------
## : 2018
## : FH
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 153 4.2 1.9 4.09 4.18 1.89 0 10.01 10.01 0.1 0.11 0.15
## ------------------------------------------------------------
## : 2023
## : FH
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 153 5.13 2.31 5.18 5.07 2.16 0.94 14.32 13.38 0.58 1.26 0.19
## ------------------------------------------------------------
## : 2013
## : L
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 153 2.06 0.79 1.96 2 0.59 0.44 5.44 4.99 1.23 3.46 0.06
## ------------------------------------------------------------
## : 2018
## : L
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 153 1.89 0.58 1.8 1.84 0.52 0.8 4 3.21 0.83 0.99 0.05
## ------------------------------------------------------------
## : 2023
## : L
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 153 2.23 0.87 2.14 2.18 0.78 0.66 4.87 4.2 0.63 0.23 0.07
To assess the distribution of carbon stocks (C_STOCKS) across sampling years (2013, 2018, 2023) and organic layers (L and FH), we used histograms and normality diagnostics.
Histograms were generated using ggplot2, displaying the frequency distributions of C_STOCKS faceted by year and layer. This layout facilitates the identification of temporal and vertical patterns in carbon accumulation.
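As a compact alternative to the per-year subsets built below, the faceted layout described above can also be produced in a single call. The following sketch is illustrative and not part of the original scripts:
# Illustrative one-shot faceted histogram of C_STOCKS by layer and year
ggplot(datasetFF, aes(x = C_STOCKS)) +
  geom_histogram(bins = 30, fill = "lightgray", color = "darkgrey") +
  facet_grid(LAYER ~ YEAR, scales = "free") +
  labs(x = expression("Mg ha"^"-1"), y = "Count") +
  theme_bw()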
L layer
# Subset the data by year for the L layer
data2013_L <- subset(datasetFF, YEAR == "2013" & LAYER == "L")
data2018_L <- subset(datasetFF, YEAR == "2018" & LAYER == "L")
data2023_L <- subset(datasetFF, YEAR == "2023" & LAYER == "L")
FH layer
# Subset the data by year for the FH layer
data2013_FH <- subset(datasetFF, YEAR == "2013" & LAYER == "FH")
data2018_FH <- subset(datasetFF, YEAR == "2018" & LAYER == "FH")
data2023_FH <- subset(datasetFF, YEAR == "2023" & LAYER == "FH")
To evaluate the normality of carbon stock distributions in the L layer, we applied a combined visual and statistical approach. For each year, we produced histograms overlaid with both the kernel density estimate (blue line) and a theoretical normal distribution (red line), calculated using the observed mean and standard deviation. A dashed orange line indicates the mean value of each distribution.
# Expression for x-axis label
Mgha_expression1 <- expression("Mg ha"^"-1")
# Plot for 2013 - L
HIST_CL13 <- ggplot(data2013_L, aes(x = C_STOCKS)) +
geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=0.5, fill="lightgray") +
stat_function(aes(color="theoretical normal distribution"),
fun=dnorm, args=list(mean=mean(data2013_L$C_STOCKS), sd=sd(data2013_L$C_STOCKS)), size=0.5) +
geom_density(aes(color="density"), size=0.5) +
geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE), color="mean"), linetype="dashed", size=0.8) +
scale_color_manual(name="Data distribution\n L layer:",
values=c("density"="blue", "theoretical normal distribution"="red", "mean"="orange"),
labels=c("Observed density", "Mean value", "Theoretical normal")) +
labs(x=Mgha_expression1, y="Density", title="2013") +
annotate("text", x=4.85, y=0.68, label="p = 0.069", family="A") +
theme_bw() +
theme(text=element_text(size=12, family="A"),
legend.position="bottom",
legend.title=element_text(hjust=0.5, size=11, face="bold"),
legend.text=element_text(size=11),
plot.title=element_text(size=14, face="bold", hjust=0.5),
axis.text=element_text(size=12))
# Plot for 2018 - L
HIST_CL18 <- ggplot(data2018_L, aes(x = C_STOCKS)) +
geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=0.5, fill="lightgray") +
stat_function(aes(color="theoretical normal distribution"),
fun=dnorm, args=list(mean=mean(data2018_L$C_STOCKS), sd=sd(data2018_L$C_STOCKS)), size=0.5) +
geom_density(aes(color="density"), size=0.5) +
geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE), color="mean"), linetype="dashed", size=0.8) +
scale_color_manual(name="Data distribution\n L layer:",
values=c("density"="blue", "theoretical normal distribution"="red", "mean"="orange"),
labels=c("Observed density", "Mean value", "Theoretical normal")) +
labs(x=Mgha_expression1, y="Density", title="2018") +
annotate("text", x=4.85, y=0.68, label="p = 0.069", family="A") +
theme_bw() +
theme(text=element_text(size=12, family="A"),
legend.position="bottom",
legend.title=element_text(hjust=0.5, size=11, face="bold"),
legend.text=element_text(size=11),
plot.title=element_text(size=14, face="bold", hjust=0.5),
axis.text=element_text(size=12))
# Plot for 2023 - L
HIST_CL23 <- ggplot(data2023_L, aes(x = C_STOCKS)) +
geom_histogram(aes(y=..density..), alpha=0.3, size=0.4, color="darkgrey", binwidth=0.5, fill="lightgray") +
stat_function(aes(color="theoretical normal distribution"),
fun=dnorm, args=list(mean=mean(data2023_L$C_STOCKS), sd=sd(data2023_L$C_STOCKS)), size=0.5) +
geom_density(aes(color="density"), size=0.5) +
geom_vline(aes(xintercept=mean(C_STOCKS, na.rm=TRUE), color="mean"), linetype="dashed", size=0.8) +
scale_color_manual(name="Data distribution\n L layer:",
values=c("density"="blue", "theoretical normal distribution"="red", "mean"="orange"),
labels=c("Observed density", "Mean value", "Theoretical normal")) +
labs(x=Mgha_expression1, y="Density", title="2023") +
annotate("text", x=4.85, y=0.68, label="p = 0.069", family="A") +
theme_bw() +
theme(text=element_text(size=12, family="A"),
legend.position="bottom",
legend.title=element_text(hjust=0.5, size=11, face="bold"),
legend.text=element_text(size=11),
plot.title=element_text(size=14, face="bold", hjust=0.5),
axis.text=element_text(size=12))
# Combine the plots into a single panel
Supp1 <- ggarrange(HIST_CL13, HIST_CL18, HIST_CL23,
ncol=3, nrow=1,
common.legend=TRUE, legend="bottom")
# Display the panel
Supp1
# Plot for 2013 - FH
HIST_CFH13 <- ggplot(data2013_FH, aes(x = C_STOCKS)) +
  geom_histogram(aes(y = ..density..), alpha = 0.3, size = 0.4, color = "darkgrey",
                 binwidth = 4, fill = "lightgray") +
  geom_density(color = "blue", size = 0.5, position = "identity") +
  annotate("text", x = 23.1, y = 0.09, label = "p = 0.394", family = "A") +
  geom_vline(aes(xintercept = mean(C_STOCKS, na.rm = TRUE)), # ignore NA values for the mean
             color = "orange", linetype = "dashed", size = 0.8) +
  stat_function(fun = dnorm, args = list(mean = mean(data2013_FH$C_STOCKS),
                                         sd = sd(data2013_FH$C_STOCKS)),
                color = "red", size = 0.5) +
  labs(x = Mgha_expression1, y = "Density", title = "2013") +
  theme_bw() +
  theme(text = element_text(size = 11, family = "A"),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 11, face = "bold", hjust = 0),
        axis.text = element_text(size = 11))
# Plot for 2018 - FH
HIST_CFH18 <- ggplot(data2018_FH, aes(x = C_STOCKS)) +
  geom_histogram(aes(y = ..density..), alpha = 0.3, size = 0.4, color = "darkgrey",
                 binwidth = 1.5, fill = "lightgray") +
  geom_density(color = "blue", size = 0.5, position = "identity") +
  annotate("text", x = 8.6, y = 0.22, label = "p = 0.825", family = "A") +
  geom_vline(aes(xintercept = mean(C_STOCKS, na.rm = TRUE)), # ignore NA values for the mean
             color = "orange", linetype = "dashed", size = 0.8) +
  stat_function(fun = dnorm, args = list(mean = mean(data2018_FH$C_STOCKS),
                                         sd = sd(data2018_FH$C_STOCKS)),
                color = "red", size = 0.5) +
  labs(x = Mgha_expression1, y = "Density", title = "2018") +
  theme_bw() +
  theme(text = element_text(size = 11, family = "A"),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 11, face = "bold", hjust = 0),
        axis.text = element_text(size = 11))
# Plot for 2023 - FH
HIST_CFH23 <- ggplot(data2023_FH, aes(x = C_STOCKS)) +
  geom_histogram(aes(y = ..density..), alpha = 0.3, size = 0.4, color = "darkgrey",
                 binwidth = 2.5, fill = "lightgray") +
  geom_density(color = "blue", size = 0.5, position = "identity") +
  annotate("text", x = 13, y = 0.18, label = "p = 0.788", family = "A") +
  geom_vline(aes(xintercept = mean(C_STOCKS, na.rm = TRUE)), # ignore NA values for the mean
             color = "orange", linetype = "dashed", size = 0.8) +
  stat_function(fun = dnorm, args = list(mean = mean(data2023_FH$C_STOCKS),
                                         sd = sd(data2023_FH$C_STOCKS)),
                color = "red", size = 0.5) +
  labs(x = Mgha_expression1, y = "Density", title = "2023") +
  theme_bw() +
  theme(text = element_text(size = 11, family = "A"),
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 11, face = "bold", hjust = 0),
        axis.text = element_text(size = 11))
# Combine the plots into a single panel
Supp2 <- ggarrange(HIST_CFH13, HIST_CFH18, HIST_CFH23,
                   ncol = 3, nrow = 1,
                   common.legend = TRUE, legend = "bottom")
# Display the panel
Supp2
Additionally, the Kolmogorov-Smirnov test was used to formally assess whether the data deviated significantly from a normal distribution. These results were visualized separately per year and integrated into a single panel for clarity.
In all three cases, the test returned p-values greater than 0.05, so the null hypothesis of normality could not be rejected. These results suggest that the carbon stock data for the L layer are reasonably consistent with a normal distribution across the evaluated years.
L layer
ks.test(data2013_L$C_STOCKS,
pnorm,
mean(data2013_L$C_STOCKS),
sd(data2013_L$C_STOCKS))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data2013_L$C_STOCKS
## D = 0.1049, p-value = 0.06897
## alternative hypothesis: two-sided
ks.test(data2018_L$C_STOCKS,pnorm,mean(data2018_L$C_STOCKS),sd(data2018_L$C_STOCKS))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data2018_L$C_STOCKS
## D = 0.092277, p-value = 0.1477
## alternative hypothesis: two-sided
ks.test(data2023_L$C_STOCKS,pnorm,mean(data2023_L$C_STOCKS),sd(data2023_L$C_STOCKS))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data2023_L$C_STOCKS
## D = 0.069741, p-value = 0.4463
## alternative hypothesis: two-sided
FH layer
ks.test(data2013_FH$C_STOCKS,
pnorm,
mean(data2013_FH$C_STOCKS),
sd(data2013_FH$C_STOCKS))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data2013_FH$C_STOCKS
## D = 0.072686, p-value = 0.394
## alternative hypothesis: two-sided
ks.test(data2018_FH$C_STOCKS,pnorm,mean(data2018_FH$C_STOCKS),sd(data2018_FH$C_STOCKS))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data2018_FH$C_STOCKS
## D = 0.050761, p-value = 0.8254
## alternative hypothesis: two-sided
ks.test(data2023_FH$C_STOCKS,pnorm,mean(data2023_FH$C_STOCKS),sd(data2023_FH$C_STOCKS))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data2023_FH$C_STOCKS
## D = 0.052754, p-value = 0.7881
## alternative hypothesis: two-sided
To explore monotonic relationships between carbon stocks and a set of biotic and abiotic variables, we computed Spearman’s rank correlation coefficients for the year 2013. This non-parametric method is suitable for assessing correlations without assuming normality or linearity in the data.
The analysis included spatial coordinates (UTMX, UTMY), stand structure variables (STAND_AGE, BASAL_AREA, DOM_HEIGHT), biodiversity metrics (SPECIES_RICHNESS, SHANNON_INDEX), canopy features (GAP_FRACTION, CANOPY_COVER), and topographic attributes (ELEVATION, SLOPE, ASPECT).
The output provides the Spearman correlation matrix (r), the number of observations (n), and the matrix of pairwise p-values (P). Values close to ±1 represent strong associations. Significant positive or negative correlations with carbon stocks (C_STOCKS) point to variables that may influence or co-vary with organic carbon accumulation in the forest floor during that year.
Example for 2013
data2013 <- datasetFF %>% filter(YEAR == "2013")
corr_vars <- data2013 %>%
  select(C_STOCKS, UTMX, UTMY, STAND_AGE, BASAL_AREA, DOM_HEIGHT,
         SPECIES_RICHNESS, SHANNON_INDEX, GAP_FRACTION, CANOPY_COVER,
         ELEVATION, SLOPE, ASPECT)
rcorr(as.matrix(corr_vars), type = "spearman")
## C_STOCKS UTMX UTMY STAND_AGE BASAL_AREA DOM_HEIGHT
## C_STOCKS 1.00 0.02 0.08 0.12 0.19 0.19
## UTMX 0.02 1.00 -0.07 0.04 0.32 0.00
## UTMY 0.08 -0.07 1.00 0.13 0.17 0.50
## STAND_AGE 0.12 0.04 0.13 1.00 0.56 0.53
## BASAL_AREA 0.19 0.32 0.17 0.56 1.00 0.59
## DOM_HEIGHT 0.19 0.00 0.50 0.53 0.59 1.00
## SPECIES_RICHNESS 0.12 0.18 -0.03 0.49 0.53 0.42
## SHANNON_INDEX 0.11 0.13 0.08 0.54 0.43 0.47
## GAP_FRACTION -0.12 -0.36 -0.05 -0.51 -0.72 -0.42
## CANOPY_COVER 0.12 0.36 0.05 0.51 0.72 0.42
## ELEVATION 0.07 -0.42 0.65 0.01 -0.02 0.15
## SLOPE -0.07 0.07 -0.29 -0.13 -0.12 -0.01
## ASPECT 0.03 0.13 0.11 0.10 0.17 0.12
## SPECIES_RICHNESS SHANNON_INDEX GAP_FRACTION CANOPY_COVER
## C_STOCKS 0.12 0.11 -0.12 0.12
## UTMX 0.18 0.13 -0.36 0.36
## UTMY -0.03 0.08 -0.05 0.05
## STAND_AGE 0.49 0.54 -0.51 0.51
## BASAL_AREA 0.53 0.43 -0.72 0.72
## DOM_HEIGHT 0.42 0.47 -0.42 0.42
## SPECIES_RICHNESS 1.00 0.87 -0.61 0.61
## SHANNON_INDEX 0.87 1.00 -0.59 0.59
## GAP_FRACTION -0.61 -0.59 1.00 -1.00
## CANOPY_COVER 0.61 0.59 -1.00 1.00
## ELEVATION -0.13 -0.03 -0.01 0.01
## SLOPE -0.03 -0.07 0.19 -0.19
## ASPECT 0.16 0.14 -0.29 0.29
## ELEVATION SLOPE ASPECT
## C_STOCKS 0.07 -0.07 0.03
## UTMX -0.42 0.07 0.13
## UTMY 0.65 -0.29 0.11
## STAND_AGE 0.01 -0.13 0.10
## BASAL_AREA -0.02 -0.12 0.17
## DOM_HEIGHT 0.15 -0.01 0.12
## SPECIES_RICHNESS -0.13 -0.03 0.16
## SHANNON_INDEX -0.03 -0.07 0.14
## GAP_FRACTION -0.01 0.19 -0.29
## CANOPY_COVER 0.01 -0.19 0.29
## ELEVATION 1.00 -0.57 0.05
## SLOPE -0.57 1.00 -0.07
## ASPECT 0.05 -0.07 1.00
##
## n= 306
##
##
## P
## C_STOCKS UTMX UTMY STAND_AGE BASAL_AREA DOM_HEIGHT
## C_STOCKS 0.6860 0.1840 0.0379 0.0008 0.0008
## UTMX 0.6860 0.2476 0.5141 0.0000 0.9740
## UTMY 0.1840 0.2476 0.0204 0.0030 0.0000
## STAND_AGE 0.0379 0.5141 0.0204 0.0000 0.0000
## BASAL_AREA 0.0008 0.0000 0.0030 0.0000 0.0000
## DOM_HEIGHT 0.0008 0.9740 0.0000 0.0000 0.0000
## SPECIES_RICHNESS 0.0389 0.0016 0.6289 0.0000 0.0000 0.0000
## SHANNON_INDEX 0.0457 0.0259 0.1498 0.0000 0.0000 0.0000
## GAP_FRACTION 0.0423 0.0000 0.3943 0.0000 0.0000 0.0000
## CANOPY_COVER 0.0423 0.0000 0.3943 0.0000 0.0000 0.0000
## ELEVATION 0.2192 0.0000 0.0000 0.9044 0.7036 0.0076
## SLOPE 0.2284 0.2116 0.0000 0.0204 0.0423 0.8439
## ASPECT 0.6386 0.0283 0.0460 0.0715 0.0022 0.0403
## SPECIES_RICHNESS SHANNON_INDEX GAP_FRACTION CANOPY_COVER
## C_STOCKS 0.0389 0.0457 0.0423 0.0423
## UTMX 0.0016 0.0259 0.0000 0.0000
## UTMY 0.6289 0.1498 0.3943 0.3943
## STAND_AGE 0.0000 0.0000 0.0000 0.0000
## BASAL_AREA 0.0000 0.0000 0.0000 0.0000
## DOM_HEIGHT 0.0000 0.0000 0.0000 0.0000
## SPECIES_RICHNESS 0.0000 0.0000 0.0000
## SHANNON_INDEX 0.0000 0.0000 0.0000
## GAP_FRACTION 0.0000 0.0000 0.0000
## CANOPY_COVER 0.0000 0.0000 0.0000
## ELEVATION 0.0261 0.5621 0.8946 0.8946
## SLOPE 0.6487 0.2398 0.0008 0.0008
## ASPECT 0.0054 0.0129 0.0000 0.0000
## ELEVATION SLOPE ASPECT
## C_STOCKS 0.2192 0.2284 0.6386
## UTMX 0.0000 0.2116 0.0283
## UTMY 0.0000 0.0000 0.0460
## STAND_AGE 0.9044 0.0204 0.0715
## BASAL_AREA 0.7036 0.0423 0.0022
## DOM_HEIGHT 0.0076 0.8439 0.0403
## SPECIES_RICHNESS 0.0261 0.6487 0.0054
## SHANNON_INDEX 0.5621 0.2398 0.0129
## GAP_FRACTION 0.8946 0.0008 0.0000
## CANOPY_COVER 0.8946 0.0008 0.0000
## ELEVATION 0.0000 0.4161
## SLOPE 0.0000 0.2223
## ASPECT 0.4161 0.2223
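Because PerformanceAnalytics is already loaded, the same matrix can also be inspected visually. The call below is a sketch for quick exploration, not part of the original workflow:
# Illustrative pairwise plot: scatterplots, histograms, and Spearman
# coefficients for the same variable set
chart.Correlation(as.matrix(corr_vars), histogram = TRUE, method = "spearman")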
To assess multicollinearity among predictor variables, we computed the Variance Inflation Factor (VIF) from a multiple linear regression model, using the year 2013 as an example. The model included the spatial, structural, biodiversity, canopy, and topographic predictors associated with carbon stocks (C_STOCKS); CANOPY_COVER was excluded because it is perfectly (negatively) correlated with GAP_FRACTION in these data.
VIF quantifies how much the variance of a regression coefficient is inflated by linear correlations with the other predictors, and is used to detect redundancy among explanatory variables. As a common rule of thumb, VIF values above 5-10 signal problematic collinearity.
The results help identify potential multicollinearity issues that could affect the stability and interpretability of model coefficients in subsequent modeling steps.
Example for 2013
modelC_2013VIF <- lm(C_STOCKS ~ UTMX + UTMY + STAND_AGE + BASAL_AREA + DOM_HEIGHT + SPECIES_RICHNESS + SHANNON_INDEX + GAP_FRACTION + ELEVATION + SLOPE + ASPECT, data = data2013)
vif(modelC_2013VIF)
## UTMX UTMY STAND_AGE BASAL_AREA
## 1.969658 2.562707 1.380057 4.162700
## DOM_HEIGHT SPECIES_RICHNESS SHANNON_INDEX GAP_FRACTION
## 2.317511 5.625841 5.966160 3.798482
## ELEVATION SLOPE ASPECT
## 3.869923 1.694583 1.105195
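A quick way to act on these results is to compare each VIF against a cutoff. The threshold of 5 used below is a common rule of thumb and is illustrative, not taken from the original analysis:
# Illustrative screening step: list predictors whose VIF exceeds 5
vif_values <- vif(modelC_2013VIF)
vif_values[vif_values > 5]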
Before fitting the Random Forest models, we split the dataset into training and validation subsets to evaluate model performance on unseen data. This step is critical to avoid overfitting and ensure the generalizability of the model.
We used a random sampling approach for the year 2013, allocating 85% of the observations to training and 15% to validation.
Stratification by organic layer (LAYER) was applied to maintain the proportion of L and FH samples across both subsets. This helps preserve the distributional characteristics of each layer during model training and evaluation.
Example for 2013
# Commented out so that re-rendering does not overwrite the saved split files
#set.seed(123)
#train_idx <- sample(seq_len(nrow(data2013)), size = 0.85 * nrow(data2013))
#training_2013 <- data2013[train_idx, ]
#testing_2013 <- data2013[-train_idx, ]
#write_csv(training_2013, here("data", "training_2013_85.csv"))
#write_csv(testing_2013, here("data", "testing_2013_15.csv"))
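Note that the commented code above draws a simple random sample. A layer-stratified split, as described in the text, can be sketched with caret::createDataPartition, which samples within each factor level; the object names here are illustrative, and this is not the code that produced the saved files:
# Illustrative stratified 85/15 split by organic layer
set.seed(123)
train_idx_strat <- createDataPartition(data2013$LAYER, p = 0.85, list = FALSE)
training_2013_strat <- data2013[train_idx_strat, ]
testing_2013_strat <- data2013[-train_idx_strat, ]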
This exploratory analysis identified key variables, distributional patterns, and correlation structures to inform the subsequent Random Forest modeling stages. All outputs are saved in the /outputs folder, and this report is rendered to /docs for easy viewing.
To improve model performance and interpretability, we performed a feature selection procedure (FSP) prior to model training. This step helps to identify the most relevant predictor variables while reducing noise and minimizing redundancy among features.
The procedure was applied to the training subset only, to avoid information leakage and ensure proper model validation. Selection was based on a combination of statistical criteria, including the correlation and multicollinearity screening described above and the importance of each predictor relative to randomized (shadow) features in the Boruta algorithm.
By filtering out less informative or redundant predictors, this step ensures that the final model remains parsimonious, robust, and interpretable.
In this section, we load the training and testing datasets previously created during the data splitting stage (Section 3.7). The subsets were saved as .csv files to ensure reproducibility and consistency across the modeling steps.
As an example, we load the datasets for the year 2013, which were split with an 85/15 ratio for training and validation, respectively. After loading, we ensure that the categorical variables (YEAR, LAYER, and CONDITION) are properly defined as factors, which is necessary for model fitting and stratified evaluation.
# Load train and test datasets for 2013
data_train2013 <- read_csv(here("data", "training_2013_85.csv"))
data_test2013 <- read_csv(here("data", "testing_2013_15.csv"))
# Convert grouping variables to factors
data_train2013$YEAR <- as.factor(data_train2013$YEAR)
data_train2013$LAYER <- as.factor(data_train2013$LAYER)
data_train2013$CONDITION <- as.factor(data_train2013$CONDITION)
data_test2013$YEAR <- as.factor(data_test2013$YEAR)
data_test2013$LAYER <- as.factor(data_test2013$LAYER)
data_test2013$CONDITION <- as.factor(data_test2013$CONDITION)
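The same conversion can be written more compactly with dplyr::across. This equivalent sketch is offered as an alternative, not as the original code:
# Equivalent factor conversion using mutate(across(...))
fac_vars <- c("YEAR", "LAYER", "CONDITION")
data_train2013 <- data_train2013 %>% mutate(across(all_of(fac_vars), as.factor))
data_test2013 <- data_test2013 %>% mutate(across(all_of(fac_vars), as.factor))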
#Select covariates
names(data_train2013)
## [1] "YEAR" "PSP" "SITE" "LAYER"
## [5] "C_STOCKS" "UTMX" "UTMY" "STAND_AGE"
## [9] "BASAL_AREA" "DOM_HEIGHT" "SPECIES_RICHNESS" "SHANNON_INDEX"
## [13] "GAP_FRACTION" "CANOPY_COVER" "ELEVATION" "SLOPE"
## [17] "ASPECT" "CONDITION" "LAYER_NUM" "CON_NUM"
# Keep YEAR, LAYER, C_STOCKS, and the covariates, then assign short names
data2013_formodels <- data.frame(data_train2013[c(1, 4, 5, 6:18)])
names(data2013_formodels) <- c("YEAR", "FFLAYER", "C",
                               "UTMX", "UTMY",
                               "AGE", "BA", "DTH", "RICH",
                               "H", "GAP", "COV", "ELE", "SLO", "ASP", "MAN")
To identify the most relevant predictors and avoid overfitting due to the inclusion of irrelevant variables, we applied the Boruta algorithm, a wrapper method specifically designed for feature selection in machine learning models—particularly effective with Random Forest (RF).
Boruta works by comparing the importance of real features with that of randomized (shadow) features. It iteratively removes variables that consistently show lower importance than the best-performing shadow attribute. This ensures that only variables that are statistically significantly more informative than noise are retained.
We tested Boruta's performance using different values of maxRuns (100, 500, and 1000) to evaluate the stability and robustness of the selected features. All runs used the same model parameters (ntree = 500, mtry = 4, pValue = 0.5) and were performed on the 2013 training data (data2013_formodels).
# maxRuns = 100 (Boruta's default, so not passed explicitly)
borutaRF_traindf_C13_100runs <- Boruta(C ~ UTMX + UTMY + ELE + SLO + ASP + AGE + DTH + BA +
GAP + COV + RICH + H + MAN + FFLAYER,
data = data2013_formodels,
doTrace = 3, pValue = 0.5, ntree = 500, mtry = 4)
print(borutaRF_traindf_C13_100runs$finalDecision)
## FFLAYER MAN H RICH COV GAP BA DTH
## Confirmed Rejected Rejected Rejected Confirmed Confirmed Confirmed Confirmed
## AGE ASP SLO ELE UTMY UTMX
## Rejected Rejected Rejected Rejected Rejected Rejected
## Levels: Tentative Confirmed Rejected
BORzscores_traindf_C13_100runs <- attStats(borutaRF_traindf_C13_100runs)
# MaxRuns = 500
borutaRF_traindf_C13_500runs <- Boruta(C ~ UTMX + UTMY + ELE + SLO + ASP + AGE + DTH + BA +
GAP + COV + RICH + H + MAN + FFLAYER,
data = data2013_formodels,
doTrace = 3, pValue = 0.5, ntree = 500, mtry = 4, maxRuns = 500)
print(borutaRF_traindf_C13_500runs$finalDecision)
## FFLAYER MAN H RICH COV GAP BA DTH
## Confirmed Rejected Rejected Rejected Confirmed Confirmed Confirmed Confirmed
## AGE ASP SLO ELE UTMY UTMX
## Tentative Rejected Rejected Rejected Rejected Rejected
## Levels: Tentative Confirmed Rejected
BORzscores_traindf_C13_500runs <- attStats(borutaRF_traindf_C13_500runs)
plot(borutaRF_traindf_C13_500runs)
# MaxRuns = 1000
borutaRF_traindf_C13_1000runs <- Boruta(C ~ UTMX + UTMY + ELE + SLO + ASP + AGE + DTH + BA +
GAP + COV + RICH + H + MAN + FFLAYER,
data = data2013_formodels,
doTrace = 3, pValue = 0.5, ntree = 500, mtry = 4, maxRuns = 1000)
print(borutaRF_traindf_C13_1000runs$finalDecision)
## FFLAYER MAN H RICH COV GAP BA DTH
## Confirmed Rejected Rejected Tentative Confirmed Confirmed Confirmed Confirmed
## AGE ASP SLO ELE UTMY UTMX
## Rejected Rejected Rejected Rejected Rejected Rejected
## Levels: Tentative Confirmed Rejected
BORzscores_traindf_C13_1000runs <- attStats(borutaRF_traindf_C13_1000runs)
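Beyond the printed decisions, the names of the confirmed predictors can be extracted for reuse in downstream model formulas. This small sketch uses Boruta's getSelectedAttributes() and is not part of the original scripts:
# Illustrative helper step: extract confirmed predictors from a Boruta object
confirmed_vars <- getSelectedAttributes(borutaRF_traindf_C13_1000runs,
                                        withTentative = FALSE)
confirmed_vars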
The table below summarizes the final decision and mean importance score for each predictor across the three Boruta runs (maxRuns = 100, 500, and 1000). The graph provides a visual comparison of the average importance of each variable, helping to evaluate the consistency of Boruta's selection process.
# Extract statistics from Boruta runs
BOR100 <- attStats(borutaRF_traindf_C13_100runs)
BOR500 <- attStats(borutaRF_traindf_C13_500runs)
BOR1000 <- attStats(borutaRF_traindf_C13_1000runs)
# Add run identifier
BOR100$Variable <- rownames(BOR100)
BOR500$Variable <- rownames(BOR500)
BOR1000$Variable <- rownames(BOR1000)
BOR100$Run <- "maxRuns = 100"
BOR500$Run <- "maxRuns = 500"
BOR1000$Run <- "maxRuns = 1000"
# Combine all runs
BOR_all <- rbind(BOR100, BOR500, BOR1000)
# Select and reshape for table
bor_table <- BOR_all %>%
select(Variable, meanImp, decision, Run) %>%
pivot_wider(names_from = Run, values_from = c(meanImp, decision))
bor_table
## # A tibble: 14 × 7
## Variable meanImp_maxRuns = 10…¹ meanImp_maxRuns = 50…² meanImp_maxRuns = 10…³
## <chr> <dbl> <dbl> <dbl>
## 1 FFLAYER 65.2 78.9 79.3
## 2 MAN -0.0175 0.212 0.720
## 3 H 0.384 1.09 1.87
## 4 RICH 0.749 0.666 2.26
## 5 COV 4.92 6.14 6.36
## 6 GAP 5.25 6.03 6.22
## 7 BA 9.75 11.5 11.5
## 8 DTH 8.22 8.16 8.65
## 9 AGE 1.80 2.23 1.37
## 10 ASP -1.17 -0.725 -1.64
## 11 SLO -0.0349 -0.583 -0.439
## 12 ELE 0.488 0.583 0.138
## 13 UTMY 0.300 -0.629 -0.572
## 14 UTMX -0.0370 -1.61 -0.0424
## # ℹ abbreviated names: ¹`meanImp_maxRuns = 100`, ²`meanImp_maxRuns = 500`,
## # ³`meanImp_maxRuns = 1000`
## # ℹ 3 more variables: `decision_maxRuns = 100` <fct>,
## # `decision_maxRuns = 500` <fct>, `decision_maxRuns = 1000` <fct>
# Create importance plot
ggplot(BOR_all, aes(x = Variable, y = meanImp, color = Run, group = Run)) +
geom_line(linewidth = 0.7) +
geom_point(size = 2) +
labs(title = "Comparison of Mean Importance Across Boruta Runs",
x = "Predictor Variable",
y = "Mean Importance Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This code:
- Extracts the results of each run with attStats()
- Builds a wide comparison table
- Generates a line plot to visualize the stability of the mean importance scores
The figure below focuses on Confirmed predictors across Boruta runs. Since the Boruta output does not include a standard deviation of importance (sdImp) by default, we represent importance variability using the range between the minimum and maximum importance values observed for each variable.
This provides a conservative estimate of how stable each variable's importance was during the selection process.
# Reuse BOR_all (the combined, labeled Boruta results) from the previous chunk
# Filter confirmed variables only
BOR_confirmed <- BOR_all %>%
filter(decision == "Confirmed")
# Plot using minImp and maxImp as error bars
ggplot(BOR_confirmed, aes(x = Variable, y = meanImp, fill = Run)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
geom_errorbar(aes(ymin = minImp, ymax = maxImp),
position = position_dodge(width = 0.7), width = 0.3) +
labs(title = "Confirmed Features: Importance Range Across Boruta Runs",
x = "Predictor Variable", y = "Mean Importance (min–max range)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom")
This section presents the core modeling framework using the Random Forest (RF) algorithm. The steps below were implemented with the randomForest and ranger packages in R, applied to the training data previously filtered and prepared (the object data2013_formodels).
The workflow proceeds from hyperparameter tuning through model training with cross-validation. Each modeling step is documented with code, diagnostics, and interpretation, following best practices in machine learning for environmental applications.
To ensure consistent and readable labeling across all plots in this section, we defined custom expressions for axis titles and figure annotations using expression() in R. These expressions allow the inclusion of mathematical notation (e.g., superscripts, bold text, and subscripts) commonly used in scientific reports.
# Axis and plot label expressions
Mgha_expression <- expression("C stocks (Mg ha"^"-1"~")")
Mgha_expression2 <- expression(bold("a) C models (Mg ha"^"-1"~")"))
Mgha_expression3 <- expression("Mg ha"^"-1")
# Model label expressions for plot annotations or legends
initialmodel_expresion <- expression(paste("RF"[0]," (Default values)"))
finalmodel_expression <- expression(paste("RF"[1], " (Optimized values)"))
To improve the performance of the Random Forest (RF) model, we conducted a hyperparameter tuning procedure using a custom modeling function in combination with the caret package. Specifically, we tuned two key hyperparameters:
- ntree: the number of trees to grow in the forest.
- mtry: the number of variables randomly sampled at each split.
We created a custom model structure (customRF) to enable tuning of both parameters simultaneously, because the default implementation of caret::train() does not allow tuning ntree directly.
A grid search was performed over combinations of ntree values (from 500 to 1000) and mtry values (from 1 to 12). The tuning process was evaluated using repeated 10-fold cross-validation with 5 repetitions (method = "repeatedcv") to reduce variance in the performance estimates.
This procedure was applied to the 2013 training data (data_train2013), selecting only the relevant predictors for modeling. The resulting hyperparameter configuration was used for model optimization in the following sections.
# Define custom Random Forest model structure
customRF <- list(type = "Regression",
library = "randomForest",
loop = NULL)
customRF$parameters <- data.frame(parameter = c("ntree", "mtry"),
class = rep("numeric", 2),
label = c("ntree", "mtry"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
randomForest(x, y, mtry = param$mtry, ntree = param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
predict(modelFit, newdata)
}
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL) {
predict(modelFit, newdata, type = "prob")
}
customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes
# Define tuning grid
set.seed(12345)
tunegrid_all <- expand.grid(.ntree = c(500, 600, 700, 800, 900, 1000),
.mtry = c(1:12))
# Define training control using repeated 10-fold CV
controltenfold_5rep <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
search = 'grid')
# Select predictors from the training and test sets:
# LAYER (column 4) plus the covariates in columns 8-18 (STAND_AGE through CONDITION)
all_Predictors2013_train <- data_train2013[c(4, 8:18)]
all_Predictors2013_test <- data_test2013[c(4, 8:18)]
After defining the custom Random Forest structure and the hyperparameter grid, we trained the model on the 2013 training data using the train() function from the caret package.
The model was trained with repeated 10-fold cross-validation (5 repetitions) to ensure stable performance estimates. The optimization metric was the Root Mean Square Error (RMSE), which reflects the accuracy of continuous predictions.
The output includes:
- The best combination of hyperparameters (ntree and mtry)
- The final Random Forest model object
- A detailed summary of performance metrics for all hyperparameter combinations tested
# Apply Random Forest model using caret and custom settings
best_hyperpars_C2013_5rep <- train(x = all_Predictors2013_train,
y = data_train2013$C_STOCKS,
method = customRF,
metric = "RMSE",
tuneGrid = tunegrid_all,
trControl = controltenfold_5rep)
# View summary of results
best_hyperpars_C2013_5rep
best_hyperpars_C2013_5rep$finalModel
View(best_hyperpars_C2013_5rep$results)
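Once training finishes, the winning hyperparameters are stored in the bestTune slot of the caret object. The refit below is a sketch of how they might be reused in a standalone randomForest call; it is not code from the original scripts:
# Illustrative follow-up: inspect the best combination and refit with it
best_hyperpars_C2013_5rep$bestTune
RF_refit_2013 <- randomForest(x = all_Predictors2013_train,
                              y = data_train2013$C_STOCKS,
                              ntree = best_hyperpars_C2013_5rep$bestTune$ntree,
                              mtry = best_hyperpars_C2013_5rep$bestTune$mtry)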
Because this training procedure takes a long time to run, the remaining steps are not executed in this report. Check the /scripts folder to follow the next modeling steps:
3_RandomForest_Example.R
This script performs the second step of the workflow: RF modeling.