Train-Test Split for Machine Learning with R

Codes

October 18, 2024

To ensure the generalizability of the model, we split the data into training and testing sets, using the former to train the model and the latter to test it.

simple random sampling for train-test splits

set.seed(123)  
index <- sample(nrow(ecodata), round(nrow(ecodata)*0.8))  
training <- ecodata[index, ]  
test <- ecodata[-index, ]  

A train-test split can be performed using the caret, h2o, and rsample packages. However, the main challenge is often ensuring the response variable has the same distributions in both the training and test sets.
We can check the distributions of continious outcomes or proportions of classes of categorical varibales:

# continious response
ks.test(training$GCI, test$GCI)

# categorical response
churn <- modeldata::attrition

set.seed(123)
index2 <- sample(nrow(churn), round(nrow(churn)*0.8))
training2 <- churn[index, ]
test2 <- churn[-index, ]

table(training2$Attrition) %>% prop.table() 
# No 0.83 Yes 0.17
table(test2$Attrition) %>% prop.table() 
# No 0.84 Yes 0.16

stratified random sampling for train-test splits

The differences between train and test splits could be a problem if

for quantiative data, the sample size is small and response is highly skewed
for categorical response, there is huge imbalance between classes In those cases we might use stratified random sampling for the split

# stratified random sampling for categorical responses with the rsample package
set.seed(123)
split_strat  <- initial_split(churn, prop = 0.7, 
                              strata = "Attrition")
train_strat  <- training(split_strat)
test_strat   <- testing(split_strat)

# stratified random sampling for continious response with the cater package
library(caret)
set.seed(123)
trainIndex <- createDataPartition(ecodata$GCI, p = 0.8, list = FALSE)
training <- ecodata[trainIndex,]
test <- ecodata[-trainIndex,]