Train-Test Split for Machine Learning with R

To estimate how well a model generalizes to unseen data, we split the data into training and test sets: the former is used to fit the model and the latter to evaluate it.

Simple Random Sampling for Train-Test Splits

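set.seed(123)  # fix the random seed so the split is reproducible
# draw 80% of the row indices at random, without replacement
index <- sample(nrow(ecodata), round(nrow(ecodata) * 0.8))
training <- ecodata[index, ]   # rows selected for training
test <- ecodata[-index, ]      # all remaining rows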
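
The same steps can be wrapped in a small helper so the split is easy to repeat with other proportions. This is a minimal sketch, not part of any package: the function name split_data and its arguments are hypothetical, and the built-in iris data is used only as a stand-in because it ships with R.

```r
# Hypothetical helper: simple random train-test split of a data frame.
# Note that set.seed() inside the function resets the global RNG state.
split_data <- function(data, prop = 0.8, seed = 123) {
  set.seed(seed)                          # make the split reproducible
  n_train <- round(nrow(data) * prop)     # number of training rows
  idx <- sample(nrow(data), n_train)      # row indices, drawn without replacement
  list(train = data[idx, , drop = FALSE],
       test  = data[-idx, , drop = FALSE])
}

# Usage on the built-in iris data (150 rows)
parts <- split_data(iris, prop = 0.8)
nrow(parts$train)  # 120
nrow(parts$test)   # 30
```

Returning both parts in one list keeps the training and test sets together, which makes it harder to pair a training set with the wrong test set later on.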

A train-test split can also be performed with the caret, h2o, and rsample packages. However, the main challenge is often ensuring that the response variable has a similar distribution in both the training and test sets.
We can check the distributions of continuous outcomes or the class proportions of categorical variables:

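# continuous response
# two-sample Kolmogorov-Smirnov test: a large p-value gives no evidence
# that the two distributions differ
ks.test(training$GCI, test$GCI)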
# categorical response
library(magrittr)  # provides the %>% pipe used below
churn <- modeldata::attrition

set.seed(123)
index2 <- sample(nrow(churn), round(nrow(churn)*0.8))
training2 <- churn[index2, ]
test2 <- churn[-index2, ]

table(training2$Attrition) %>% prop.table() 
# No 0.83 Yes 0.17
table(test2$Attrition) %>% prop.table() 
# No 0.84 Yes 0.16
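
Both checks can also be done side by side in one table, which makes differences between the two sets easier to spot. The following is a minimal sketch in base R. It assumes the built-in iris data as a stand-in for the data sets above, and the object names (idx, train_chk, test_chk, quants, props) are illustrative only.

```r
# Assumption: built-in iris data stands in for the tutorial's data sets.
set.seed(123)
idx       <- sample(nrow(iris), round(nrow(iris) * 0.8))
train_chk <- iris[idx, ]
test_chk  <- iris[-idx, ]

# Continuous variable: quantiles of both sets, one row per set
quants <- rbind(train = quantile(train_chk$Sepal.Length),
                test  = quantile(test_chk$Sepal.Length))
quants

# Categorical variable: class proportions of both sets, one row per set
props <- rbind(train = prop.table(table(train_chk$Species)),
               test  = prop.table(table(test_chk$Species)))
round(props, 2)
```

Large gaps between the two rows of either table suggest that the random split did not preserve the distribution of that variable.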

Stratified Random Sampling for Train-Test Splits

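The differences between the training and test sets could be a problem if the response is imbalanced or the data set is small, because a simple random split can then leave one set with too few cases of a rare class or range of values. Stratified sampling draws the split within each class (or within quantile groups of a continuous response) so that both sets keep a similar distribution.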

# stratified random sampling for categorical responses with the rsample package
library(rsample)
set.seed(123)
split_strat  <- initial_split(churn, prop = 0.7, 
                              strata = "Attrition")
train_strat  <- training(split_strat)
test_strat   <- testing(split_strat)

# stratified random sampling for continuous response with the caret package
library(caret)
set.seed(123)
trainIndex <- createDataPartition(ecodata$GCI, p = 0.8, list = FALSE)
training <- ecodata[trainIndex,]
test <- ecodata[-trainIndex,]
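
To make the idea behind stratification concrete, the sketch below does by hand, in base R, roughly what the package functions above automate for a categorical response: it samples the same proportion within each class. It is an illustration under stated assumptions, not the implementation used by rsample or caret. The built-in iris data and its Species column are stand-ins, and the object names are hypothetical.

```r
# Assumption: built-in iris data, with Species as the stratification variable.
set.seed(123)

# Row indices grouped by class
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Draw 80% of the rows within each class, then combine the indices
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), round(length(rows) * 0.8))]
}), use.names = FALSE)

train_s <- iris[train_idx, ]
test_s  <- iris[-train_idx, ]

table(train_s$Species)  # 40 rows per class
table(test_s$Species)   # 10 rows per class
```

Because every class contributes the same share of its rows, the class proportions in the training and test sets match those of the full data up to rounding.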