Train-Test Split for Machine Learning with R
To ensure the generalizability of the model, we split the data into training and testing sets, using the former to train the model and the latter to test it.
simple random sampling for train-test splits
set.seed(123)
index <- sample(nrow(ecodata), round(nrow(ecodata)*0.8))
training <- ecodata[index, ]
test <- ecodata[-index, ]
A train-test split can be performed using the caret, h2o, and rsample packages. However, the main challenge is often ensuring the response variable has the same distributions in both the training and test sets.
We can check the distributions of continious outcomes or proportions of classes of categorical varibales:
# continious response
ks.test(training$GCI, test$GCI)
# categorical response
churn <- modeldata::attrition
set.seed(123)
index2 <- sample(nrow(churn), round(nrow(churn)*0.8))
training2 <- churn[index, ]
test2 <- churn[-index, ]
table(training2$Attrition) %>% prop.table()
# No 0.83 Yes 0.17
table(test2$Attrition) %>% prop.table()
# No 0.84 Yes 0.16
stratified random sampling for train-test splits
The differences between train and test splits could be a problem if
- for quantiative data, the sample size is small and response is highly skewed
- for categorical response, there is huge imbalance between classes In those cases we might use stratified random sampling for the split
# stratified random sampling for categorical responses with the rsample package
set.seed(123)
split_strat <- initial_split(churn, prop = 0.7,
strata = "Attrition")
train_strat <- training(split_strat)
test_strat <- testing(split_strat)
# stratified random sampling for continious response with the cater package
library(caret)
set.seed(123)
trainIndex <- createDataPartition(ecodata$GCI, p = 0.8, list = FALSE)
training <- ecodata[trainIndex,]
test <- ecodata[-trainIndex,]