caret Tutorial 05: Handling Class-Imbalanced Data

Class imbalance in the outcome variable is common in classification problems: one class is far more frequent than the other, and this can strongly distort model results.

The main approaches to handling class imbalance are:

  • sub-sampling
    • down-sampling (under-sampling): if a two-class outcome is 80% class A and 20% class B, randomly keep only enough of A to match B (20%) and discard the rest
    • up-sampling (over-sampling): with the same 80% A / 20% B split, sample B with replacement until it matches the size of A
  • SMOTE and ROSE: simultaneously augment the minority class and reduce the majority class; these require the DMwR and ROSE packages.

In addition, some algorithms have built-in weighting options that let you assign a higher weight to the minority class, which can also mitigate the impact of class imbalance.
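As a hypothetical sketch of this weighting idea (not used elsewhere in this tutorial): caret's `train()` accepts a `weights` argument for models that support case weights, such as logistic regression. The data frame `df` and its factor outcome `Class` below are assumptions for illustration.

```r
library(caret)
# Hypothetical sketch: inverse-frequency case weights, so the minority
# class contributes as much total weight as the majority class.
# Assumes a data frame `df` with a two-level factor outcome `Class`.
lv <- levels(df$Class)
w <- ifelse(df$Class == lv[2],
            nrow(df) / (2 * sum(df$Class == lv[2])),
            nrow(df) / (2 * sum(df$Class == lv[1])))
fit <- train(Class ~ ., data = df,
             method = "glm", family = binomial(),
             weights = w)
```

With these weights, misclassifying a minority-class sample costs the model proportionally more during fitting.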

For detailed explanations of these methods, readers can consult other sources; here we focus on how to implement them in caret.

Demonstrating the four methods

We simulate a dataset: the training and test sets each contain 10,000 samples, and the minority class makes up only about 5.9%. The data are generated as follows:

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice

set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test  <- twoClassSim(10000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
## 
## Class1 Class2 
##   9411    589

As you can see, the two classes are extremely unbalanced!

Below we demonstrate sub-sampling in caret.

First, down-sampling:

set.seed(9560)
down_train <- downSample(x = imbal_train[, -ncol(imbal_train)], # predictor variables
                         y = imbal_train$Class) # outcome variable
table(down_train$Class)   
## 
## Class1 Class2 
##    589    589
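Conceptually, `downSample()` simply keeps every minority-class row and draws an equally sized random sample from the majority class; a manual sketch of the same idea:

```r
# Manual equivalent of downSample(): keep all minority-class rows and
# sample the majority class down to the same size, without replacement.
set.seed(9560)
minority <- imbal_train[imbal_train$Class == "Class2", ]
majority <- imbal_train[imbal_train$Class == "Class1", ]
down_manual <- rbind(minority,
                     majority[sample(nrow(majority), nrow(minority)), ])
table(down_manual$Class)
```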

Next, up-sampling:

set.seed(9560)
up_train <- upSample(x = imbal_train[, -ncol(imbal_train)],
                     y = imbal_train$Class)                         
table(up_train$Class) 
## 
## Class1 Class2 
##   9411   9411
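Likewise, `upSample()` amounts to sampling the minority class with replacement until it matches the majority-class count; a manual sketch:

```r
# Manual equivalent of upSample(): sample the minority class WITH
# replacement until it reaches the majority-class count.
set.seed(9560)
minority <- imbal_train[imbal_train$Class == "Class2", ]
majority <- imbal_train[imbal_train$Class == "Class1", ]
up_manual <- rbind(majority,
                   minority[sample(nrow(minority), nrow(majority),
                                   replace = TRUE), ])
table(up_manual$Class)
```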

Next, SMOTE:

The DMwR package can no longer be installed the usual way (it has been removed from CRAN); you need to install it from a local source file or from GitHub. If you are not sure how, see my video tutorial on the four essential ways to install R packages.

library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

set.seed(9560)
smote_train <- SMOTE(Class ~ ., data  = imbal_train)                         
table(smote_train$Class) 
## 
## Class1 Class2 
##   2356   1767
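The counts above follow from DMwR's default parameters (assuming the documented defaults `perc.over = 200`, `perc.under = 200`): 589 minority rows plus 2 × 589 synthetic rows gives 1767, and 2 majority rows are kept per synthetic row, giving 2356. The balance can be tuned, e.g.:

```r
# perc.over controls how many synthetic minority rows are created;
# perc.under controls how many majority rows are kept per synthetic row.
# Here 100 / 200 would yield an exact 1:1 class balance.
set.seed(9560)
smote_1to1 <- SMOTE(Class ~ ., data = imbal_train,
                    perc.over = 100, perc.under = 200)
table(smote_1to1$Class)
```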

Finally, ROSE:

library(ROSE)
## Loaded ROSE 0.0-4

set.seed(9560)
rose_train <- ROSE(Class ~ ., data  = imbal_train)$data                         
table(rose_train$Class) 
## 
## Class1 Class2 
##   4939   5061
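ROSE draws a fully synthetic sample: by default it generates about `nrow(data)` rows with probability `p = 0.5` of the minority class, which is why the two classes come out roughly 50/50 above. Both can be set explicitly (a sketch, assuming ROSE's `N` and `p` arguments):

```r
# Generate a smaller synthetic dataset with an explicit class mix:
# N rows in total, each minority with probability p.
set.seed(9560)
rose_small <- ROSE(Class ~ ., data = imbal_train, N = 2000, p = 0.5)$data
table(rose_small$Class)
```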

We now evaluate the datasets balanced by each method using bagged trees (treebag), with repeated 10-fold cross-validation as the resampling method and AUC as the evaluation metric.

# First, speed things up with parallel processing
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
cl <- makePSOCKcluster(16)
registerDoParallel(cl)

# Then train the models
# choose the resampling method and evaluation metric
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# original training set
set.seed(5627)
orig_fit <- train(Class ~ ., data = imbal_train, 
                  method = "treebag",
                  nbagg = 50,
                  metric = "ROC",
                  trControl = ctrl)

# training set processed with down-sampling
set.seed(5627)
down_outside <- train(Class ~ ., data = down_train, 
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)

# training set processed with up-sampling
set.seed(5627)
up_outside <- train(Class ~ ., data = up_train, 
                    method = "treebag",
                    nbagg = 50,
                    metric = "ROC",
                    trControl = ctrl)

# training set processed with ROSE
set.seed(5627)
rose_outside <- train(Class ~ ., data = rose_train, 
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)

# training set processed with SMOTE
set.seed(5627)
smote_outside <- train(Class ~ ., data = smote_train, 
                       method = "treebag",
                       nbagg = 50,
                       metric = "ROC",
                       trControl = ctrl)

stopCluster(cl)

Check the performance of the different models:

outside_models <- list(original = orig_fit,
                       down = down_outside,
                       up = up_outside,
                       SMOTE = smote_outside,
                       ROSE = rose_outside)

# extract model performance on the training resamples
outside_resampling <- resamples(outside_models)
summary(outside_resampling, metric = "ROC")
## 
## Call:
## summary.resamples(object = outside_resampling, metric = "ROC")
## 
## Models: original, down, up, SMOTE, ROSE 
## Number of resamples: 50 
## 
## ROC 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## original 0.9098237 0.9298348 0.9386021 0.9394130 0.9493394 0.9685873    0
## down     0.9095558 0.9282175 0.9453907 0.9438384 0.9596021 0.9836254    0
## up       0.9989350 0.9999980 1.0000000 0.9998402 1.0000000 1.0000000    0
## SMOTE    0.9697171 0.9782214 0.9834234 0.9817476 0.9857071 0.9928255    0
## ROSE     0.8782985 0.8941488 0.8980313 0.8993135 0.9056404 0.9203092    0

On the training resamples, up-sampling looks best and ROSE looks worst.

Next we define a function that applies each trained model to the test set and computes the AUC:

# a function that computes test-set ROC AUC with a confidence interval
test_roc <- function(model, data) {
  library(pROC)
  roc_obj <- roc(data$Class, 
                 predict(model, data, type = "prob")[, "Class1"],
                 levels = c("Class2", "Class1"))
  ci(roc_obj)
  }

outside_test <- lapply(outside_models, test_roc, data = imbal_test)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
outside_test <- lapply(outside_test, as.vector)
outside_test <- do.call("rbind", outside_test)
colnames(outside_test) <- c("lower", "ROC", "upper")
outside_test <- as.data.frame(outside_test)

outside_test
##              lower       ROC     upper
## original 0.9130010 0.9247957 0.9365905
## down     0.9286964 0.9368361 0.9449758
## up       0.9244128 0.9338499 0.9432869
## SMOTE    0.9429536 0.9490585 0.9551634
## ROSE     0.9383809 0.9459729 0.9535649

On the test set, however, the ranking looks completely different.

Using sub-sampling during resampling

Above, we first split the data into training and test sets, applied each rebalancing method to the training set once, trained the models, and then checked their performance on the test set.

Now, instead, the rebalancing is applied inside every resampling iteration.

library(doParallel)
cl <- makePSOCKcluster(16)
registerDoParallel(cl)


ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     ## new option here:
                     sampling = "down")

set.seed(5627)
down_inside <- train(Class ~ ., data = imbal_train,
                     method = "treebag",
                     nbagg = 50,
                     metric = "ROC",
                     trControl = ctrl)

## now just change that option
ctrl$sampling <- "up"

set.seed(5627)
up_inside <- train(Class ~ ., data = imbal_train,
                   method = "treebag",
                   nbagg = 50,
                   metric = "ROC",
                   trControl = ctrl)

ctrl$sampling <- "rose"

set.seed(5627)
rose_inside <- train(Class ~ ., data = imbal_train,
                     method = "treebag",
                     nbagg = 50,
                     metric = "ROC",
                     trControl = ctrl)

ctrl$sampling <- "smote"

set.seed(5627)
smote_inside <- train(Class ~ ., data = imbal_train,
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)
## Loading required package: recipes
## Loading required package: dplyr


stopCluster(cl)

Check the results on the training resamples:

inside_models <- list(original = orig_fit,
                      down = down_inside,
                      up = up_inside,
                      SMOTE = smote_inside,
                      ROSE = rose_inside)

inside_resampling <- resamples(inside_models)

summary(inside_resampling, metric = "ROC")
## 
## Call:
## summary.resamples(object = inside_resampling, metric = "ROC")
## 
## Models: original, down, up, SMOTE, ROSE 
## Number of resamples: 50 
## 
## ROC 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## original 0.9098237 0.9298348 0.9386021 0.9394130 0.9493394 0.9685873    0
## down     0.9140294 0.9381766 0.9453610 0.9438490 0.9492917 0.9684522    0
## up       0.8887678 0.9308075 0.9393226 0.9392084 0.9517913 0.9679569    0
## SMOTE    0.9262054 0.9426785 0.9519728 0.9507166 0.9591469 0.9735586    0
## ROSE     0.9305013 0.9442821 0.9489859 0.9511117 0.9572416 0.9756750    0
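To compare these resampled distributions visually, objects returned by `resamples()` support lattice plots; a quick sketch:

```r
# Side-by-side boxplots of the resampled ROC values for each model;
# bwplot() is caret's lattice method for resamples objects.
bwplot(inside_resampling, metric = "ROC")
```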

Apply the trained models to the test set and check the results:

inside_test <- lapply(inside_models, test_roc, data = imbal_test)
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
inside_test <- lapply(inside_test, as.vector)
inside_test <- do.call("rbind", inside_test)
colnames(inside_test) <- c("lower", "ROC", "upper")
inside_test <- as.data.frame(inside_test)
inside_test
##              lower       ROC     upper
## original 0.9130010 0.9247957 0.9365905
## down     0.9354534 0.9419704 0.9484875
## up       0.9353945 0.9431074 0.9508202
## SMOTE    0.9419165 0.9486050 0.9552936
## ROSE     0.9369170 0.9448367 0.9527563

Although we covered several methods for handling class-imbalanced data in detail today, in practice each of them takes just one line of code, or is simply another preprocessing step; what matters is understanding the ideas behind them.
