caret Tutorial 03: Data Splitting

Author: 医学和生信笔记

Published 2023-09-16 16:00:35

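caret provides many functions for splitting data, for example createDataPartition() for a simple training/test split, plus functions for splitting time series data and several other approaches.

This topic is fairly simple, so this post only gives a brief introduction.

Training/test split

createDataPartition() is probably the most commonly used of these functions; it performs a simple training/test split. You can get a similar split with the base R sample() function, but when the outcome is a factor, createDataPartition() samples within each class, so the class proportions are preserved in both sets.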

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice

# Set the random seed
set.seed(3456)

# Stratified split based on the classes of the outcome variable
trainIndex <- createDataPartition(iris$Species, p = .8, # proportion assigned to training
                                  list = FALSE)
head(trainIndex)
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         3
## [4,]         5
## [5,]         6
## [6,]         7

irisTrain <- iris[ trainIndex,] # training set
irisTest  <- iris[-trainIndex,] # test set
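
To make the difference from sample() concrete, here is a minimal sketch that compares the class counts from createDataPartition() with those from a plain random split. The names randomIndex and randomTrain are made up for this illustration and do not come from caret.

```r
library(caret)

# Stratified split: sampling happens within each level of Species
set.seed(3456)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
irisTrain <- iris[trainIndex, ]
irisTest  <- iris[-trainIndex, ]

# Class counts in each partition
table(irisTrain$Species)
table(irisTest$Species)

# Plain random split with base R for comparison (no stratification)
set.seed(3456)
randomIndex <- sample(nrow(iris), size = round(0.8 * nrow(iris)))
randomTrain <- iris[randomIndex, ]
table(randomTrain$Species)
```

Since iris has 50 rows per species, the stratified split should put 40 rows of each species in the training set, while the plain random split only gets close to that on average.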

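Cross-validation/bootstrap

Besides the training/test split, createResample() and createFolds() are also used frequently.

createFolds() generates the samples for cross-validation, and createResample() generates bootstrap samples. These resampling methods are very common and were covered in detail in an earlier post: 预测建模中的重抽样方法 (Resampling Methods in Predictive Modeling)

Below is a simple demonstration.

# 5-fold cross-validation
set.seed(111)

folds <- createFolds(iris$Species,
                     k = 5, # 5 folds
                     list = TRUE
                     )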
folds
## $Fold1
##  [1]   9  11  15  18  20  28  38  43  46  47  56  60  62  65  74  76  78  79  83
## [20]  94 105 107 116 117 118 122 123 127 142 149
## 
## $Fold2
##  [1]   4  14  17  34  37  39  42  45  49  50  54  58  63  67  70  75  77  84  89
## [20]  96 103 108 111 112 125 129 135 144 145 150
## 
## $Fold3
##  [1]   8  12  24  29  31  33  35  40  44  48  64  66  71  72  73  82  85  90  91
## [20]  93 101 102 106 110 115 120 132 133 137 146
## 
## $Fold4
##  [1]   1   3  16  21  22  25  26  27  30  41  51  52  55  59  68  81  86  87  95
## [20]  98 109 114 119 121 124 131 140 143 147 148
## 
## $Fold5
##  [1]   2   5   6   7  10  13  19  23  32  36  53  57  61  69  80  88  92  97  99
## [20] 100 104 113 126 128 130 134 136 138 139 141

# bootstrap
set.seed(111)

res <- createResample(iris$Species,
                      times = 2, # number of bootstrap samples to draw
                      list = TRUE
                      )

res
## $Resample1
##   [1]   1   1   2   3   3   4   5   6   6   6   9  10  10  12  12  13  13  15
##  [19]  18  18  19  22  25  25  25  25  26  27  27  27  28  29  30  30  31  32
##  [37]  33  34  36  38  40  41  42  43  44  44  45  45  48  49  50  51  52  52
##  [55]  52  53  54  54  56  56  57  58  58  59  60  61  63  63  64  65  67  68
##  [73]  69  70  70  71  72  72  74  75  76  78  78  78  79  80  80  84  86  87
##  [91]  87  89  90  90  91  91  92  92  93  93  94  95  95  96  97 100 100 102
## [109] 103 104 106 106 106 111 112 113 113 114 115 116 117 117 118 120 120 120
## [127] 122 123 125 125 127 129 129 132 132 133 134 134 134 135 136 137 140 142
## [145] 143 144 146 147 149 149
## 
## $Resample2
##   [1]   2   3   3   5   7   8   8   8   8  10  10  10  11  11  14  16  16  16
##  [19]  17  17  18  18  21  21  21  23  23  25  25  25  26  28  29  29  30  30
##  [37]  30  30  33  34  36  38  38  39  41  41  41  42  44  44  47  47  48  48
##  [55]  48  49  49  50  50  50  51  52  53  54  54  55  55  55  55  55  56  57
##  [73]  58  59  59  60  63  65  66  66  67  69  70  70  71  72  75  76  76  78
##  [91]  78  79  81  81  83  85  85  86  87  90  91  91  92  92  93  95  97  97
## [109]  97 101 103 105 105 106 106 111 112 112 113 113 113 116 117 117 119 120
## [127] 120 125 125 126 127 128 130 130 130 132 132 133 134 135 137 140 141 142
## [145] 142 142 143 144 144 145
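
A bootstrap sample draws rows with replacement, so some rows appear several times and others not at all. The rows that were never drawn (the out-of-bag rows) are commonly used for evaluation. A minimal sketch, where allRows, oob, bootTrain and bootTest are names chosen only for this example:

```r
library(caret)

set.seed(111)
res <- createResample(iris$Species, times = 2, list = TRUE)

# Rows that were never drawn in each bootstrap sample (out-of-bag rows)
allRows <- seq_len(nrow(iris))
oob <- lapply(res, function(inBag) setdiff(allRows, inBag))

# Use the first resample for fitting and its out-of-bag rows for evaluation
bootTrain <- iris[res$Resample1, ]
bootTest  <- iris[oob$Resample1, ]

# Number of out-of-bag rows per resample
sapply(oob, length)
```

On average roughly a third of the rows end up out-of-bag, although the exact number varies from sample to sample.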

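These two functions are quite handy in some situations; I will demonstrate them when a concrete example comes up.

I rarely use the functions below; explore them on your own if you are interested.

Other functions

maxDissim() splits based on maximum dissimilarity: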

library(mlbench)
data(BostonHousing)

testing <- scale(BostonHousing[, c("age", "nox")])
set.seed(5)
## A random sample of 5 data points
startSet <- sample(1:dim(testing)[1], 5)
samplePool <- testing[-startSet,]
start <- testing[startSet,]
newSamp <- maxDissim(start, samplePool, n = 20) # select the 20 most dissimilar points
head(newSamp)
## [1] 460 142 491 156 498  82
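
Note that the values returned by maxDissim() index rows of samplePool, not rows of the original data. The sketch below repeats the steps above and then maps the result back to the original row numbers; poolRows, selected and originalRows are illustrative names, not part of caret.

```r
library(caret)
library(mlbench)
data(BostonHousing)

testing <- scale(BostonHousing[, c("age", "nox")])
set.seed(5)
startSet <- sample(1:dim(testing)[1], 5)
samplePool <- testing[-startSet, ]
start <- testing[startSet, ]
newSamp <- maxDissim(start, samplePool, n = 20)

# The selected points, taken from the pool
selected <- samplePool[newSamp, ]
dim(selected)

# Map pool positions back to row numbers in the original data
poolRows <- seq_len(nrow(testing))[-startSet]
originalRows <- poolRows[newSamp]
head(originalRows)
```
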

groupKFold() splits by group, so that all rows from the same group stay on the same side of each split:

set.seed(3527)
subjects <- sample(1:20, size = 80, replace = TRUE)
table(subjects)
## subjects
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  2  3  2  5  3  5  4  5  4  4  2  5  4  2  3  3  6  7  8  3
folds <- groupKFold(subjects, k = 15)
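
One way to check the result is to confirm that no subject appears both in the training rows and in the held-out rows of any fold. This is a minimal sketch; leaks is a name made up for the example.

```r
library(caret)

set.seed(3527)
subjects <- sample(1:20, size = 80, replace = TRUE)
folds <- groupKFold(subjects, k = 15)

# Each element of folds holds training row indices; the rest are held out.
# A "leak" would be a subject that shows up on both sides of a split.
leaks <- sapply(folds, function(trainRows) {
  heldOutSubjects <- unique(subjects[-trainRows])
  any(subjects[trainRows] %in% heldOutSubjects)
})
any(leaks)

# Number of folds actually produced
length(folds)
```

As far as I understand the caret documentation, the resulting list can be passed to the index argument of trainControl(), so that train() uses these grouped splits during resampling.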