Some notes for future reference.
caret manual:
http://topepo.github.io/caret/index.html
The sampling package
strata: stratified sampling in which you specify how many samples to draw from each class
Load the Sonar data:
library(mlbench)
data(Sonar) # use the Sonar data frame that ships with mlbench as the example
dim(Sonar)
View(Sonar)
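Sampling with the strata function
library(sampling)
# stratified random sample
# sort by the stratification variable so that size follows the sorted class order (M = 77, R = 68, about 70% of each class)
Sonar <- Sonar[order(Sonar$Class), ]
subset <- strata(data = Sonar, stratanames = "Class", size = c(77, 68), method = "srswor")
# split dataset
# strata returns a data frame with four columns: "Class", "ID_unit", "Prob" and "Stratum"; "ID_unit" is the row index in the data frame passed to strata
train <- Sonar[subset$ID_unit, ]
test <- Sonar[-subset$ID_unit, ]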
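Key arguments of strata:
Argument | Meaning |
---|---|
data | The original data, as a data frame (data.frame) or a matrix |
stratanames | The stratification variable(s); in the example above the data are stratified by the Class column |
size | Number of samples to draw from each class, one value per class |
method | Sampling method: simple random sampling without replacement (srswor), simple random sampling with replacement (srswr), Poisson sampling (poisson) or systematic sampling (systematic). If method is omitted, the default is "srswor". |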
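Hard-coding the per-class sizes is easy to get wrong, so here is a minimal sketch that derives size from the class counts instead. The 70% target, the seed, and the names sonar_sorted, strata_sizes and picked are illustrative assumptions, not part of the original notes.

```r
library(mlbench)
library(sampling)

data(Sonar)

# Work on a copy sorted by the stratification variable, so that the
# size vector lines up with the sorted class order (M, then R).
sonar_sorted <- Sonar[order(Sonar$Class), ]

# Per-class row counts, then an (assumed) 70% target for each class.
class_counts <- table(sonar_sorted$Class)
strata_sizes <- as.vector(round(0.7 * class_counts))

set.seed(123)  # arbitrary seed, only so the draw is repeatable
picked <- strata(sonar_sorted, stratanames = "Class",
                 size = strata_sizes, method = "srswor")

# Rows drawn per class; these counts should equal strata_sizes.
table(picked$Class)

train <- sonar_sorted[picked$ID_unit, ]
test  <- sonar_sorted[-picked$ID_unit, ]
```

Comparing table(picked$Class) with strata_sizes is a quick sanity check that each class received the intended number of draws.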
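The caret package
caret may require a fairly recent version of R, and R may report that the recipes package needs to be loaded.
createDataPartition: stratified sampling based on a class variable
# install.packages("caret")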
library(caret)
library(recipes)
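train_index <- createDataPartition(Sonar$Class, # class variable to stratify on
times = 1, p = 0.7, # times: number of partitions to create (1 here); p: proportion of rows that go into the training set
list = FALSE) # list = FALSE: return a matrix instead of a list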
train <- Sonar[train_index,]
test <- Sonar[-train_index,]
The returned values are row indices into the original data frame.
head(train_index)
Resample1
[1,] 1
[2,] 2
[3,] 4
[4,] 5
[5,] 6
[6,] 11
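Note: the two listings below appear to have been captured on a different data set (Sonar's columns are V1 to V60 plus Class); they only illustrate that train and test hold disjoint rows of the original data frame.
head(train)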
age job marital education default balance housing loan contact day month campaign pdays previous poutcome y
1 30 unemployed married primary no 1787 no no cellular 19 oct 1 -1 0 unknown 0
2 33 services married secondary no 4789 yes yes cellular 11 may 1 339 4 failure 0
3 35 management single tertiary no 1350 yes no cellular 16 apr 1 330 1 failure 0
4 30 management married tertiary no 1476 yes yes unknown 3 jun 4 -1 0 unknown 0
7 36 self-employed married tertiary no 307 yes no cellular 14 may 1 330 2 other 0
8 39 technician married secondary no 147 yes no cellular 6 may 2 -1 0 unknown 0
head(test)
age job marital education default balance housing loan contact day month campaign pdays previous poutcome y
5 59 blue-collar married secondary no 0 yes no unknown 5 may 1 -1 0 unknown 0
6 35 management single tertiary no 747 no no cellular 23 feb 2 176 3 failure 0
13 36 technician married tertiary no 1109 no no cellular 13 aug 2 -1 0 unknown 0
18 37 admin. single tertiary no 2317 yes no cellular 20 apr 1 152 2 failure 0
20 31 services married secondary no 132 no no cellular 7 jul 1 152 1 other 0
29 56 self-employed married secondary no 784 no yes cellular 30 jul 2 -1 0 unknown 0
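To see what "stratified" means for createDataPartition, the following sketch compares the class proportions of the full data with those of the two splits. The seed is an arbitrary assumption added for repeatability; the rest reuses the calls from the notes above.

```r
library(mlbench)
library(caret)

data(Sonar)
set.seed(123)  # arbitrary seed, only so the split is repeatable

train_index <- createDataPartition(Sonar$Class, times = 1, p = 0.7,
                                   list = FALSE)
train <- Sonar[train_index, ]
test  <- Sonar[-train_index, ]

# Class proportions in the full data, the training set and the test set.
# With a stratified split, the three should be close to one another.
round(prop.table(table(Sonar$Class)), 3)
round(prop.table(table(train$Class)), 3)
round(prop.table(table(test$Class)), 3)
```

If the proportions differ noticeably, the split was probably not done on the class variable.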
createResample: draw one or more bootstrap samples
subsets <- createResample(Sonar$Class,times = 5,list = FALSE)
head(subsets,10)
Result:
> head(subsets,10)
Resample1 Resample2 Resample3 Resample4 Resample5
[1,] 2 1 6 2 1
[2,] 3 2 6 3 2
[3,] 3 6 8 4 2
[4,] 4 8 8 4 4
[5,] 4 8 8 7 5
[6,] 5 9 9 9 5
[7,] 9 10 10 10 5
[8,] 9 11 11 13 7
[9,] 9 11 12 15 8
[10,] 10 11 13 16 9
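Repeated indices such as 3, 3 or 8, 8, 8 in the output above are expected, because bootstrap samples are drawn with replacement. The sketch below builds one bootstrap sample and the rows it leaves out (the "out-of-bag" rows); the seed and the names boot_idx, boot_sample and oob are illustrative assumptions.

```r
library(mlbench)
library(caret)

data(Sonar)
set.seed(123)  # arbitrary seed, only so the draw is repeatable

subsets <- createResample(Sonar$Class, times = 5, list = FALSE)

# Each column is one bootstrap sample with as many indices as rows in Sonar.
dim(subsets)

# Use the first column: rows may repeat because sampling is with replacement.
boot_idx    <- subsets[, 1]
boot_sample <- Sonar[boot_idx, ]

# Rows never drawn in this sample can serve as a hold-out set.
oob <- Sonar[-unique(boot_idx), ]

# Number of distinct rows in the bootstrap sample versus rows left out.
length(unique(boot_idx))
nrow(oob)
```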
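createFolds: split the data into k folds
folds <- createFolds(Sonar$Class, k = 10, list = TRUE,
returnTrain = FALSE) # returnTrain = FALSE: each element holds the indices of the held-out fold (about 1/k of the rows); TRUE would return the training indices instead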
length(folds)
Result:
> length(folds)
[1] 10
You can set the argument k; the returned list has length k.
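As a rough illustration of how the folds are typically consumed, here is a sketch of a cross-validation loop. The "model" is only a majority-class baseline chosen to keep the example dependency-free; it is an assumption for illustration, not something from the original notes.

```r
library(mlbench)
library(caret)

data(Sonar)
set.seed(123)  # arbitrary seed, only so the folds are repeatable

folds <- createFolds(Sonar$Class, k = 10, list = TRUE, returnTrain = FALSE)

# For each fold: hold it out, "train" on the rest, score on the held-out rows.
acc <- sapply(folds, function(test_idx) {
  train <- Sonar[-test_idx, ]
  test  <- Sonar[test_idx, ]
  # Baseline: always predict the most frequent class of the training rows.
  majority <- names(which.max(table(train$Class)))
  mean(test$Class == majority)
})

# Per-fold accuracy and its average across the 10 folds.
acc
mean(acc)
```

Replacing the baseline with a real model only changes the body of the function; the fold handling stays the same.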
createMultiFolds: repeated k-fold splits
You can set the arguments k and times; the returned list has length k × times.
mfolds <- createMultiFolds(Sonar$Class, k = 10, times = 5)
length(mfolds)
Result:
> length(mfolds)
[1] 50
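Unlike createFolds with returnTrain = FALSE, the elements returned by createMultiFolds are training-row indices. A minimal sketch of inspecting them and handing them to trainControl through its index argument follows; the seed and the name ctrl are assumptions, and no model is fitted here.

```r
library(mlbench)
library(caret)

data(Sonar)
set.seed(123)  # arbitrary seed, only so the folds are repeatable

mfolds <- createMultiFolds(Sonar$Class, k = 10, times = 5)

# Each element names one fold of one repeat and holds training-row indices.
head(names(mfolds))

# Every training set should cover roughly 9/10 of the rows.
range(lengths(mfolds))

# Pass the pre-computed splits to trainControl so that train() would use them.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     index = mfolds)
```

Fixing the splits this way makes it possible to compare several models on exactly the same resamples.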