Sampling Methods in R

Raque_S

Last modified 2022-01-31 00:03:49

Tags: R, probability theory

First published 2022-01-30 22:58:33

Permalink: https://blog.csdn.net/huierrr/article/details/122737026

Contents

The sampling package
- strata: specify the number of samples drawn from each class
The caret package

Some notes for later reference.
caret manual:

caret:
http://topepo.github.io/caret/index.html

The sampling package

strata: specify the number of samples drawn from each class

Load the Sonar data:

library(mlbench)

data(Sonar) # use the Sonar data frame that ships with mlbench as the example
dim(Sonar)
View(Sonar)

Draw the sample with the strata function

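library(sampling)

# stratified random sample
# size takes one entry per stratum, in the order the strata appear in the data.
# In Sonar the R rows come first, then M, so c(68, 77) draws 68 of the 97 R rows
# and 77 of the 111 M rows (about 70% of each); rep(77, 68) would be a vector of 68 values.
subset <- strata(data = Sonar, stratanames = 'Class', size = c(68, 77), method = "srswor")

# split dataset
# strata returns a data frame with the columns "Class", "ID_unit", "Prob" and "Stratum"; "ID_unit" is the row index in the original data frame
train <- Sonar[subset$ID_unit,]
test <- Sonar[-subset$ID_unit,]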
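
To make the idea behind the split above concrete, here is a minimal base-R sketch of stratified sampling without replacement. It does not use the sampling package and is not how strata is implemented; the built-in iris data, the seed, and the per-class sizes are all assumptions chosen for illustration.

```r
# Base-R sketch of stratified sampling without replacement (illustration only).
set.seed(42)  # assumed seed, only so the sketch is reproducible

# iris has 50 rows in each of its three Species levels
sizes <- c(setosa = 35, versicolor = 35, virginica = 35)  # hypothetical per-class sizes

# Row indices grouped by class
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Draw the requested number of row indices from each class, without replacement
picked <- unlist(lapply(names(sizes), function(k) {
  sample(rows_by_class[[k]], size = sizes[[k]], replace = FALSE)
}), use.names = FALSE)

train_demo <- iris[picked, ]
test_demo  <- iris[-picked, ]

table(train_demo$Species)  # class counts in the sampled part
table(test_demo$Species)   # class counts in the remainder
```

Naming the sizes by class, as done here, avoids depending on the order in which the classes occur in the data.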

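Key arguments of strata:

Argument	Meaning
data	the source data, as a data frame (data.frame) or matrix
stratanames	the stratification variable(s); above, the rows are grouped by the Class column
size	the number of units to draw from each stratum
method	the sampling method: simple random sampling without replacement (srswor), simple random sampling with replacement (srswr), Poisson sampling (poisson), or systematic sampling (systematic); if method is omitted, the default is "srswor".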
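
The four method names can be hard to tell apart from the table alone. The sketch below shows equal-probability versions of each in base R on a toy population of 20 units; it is an illustration under assumed values (population, sample size, seed), not the code the sampling package runs.

```r
# Equal-probability sketches of the four sampling schemes (illustration only).
set.seed(1)       # assumed seed
units <- 1:20     # toy population
n <- 5            # target sample size

# srswor: simple random sampling without replacement, no unit repeats
srswor <- sample(units, n, replace = FALSE)

# srswr: simple random sampling with replacement, units may repeat
srswr <- sample(units, n, replace = TRUE)

# systematic: random start, then every step-th unit
step <- length(units) / n
start <- sample(seq_len(step), 1)
systematic <- units[seq(start, by = step, length.out = n)]

# poisson: each unit is included independently with probability pik,
# so the realized sample size is random and only equals n on average
pik <- rep(n / length(units), length(units))
poisson <- units[runif(length(units)) < pik]
```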

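The caret package

caret may require a fairly recent version of R, and R may prompt you to load the recipes package.

createDataPartition: stratified sampling based on a class variable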

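# install.packages("caret")
library(caret)
library(recipes)

train_index <- createDataPartition(Sonar$Class, # the class variable to stratify on
               times = 1, p = 0.7, # times: number of partitions to create (1 here); p: proportion of rows that go into the training set
               list = FALSE) # list = FALSE: return a matrix instead of a list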
               
train <- Sonar[train_index,] 
test <- Sonar[-train_index,]

The result holds row indices into the original data frame.

head(train_index)
     Resample1
[1,]         1
[2,]         2
[3,]         4
[4,]         5
[5,]         6
[6,]        11

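Note: the head(train) and head(test) output below appears to come from a different data frame (its columns are not those of Sonar, which has V1 to V60 plus Class). It only illustrates that the two subsets contain disjoint rows.

head(train)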
  age           job marital education default balance housing loan  contact day month campaign pdays previous poutcome y
1  30    unemployed married   primary      no    1787      no   no cellular  19   oct        1    -1        0  unknown 0
2  33      services married secondary      no    4789     yes  yes cellular  11   may        1   339        4  failure 0
3  35    management  single  tertiary      no    1350     yes   no cellular  16   apr        1   330        1  failure 0
4  30    management married  tertiary      no    1476     yes  yes  unknown   3   jun        4    -1        0  unknown 0
7  36 self-employed married  tertiary      no     307     yes   no cellular  14   may        1   330        2    other 0
8  39    technician married secondary      no     147     yes   no cellular   6   may        2    -1        0  unknown 0

head(test)
   age           job marital education default balance housing loan  contact day month campaign pdays previous poutcome y
5   59   blue-collar married secondary      no       0     yes   no  unknown   5   may        1    -1        0  unknown 0
6   35    management  single  tertiary      no     747      no   no cellular  23   feb        2   176        3  failure 0
13  36    technician married  tertiary      no    1109      no   no cellular  13   aug        2    -1        0  unknown 0
18  37        admin.  single  tertiary      no    2317     yes   no cellular  20   apr        1   152        2  failure 0
20  31      services married secondary      no     132      no   no cellular   7   jul        1   152        1    other 0
29  56 self-employed married secondary      no     784      no  yes cellular  30   jul        2    -1        0  unknown 0
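
A quick way to confirm that a split like the one above is really stratified is to compare class proportions before and after. The helper below is a hypothetical name (class_balance is not part of caret), and the index is a base-R stand-in built on iris rather than the output of createDataPartition.

```r
# Compare class proportions in the full data, the training part and the test part.
# class_balance is a hypothetical helper, not a caret function.
class_balance <- function(y, train_idx) {
  rbind(
    all   = prop.table(table(y)),
    train = prop.table(table(y[train_idx])),
    test  = prop.table(table(y[-train_idx]))
  )
}

set.seed(2022)  # assumed seed
y <- iris$Species

# Stand-in index: 70% of the rows within each class
demo_idx <- sort(unlist(lapply(split(seq_along(y), y), function(rows) {
  sample(rows, size = ceiling(0.7 * length(rows)))
}), use.names = FALSE))

round(class_balance(y, demo_idx), 3)
```

With a real split you would call class_balance(Sonar$Class, train_index) instead; the three rows should be close to one another if the stratification worked.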

createResample: draw one or more bootstrap samples

subsets <- createResample(Sonar$Class,times = 5,list = FALSE)
head(subsets,10)

Result:

head(subsets,10)
      Resample1 Resample2 Resample3 Resample4 Resample5
 [1,]         2         1         6         2         1
 [2,]         3         2         6         3         2
 [3,]         3         6         8         4         2
 [4,]         4         8         8         4         4
 [5,]         4         8         8         7         5
 [6,]         5         9         9         9         5
 [7,]         9        10        10        10         5
 [8,]         9        11        11        13         7
 [9,]         9        11        12        15         8
[10,]        10        11        13        16         9
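
The repeated indices in the output above are what a bootstrap sample looks like: rows are drawn with replacement, so some appear several times and others not at all. A minimal base-R sketch of one such sample follows; the seed is an assumption and this is not caret's own code.

```r
# One bootstrap sample of row indices in base R (illustration only).
set.seed(123)  # assumed seed
n <- 208       # number of rows in Sonar

# Draw n indices with replacement; sorted to match the layout of the output above
boot_idx <- sort(sample(seq_len(n), size = n, replace = TRUE))

# Rows that were never drawn ("out-of-bag") can serve as a test set
oob_idx <- setdiff(seq_len(n), boot_idx)

# Share of distinct rows in the bootstrap sample, typically around 0.63
length(unique(boot_idx)) / n
```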

createFolds: split the data into k folds

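folds <- createFolds(Sonar$Class, k = 10, list = TRUE, 
          returnTrain = FALSE) # returnTrain = FALSE: each element holds the held-out row indices of one fold (about 1/k of the data); TRUE would return the training indices instead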
length(folds)

Result:

length(folds)
[1] 10

The argument k sets the number of folds; the returned list has length k.
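
To show how a list of folds is typically consumed, here is a base-R sketch that builds k folds and loops over them. Unlike createFolds it does not stratify by class, and the names folds_demo and fold_sizes as well as the seed are assumptions for illustration.

```r
# Plain (non-stratified) k-fold split in base R, plus a loop over the folds.
set.seed(7)  # assumed seed
n <- 208     # number of rows in Sonar
k <- 10

# Assign every row to one of k folds at random, with near-equal fold sizes
fold_id <- sample(rep(seq_len(k), length.out = n))
folds_demo <- split(seq_len(n), fold_id)

fold_sizes <- integer(k)
for (i in seq_len(k)) {
  test_rows  <- folds_demo[[i]]                # held-out rows of fold i
  train_rows <- setdiff(seq_len(n), test_rows) # all remaining rows
  # a model would be fitted on train_rows and evaluated on test_rows here
  fold_sizes[i] <- length(test_rows)
}

fold_sizes  # every fold holds 20 or 21 rows
```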

createMultiFolds

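The arguments k and times set the number of folds and the number of repetitions; the returned list has length k × times, and each element holds the training row indices for one fold of one repetition.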

mfolds <- createMultiFolds(Sonar$Class, k = 10, times = 5)
length(mfolds)
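
The k × times structure can be reproduced in base R as a sanity check. The sketch below repeats a plain k-fold split several times and stores the training indices of each fold; it is not stratified, the element names are a hypothetical labelling scheme, and the seed is assumed.

```r
# Repeated k-fold in base R: times repetitions of a k-fold split (illustration only).
set.seed(99)  # assumed seed
n <- 208      # number of rows in Sonar
k <- 10
times <- 5

mfolds_demo <- list()
for (r in seq_len(times)) {
  # a fresh random assignment of rows to folds for each repetition
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    # store the training rows, i.e. everything outside fold f
    mfolds_demo[[sprintf("Fold%02d.Rep%d", f, r)]] <- which(fold_id != f)
  }
}

length(mfolds_demo)  # 50, i.e. k * times
```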