R: Randomly Splitting a Dataset into Equal Parts for K-Fold Cross-Validation

Mr_Fengyy

Article link: https://blog.csdn.net/weixin_41030360/article/details/80891737

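This post shows how to use R to randomly split a dataset into equal parts within each level of a factor, as preparation for K-fold cross-validation. After reading 《复杂数据统计方法》 (Statistical Methods for Complex Data), the author wrote a custom function that simplifies dividing a dataset into n parts. The example uses the iris dataset, grouped by the Species factor, and divides each group into five folds.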

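Today, while reading Professor Wu Xizhi's book 《复杂数据统计方法》 (Statistical Methods for Complex Data), I ran into the problem of splitting a dataset into subsets by the levels of a factor and then randomly dividing each subset into n equal parts. Professor Wu's approach is easy enough to follow, but I still found it a bit tedious, so I wrote my own function. From now on, this kind of task only takes a single function call.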

Here we use the iris dataset that ships with R:

> str(iris)
'data.frame':	150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

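The structure of iris is shown above. Species is a factor with three levels, so the data can be split into three subsets by Species. To run five-fold cross-validation within each subset, each subset has to be divided into five equal parts. The R code is as follows: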

fiveDivide <- function(col, data, n = 5)
{
  # col:  name (a string) of a factor column in data; rows are grouped by its levels
  # data: a data.frame
  # n:    number of parts to divide each group into (default 5)
  # Returns a list with one element per factor level; each element is
  # itself a list of n data.frames (the parts for that level).
  # Approach: sample(x) returns a random permutation of 1:x; the shuffled
  # rows are then cut into n consecutive blocks.
  lvls <- levels(data[[col]])
  group_num <- length(lvls)
  lst1 <- list()  # the original data split by factor level (group_num subsets)
  lst2 <- list()  # the n parts of the current subset
  lst3 <- list()  # the result: the parts of every subset
  for (i in 1:group_num)
  {
    # first split the original data into one subset per factor level
    lst1[[i]] <- data[data[[col]] == lvls[i], ]
  }
  for (k in 1:group_num)  # randomly divide each subset into n parts, using sample()
  {
    od <- sample(nrow(lst1[[k]]))
    newdata <- lst1[[k]][od, ]
    len <- length(od)
    if (len < n) stop("each group needs at least n rows")
    cutpoint <- floor(len / n)
    for (j in 1:n)
    {
      if (j < n)
      {
        lst2[[j]] <- newdata[(cutpoint * (j - 1) + 1):(cutpoint * j), ]
      }
      else
      {
        # the last part takes all remaining rows, so no row is dropped
        # when len is not a multiple of n
        lst2[[j]] <- newdata[(cutpoint * (j - 1) + 1):len, ]
      }
    }
    lst3[[k]] <- lst2
  }
  return(lst3)
}
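
The returned list is nested by species first and by part second, so to actually run cross-validation the pieces still have to be stacked into training and test sets. Below is a minimal sketch of that step; it is not taken from the book. To stay self-contained it builds the folds with a hypothetical helper named stratified_folds, which returns the same nested shape as fiveDivide but spreads any leftover rows across the folds rather than placing them all in the last one. The call to set.seed(1) is only there to make the sketch reproducible.

```r
# Hypothetical helper (not from the original post): returns one list of
# n data.frames per factor level, the same nested shape as fiveDivide().
stratified_folds <- function(col, data, n = 5) {
  lapply(split(data, data[[col]]), function(grp) {
    # shuffle the rows of this group
    shuffled <- grp[sample(nrow(grp)), , drop = FALSE]
    # assign fold ids 1..n cyclically, so fold sizes differ by at most one
    fold_id <- rep(seq_len(n), length.out = nrow(shuffled))
    unname(split(shuffled, fold_id))
  })
}

set.seed(1)  # assumption: fixed seed, only for reproducibility
folds <- stratified_folds("Species", iris, 5)

# For fold j, the test set stacks the j-th part of every species;
# the training set stacks all the other parts.
for (j in 1:5) {
  test  <- do.call(rbind, lapply(folds, function(g) g[[j]]))
  train <- do.call(rbind, lapply(folds, function(g) do.call(rbind, g[-j])))
  cat("fold", j, ": train", nrow(train), "rows, test", nrow(test), "rows\n")
}
```

Because every species contributes one part to each test set, the class proportions in the training and test sets match those of the full data. For iris, each test set has 30 rows and each training set has 120.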

Applying it to iris:

> rep=fiveDivide("Species",iris,5)
> str(rep)
List of 3
 $ :List of 5
  ..$ :'data.frame':	10 obs. of  5 variables:
  .. ..$ Sepal.Length: num [1:10] 4.8 5.2 4.8 4.7 5.5 5.1 4.8 4.4 4.8 4.9
  .. ..$ Sepal.Width : num [1:10] 3 3.5 3.4 3.2 3.5 3.7 3.1 3 3.4 3
  .. ..$ Petal.Length: num [1:10] 1.4 1.5 1.6 1.6 1.3 1.5 1.6 1.3 1.9 1.4
  .. ..$ Petal.Width : num [1:10] 0.3 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2
  .. ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1
  ..$ :'data.frame':	10 obs. of  5 variables:
  .. ..$ Sepal.Length: num [1:10] 5 4.7 4.8 5.2 5.1 5.1 4.9 5.4 5 5.5
  .. ..$ Sepal.Width : num [1:10] 3.5 3.2 3 3.4 3.5 3.8 3.1 3.4 3.5 4.2
  .. ..$ Petal.Length: num [1:10] 1.3 1.3 1.4 1.4 1.4 1.5 1.5 1.7 1.6 1.4
  .. ..$ Petal.Width : num [1:10] 0.3 0.2 0.1 0.2 0.2 0.3 0.1 0.2 0.6 0.2
  .. ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1
  ..$ :'data.frame':	10 obs. of  5 variables:
  .. ..$ Sepal.Length: num [1:10] 5.4 4.3 4.9 5.4 4.4 4.6 5.1 5 5.1 5.1
  .. ..$ Sepal.Width : num [1:10] 3.9 3 3.6 3.9 3.2 3.6 3.4 3.4 3.8 3.8
  .. ..$ Petal.Length: num [1:10] 1.3 1.1 1.4 1.7 1.3 1 1.5 1.6 1.9 1.6
  .. ..$ Petal.Width : num [1:10] 0.4 0.1 0.1 0.4 0.2 0.2 0.2 0.4 0.4 0.2
  .. ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1
  ..$ :'data.frame':	10 obs. of  5 variables:
  .. ..$ Sepal.Length: num [1:10] 4.4 4.5 5.3 5 5 5.1 5.4 5.2 5.1 5.4
  .. ..$ Sepal.Width : num [1:10] 2.9 2.3 3.7 3.3 3.4 3.3 3.7 4.1 3.5 3.4
  .. ..$ P