Data Frames in R

Latest recommended article published on 2024-07-12 10:33:38

牵牛花主人


Article link: https://blog.csdn.net/chongbaikaishi/article/details/115110009


1. Creating a data frame

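After the vector, the data frame is the most important data object in R, and it is the structure you will work with most often. When a data set mixes several data types, for example numeric and character columns, it cannot be stored in a single matrix; in that case a data frame is the best choice.
1) A data frame organizes data in a layout similar to a matrix.
2) The columns may have different data types.
3) Each column is a variable, and each row is an observation.
4) All columns must have the same length.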
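
A minimal sketch of the difference described above, using made-up values: binding a numeric and a character vector into a matrix coerces everything to one type, while a data frame keeps one type per column. stringsAsFactors = FALSE is set explicitly here so that the result does not depend on the R version.

```r
# Made-up example values, for illustration only
age    <- c(25, 34)
status <- c("Poor", "Improved")

# A matrix holds a single type, so the numbers are coerced to character
m <- cbind(age, status)
class(m[, "age"])   # "character"

# A data frame keeps one type per column
d <- data.frame(age, status, stringsAsFactors = FALSE)
sapply(d, class)    # age is "numeric", status is "character"
```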

1.1 The data.frame function

What it does:

The function data.frame() creates data frames, tightly coupled collections of variables 
which share many of the properties of matrices and of lists,
 used as the fundamental data structure by most of R's modeling software.

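In other words, data.frame() builds a data frame: a set of variables that behaves partly like a matrix and partly like a list, and that most of R's modeling functions use as their basic data structure.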

Usage (as documented in R 3.x; since R 4.0.0 the last default is stringsAsFactors = FALSE):

data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = default.stringsAsFactors())

Arguments:

...	
these arguments are of either the form value or tag = value. 
Component names are created based on the tag (if present) or the deparsed argument itself.

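That is, each argument is either a bare value or tag = value. The column name comes from the tag when there is one, and otherwise from the deparsed text of the argument itself.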

row.names	
NULL or a single integer or character string specifying a column to be used as row names,
 or a character or integer vector giving the row names for the data frame.

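Row names: NULL, a single integer or character string naming the column to use as row names, or a character or integer vector that supplies the row names directly.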

check.rows	
if TRUE then the rows are checked for consistency of length and names.

Row check: a logical value. If TRUE, the rows are checked for consistent length and names.

check.names	
logical. If TRUE then the names of the variables in the data frame are checked to ensure 
that they are syntactically valid variable names and are not duplicated. 
If necessary they are adjusted (by make.names) so that they are.

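A logical value. If TRUE, the variable names are checked to make sure they are syntactically valid and not duplicated; where necessary they are adjusted by the make.names function.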

fix.empty.names	
logical indicating if arguments which are “unnamed” get an automatically constructed name or rather name "". 
Needs to be set to FALSE even when check.names is false if "" names should be kept.

A logical value: whether unnamed arguments get an automatically constructed name or keep the empty name "".

stringsAsFactors	
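logical: should character vectors be converted to factors? The ‘factory-fresh’ default is TRUE.

Whether character vectors are converted to factors. The default of TRUE quoted above applies to R versions before 4.0.0; since R 4.0.0 the default is FALSE. The output shown in this post was produced with the old default, so character columns appear as factors.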
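
The effect of check.names and stringsAsFactors can be sketched as follows. The column name `my var` and the values are made up for illustration; only base R functions are used.

```r
# check.names = TRUE (the default): invalid or duplicated names are repaired with make.names()
d1 <- data.frame(`my var` = 1:2, `my var` = 3:4)
names(d1)    # "my.var" "my.var.1"

# check.names = FALSE keeps the names exactly as given
d2 <- data.frame(`my var` = 1:2, check.names = FALSE)
names(d2)    # "my var"

# Setting stringsAsFactors explicitly gives the same result in every R version
d3 <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
d4 <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
class(d3$x)  # "factor"
class(d4$x)  # "character"
```

Passing stringsAsFactors explicitly is a simple way to make a script behave the same before and after R 4.0.0.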

> # 1. Create a data frame
> # 1.1 Define the vectors that will become its columns
> patientID <- seq(1,4)
> age <- c(25,34,28,52)
> diabetes <- c('Type1','Type2','Type1','Type1')
> status <- c('Poor','Improved','Excellent','Poor')
> # 1.2 Build the data frame
> patientdata <- data.frame(patientID,age,diabetes,status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

Adding row names with row.names

> patientdata <- data.frame(patientID,age,diabetes,status,
+                           row.names=c('病号1','病号2','病号3','病号4'))
> patientdata
      patientID age diabetes    status
病号1         1  25    Type1      Poor
病号2         2  34    Type2  Improved
病号3         3  28    Type1 Excellent
病号4         4  52    Type1      Poor
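
Besides a vector of names, row.names also accepts the name of a single column, as the argument description says. A small sketch with a hypothetical id column (the names p1 to p3 are made up):

```r
# Hypothetical table: the 'id' column becomes the row names and is dropped from the columns
d <- data.frame(id  = c("p1", "p2", "p3"),
                age = c(25, 34, 28),
                row.names = "id",
                stringsAsFactors = FALSE)

rownames(d)      # "p1" "p2" "p3"
names(d)         # "age"

# Rows can then be selected by name
d["p2", "age"]   # 34
```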

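2. Indexing a data frame

Indexing a data frame works much like indexing a matrix: you can index by position, by row or column, or by individual element. In addition, the $ operator selects a column of a data frame by name.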
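
As a quick orientation, here is a minimal sketch of the three column-access forms on a made-up data frame. Single brackets keep the data frame, while double brackets and $ extract the column itself.

```r
# Made-up data, for illustration only
d <- data.frame(id = 1:3, age = c(25, 34, 28))

class(d["age"])     # "data.frame": single brackets keep the data frame
class(d[["age"]])   # "numeric": double brackets extract the column itself
class(d$age)        # "numeric": $ extracts the same column by name

identical(d[["age"]], d$age)   # TRUE
```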

2.1 Row and column indexing

> # 2. Indexing a data frame
> # 2.1 A single index selects columns and returns a data frame
> # df[index]; df['column name']
> ww <- patientdata[3]
> rr <- patientdata['diabetes']
> ww
  diabetes
1    Type1
2    Type2
3    Type1
4    Type1
> rr
  diabetes
1    Type1
2    Type2
3    Type1
4    Type1
> class(ww)
[1] "data.frame"
> class(rr)
[1] "data.frame"
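> # 2.2 Two indices: the first selects rows, the second selects columns
> 
> # 2.2.1 df[, column index]: returns the column as a vector (here a factor or a numeric vector)
> 
> qq <- patientdata[,1]   # column 1, by position
> ss <- patientdata[,'status']  # column by name
> ff <- patientdata$diabetes   # column by name, with $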
> qq
[1] 1 2 3 4
> ss
[1] Poor      Improved  Excellent Poor     
Levels: Excellent Improved Poor
> ff
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
> class(qq)
[1] "integer"
> class(ss)
[1] "factor"
> class(ff)
[1] "factor"
> 
> # 2.2.2 df[row index, ]: returns the row as a data frame
> 
> gg <- patientdata[1,]  # row 1
> gg
  patientID age diabetes status
1         1  25    Type1   Poor
> class(gg)
[1] "data.frame"
> 
> # 2.3 Selecting several rows
> jj <- patientdata[c(2,4),]  # rows 2 and 4
> jj
  patientID age diabetes   status
2         2  34    Type2 Improved
4         4  52    Type1     Poor
> class(jj)
[1] "data.frame"
> 
> # 2.4 Selecting several columns
> ll <- patientdata[,c(2,4)]
> bb <- patientdata[,c('age','status')]
> ll
  age    status
1  25      Poor
2  34  Improved
3  28 Excellent
4  52      Poor
> bb
  age    status
1  25      Poor
2  34  Improved
3  28 Excellent
4  52      Poor
> class(ll)
[1] "data.frame"
> class(bb)
[1] "data.frame"
> tt <- patientdata[c(1,3)]
> tt
  patientID diabetes
1         1    Type1
2         2    Type2
3         3    Type1
4         4    Type1
> class(tt)
[1] "data.frame"

2.2 Element indexing

> # 2.5 Selecting a single element
> aa <- patientdata[2,4]  # the element in row 2, column 4
> aa
[1] Improved
Levels: Excellent Improved Poor
> class(aa)
[1] "factor"
> 
> # 2.6 Selecting several elements
> mm <- patientdata[c(2,4),c(2,4)]
> nn <- patientdata[c(2,4),c('age','status')]
> mm
  age   status
2  34 Improved
4  52     Poor
> nn
  age   status
2  34 Improved
4  52     Poor
> class(mm)
[1] "data.frame"
> class(nn)
[1] "data.frame"
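
The row index does not have to be a set of positions; a logical vector with one value per row works as well. The sketch below rebuilds the patient data from section 1 (with stringsAsFactors = FALSE so that it behaves the same in every R version) and selects rows by condition. The variable names old and poor_type1 are made up for this example.

```r
patientdata <- data.frame(patientID = 1:4,
                          age       = c(25, 34, 28, 52),
                          diabetes  = c("Type1", "Type2", "Type1", "Type1"),
                          status    = c("Poor", "Improved", "Excellent", "Poor"),
                          stringsAsFactors = FALSE)

# Rows where the condition is TRUE are kept
old <- patientdata[patientdata$age > 30, ]
old$patientID          # 2 4

# Conditions can be combined, and columns selected at the same time
poor_type1 <- patientdata[patientdata$status == "Poor" &
                          patientdata$diabetes == "Type1",
                          c("patientID", "age")]
poor_type1$patientID   # 1 4
```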

2.3 Indexing with functions

2.3.1 The subset() function

What it does:

Return subsets of vectors, matrices or data frames 
which meet conditions.

That is, it returns the part of a vector, matrix or data frame that satisfies a condition.

Usage:

subset(x, subset, select, drop = FALSE, ...)

Arguments:

x	
object to be subsetted.

x: the object to take a subset of.

With only x given, the whole object is returned:

> subset(patientdata)
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

subset	
logical expression indicating elements or rows to keep: 
missing values are taken as false.

A logical expression that says which elements or rows to keep. Missing values are treated as FALSE.

For data frames, the subset argument works on the rows.
Note that subset will be evaluated in the data frame, 
so columns can be referred to (by name) as variables 
in the expression

For a data frame, the subset argument is applied to the rows, and column names can be used directly as variables in the expression.

> subset(patientdata,status=='Poor')
  patientID age diabetes status
1         1  25    Type1   Poor
4         4  52    Type1   Poor
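
The rule that missing values are taken as false is where subset() differs from plain bracket indexing. A minimal sketch with made-up scores, one of them missing:

```r
# Made-up data with one missing score
d <- data.frame(id = 1:3, score = c(10, NA, 30))

# subset() treats the NA condition as FALSE and drops that row
kept <- subset(d, score > 5)
nrow(kept)         # 2

# Bracket indexing keeps a row of NAs where the condition is NA
raw <- d[d$score > 5, ]
nrow(raw)          # 3
is.na(raw$id[2])   # TRUE

# which() drops the NA, giving the same rows as subset()
safe <- d[which(d$score > 5), ]
nrow(safe)         # 2
```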

select	
expression, indicating columns to select from a data frame.

An expression that says which columns to select from a data frame.

> subset(patientdata,status=='Poor',select=c('age','diabetes'))
  age diabetes
1  25    Type1
4  52    Type1
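
Because select is an expression, it also accepts unquoted column names, a minus sign to exclude columns, and a range of adjacent columns. A short sketch on the patient data, rebuilt here so the block is self-contained; the result names a, b and cols are made up:

```r
patientdata <- data.frame(patientID = 1:4,
                          age       = c(25, 34, 28, 52),
                          diabetes  = c("Type1", "Type2", "Type1", "Type1"),
                          status    = c("Poor", "Improved", "Excellent", "Poor"),
                          stringsAsFactors = FALSE)

# Unquoted column names
a <- subset(patientdata, age > 30, select = c(age, status))
names(a)      # "age" "status"

# A minus sign excludes a column
b <- subset(patientdata, select = -status)
names(b)      # "patientID" "age" "diabetes"

# A range of adjacent columns
cols <- subset(patientdata, select = patientID:diabetes)
names(cols)   # "patientID" "age" "diabetes"
```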

drop	
passed on to [ indexing operator.

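drop: controls whether the result may be simplified to a lower-dimensional structure. The default, FALSE, keeps the data frame; with drop = TRUE a one-row result like the one below is returned as a list.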

> a <- subset(patientdata,status=='Improved') 
> a
  patientID age diabetes   status
2         2  34    Type2 Improved
> class(a)
[1] "data.frame"
> a <- subset(patientdata,status=='Improved',drop=TRUE) 
> a
$patientID
[1] 2

$age
[1] 34

$diabetes
[1] Type2
Levels: Type1 Type2

$status
[1] Improved
Levels: Excellent Improved Poor

> class(a)
[1] "list"

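3. Editing a data frame

A data frame can be edited by hand with the edit and fix functions, and it can be extended with rbind, which adds new observations (rows), and cbind, which adds new variables (columns). Note that the argument passed to rbind must have the same width (number of columns) as the original data frame, and the argument passed to cbind must have the same height (number of rows); otherwise R raises an error. In addition, the names function returns the column names of a data frame, which can then be modified.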
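
The size rules and the use of names() can be sketched as follows. The errors are caught with tryCatch so that the block runs to the end; the variable names rbind_result and cbind_result and the new column name years are made up for this example.

```r
df <- data.frame(name = c("zhangsan", "wangwu", "liuer"),
                 age  = c(23, 45, 56),
                 stringsAsFactors = FALSE)

# rbind: a new row with the wrong number of columns is rejected
rbind_result <- tryCatch(rbind(df, data.frame(name = "zhaosi")),
                         error = function(e) "error")

# cbind: a new column with the wrong number of rows is rejected
cbind_result <- tryCatch(cbind(df, gender = c("male", "female")),
                         error = function(e) "error")

# names() reads the column names; assigning to it renames columns
names(df)              # "name" "age"
names(df)[2] <- "years"
names(df)              # "name" "years"
```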

3.1 Editing by hand with edit()

What it does:

Invoke a text editor on an R object.

Usage:

edit(name = NULL, file = "", title = NULL,
     editor = getOption("editor"), ...)

Arguments:

name	
a named object that you want to edit. 
If name is missing then the file specified by file is opened for editing.

name: the R object to edit.

file	
a string naming the file to write the edited version to.

file: the name of the file that the edited object is written to.

title	
a display name for the object being edited.

title: the name displayed for the object being edited.
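
One point that is easy to miss: edit() returns the edited copy and leaves the original untouched, so the result has to be assigned, whereas fix() edits the object and assigns the result back for you. A minimal sketch on a made-up data frame; the editors need an interactive session, so the calls are guarded and do nothing when the code is run as a script.

```r
# Made-up data, for illustration only
df <- data.frame(name = c("zhangsan", "wangwu"), age = c(23, 45))

# The data editor needs an interactive session, so the calls are guarded here
if (interactive()) {
  df_edited <- edit(df)   # returns the edited copy; df itself is unchanged
  fix(df)                 # edits df and assigns the result back to df
}
```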


3.2 Adding rows and columns

3.2.1 Adding rows

> # 3. Editing a data frame
> # 3.1 Adding rows
> name <- c('zhangsan','wangwu','liuer')
> age <- c(23,45,56)
> birth_place <- c('tianjin','wuhan','hefei')
> df <- data.frame(name,age,birth_place)
> df
      name age birth_place
1 zhangsan  23     tianjin
2   wangwu  45       wuhan
3    liuer  56       hefei
> df <- rbind(df,c('zhaosi',64,'liaoning'))
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "zhaosi") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "liaoning") :
  invalid factor level, NA generated
> df
      name age birth_place
1 zhangsan  23     tianjin
2   wangwu  45       wuhan
3    liuer  56       hefei
4     <NA>  64        <NA>
> df <- rbind(df,c('zhangsan',64,'wuhan'))
> df
      name age birth_place
1 zhangsan  23     tianjin
2   wangwu  45       wuhan
3    liuer  56       hefei
4     <NA>  64        <NA>
5 zhangsan  64       wuhan

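The warnings appear because data.frame() converts character vectors to factors by default (in R versions before 4.0.0). Once a column is a factor, a row added this way can only use the existing levels, that is, the distinct values the column already contains. For example, birth_place has the levels tianjin, wuhan and hefei, so a new value must be one of these three; anything else becomes NA. To add new values to the data frame, create it with stringsAsFactors = FALSE, so that strings are not converted to factors (the default was TRUE before R 4.0.0).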

> name <- c('zhangsan','wangwu','liuer')
> age <- c(23,45,56)
> birth_place <- c('tianjin','wuhan','hefei')
> df <- data.frame(name,age,birth_place,stringsAsFactors = FALSE)
> df
      name age birth_place
1 zhangsan  23     tianjin
2   wangwu  45       wuhan
3    liuer  56       hefei
> df <- rbind(df,c('zhaosi',64,'liaoning'))
> df
      name age birth_place
1 zhangsan  23     tianjin
2   wangwu  45       wuhan
3    liuer  56       hefei
4   zhaosi  64    liaoning
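
One caveat with adding a row through c(): a vector can hold only one type, so c('zhaosi',64,'liaoning') is a character vector, and appending it turns the age column into character even though it still prints as 64. The same applies when a whole row is overwritten with c(). A sketch of the problem and of one way around it, appending a one-row data frame instead; the names bad, new_row and good are made up for this example.

```r
df <- data.frame(name        = c("zhangsan", "wangwu", "liuer"),
                 age         = c(23, 45, 56),
                 birth_place = c("tianjin", "wuhan", "hefei"),
                 stringsAsFactors = FALSE)

# c() coerces 64 to "64", and rbind() then turns the whole age column into character
bad <- rbind(df, c("zhaosi", 64, "liaoning"))
class(bad$age)    # "character"

# A one-row data frame keeps each value's own type
new_row <- data.frame(name = "zhaosi", age = 64, birth_place = "liaoning",
                      stringsAsFactors = FALSE)
good <- rbind(df, new_row)
class(good$age)   # "numeric"
mean(good$age)    # 47
```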

3.2.2 Adding columns

> # 3.2 Adding columns
> # 3.2.1 With cbind()
> data <- cbind(df,gender=c('male','female','female','male'))
> data
      name age birth_place gender
1 zhangsan  23     tianjin   male
2   wangwu  45       wuhan female
3    liuer  56       hefei female
4   zhaosi  64    liaoning   male
> # 3.2.2 With data.frame()
> data <- data.frame(data,Occupation=c('worker','cooker','artist','teacher'))
> data
      name age birth_place gender Occupation
1 zhangsan  23     tianjin   male     worker
2   wangwu  45       wuhan female     cooker
3    liuer  56       hefei female     artist
4   zhaosi  64    liaoning   male    teacher
> # 3.2.3 By direct assignment
> data['income'] <- c(12,16,50,10)
> data
      name age birth_place gender Occupation income
1 zhangsan  23     tianjin   male     worker     12
2   wangwu  45       wuhan female     cooker     16
3    liuer  56       hefei female     artist     50
4   zhaosi  64    liaoning   male    teacher     10

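3.3 Modifying elements

3.3.1 Modifying a row

In R versions before 4.0.0, a newly added character column is still converted to a factor by default, which causes trouble when it is modified later. Here Occupation was added without stringsAsFactors = FALSE: assigning a value that is already one of its levels works, but assigning a new value such as actor triggers an invalid factor level warning and the cell becomes NA.

To avoid this, set stringsAsFactors = FALSE when adding the column, so that it is not converted to a factor.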

> # 3.2 Adding columns without factor conversion
> # 3.2.1 With cbind()
> data <- cbind(df,gender=c('male','female','female','male'),stringsAsFactors = FALSE)
> data
      name age birth_place gender
1 zhangsan  23     tianjin   male
2   wangwu  45       wuhan female
3    liuer  56       hefei female
4   zhaosi  64    liaoning   male
> # 3.2.2 With data.frame()
> data <- data.frame(data,Occupation=c('worker','cooker','artist','teacher'),
+                    stringsAsFactors = FALSE)
> data
      name age birth_place gender Occupation
1 zhangsan  23     tianjin   male     worker
2   wangwu  45       wuhan female     cooker
3    liuer  56       hefei female     artist
4   zhaosi  64    liaoning   male    teacher
> # 3.2.3 By direct assignment
> data['income'] <- c(12,16,50,10)
> data
      name age birth_place gender Occupation income
1 zhangsan  23     tianjin   male     worker     12
2   wangwu  45       wuhan female     cooker     16
3    liuer  56       hefei female     artist     50
4   zhaosi  64    liaoning   male    teacher     10

Modifying a row now works as expected:

> data[2,] <- c('wanglaoqi',58,'shenyang','female','actor',100)
> data
       name age birth_place gender Occupation income
1  zhangsan  23     tianjin   male     worker     12
2 wanglaoqi  58    shenyang female      actor    100
3     liuer  56       hefei female     artist     50
4    zhaosi  64    liaoning   male    teacher     10

3.3.2 Modifying a column

> # 4.3 Modify a column
> data[,'income'] <- c(6,50,100,10)
> data
       name age birth_place gender Occupation income
1  zhangsan  23     tianjin   male     worker      6
2 wanglaoqi  58    shenyang female      actor     50
3     liuer  56       hefei female     artist    100
4    zhaosi  64    liaoning   male    teacher     10

3.3.3 Modifying a single element

> data[2,3] <- 'jilin'
> data
       name age birth_place gender Occupation income
1  zhangsan  23     tianjin   male     worker      6
2 wanglaoqi  58       jilin female      actor     50
3     liuer  56       hefei female     artist    100
4    zhaosi  64    liaoning   male    teacher     10
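
Elements can also be modified by condition rather than by position, by putting a logical test in the row index. A minimal sketch on a made-up data frame with the same income values:

```r
# Made-up data, for illustration only
d <- data.frame(name   = c("zhangsan", "wanglaoqi", "liuer"),
                income = c(6, 50, 100),
                stringsAsFactors = FALSE)

# Change one value, located by name instead of by row number
d$income[d$name == "wanglaoqi"] <- 55

# Change every value that meets a condition
d[d$income > 60, "income"] <- 60

d$income   # 6 55 60
```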

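3.4 Deleting elements

3.4.1 Deleting a single row

> # 5. Deleting rows and columns
> # 5.1 Delete a single row (data is again the original three-row data frame)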
> data
      name age birth_place
1 zhangsan  23     tianjin
2   wangwu  45       wuhan
3    liuer  56       hefei
> data[-2,]  # drop row 2
      name age birth_place
1 zhangsan  23     tianjin
3    liuer  56       hefei

3.4.2 Deleting a single column

> # 5.2 Delete a single column
> # Option 1:
> data[-2]
      name birth_place
1 zhangsan     tianjin
2   wangwu       wuhan
3    liuer       hefei
> 
> # Option 2:
> data[,-2]
      name birth_place
1 zhangsan     tianjin
2   wangwu       wuhan
3    liuer       hefei
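
Note that negative indexing returns a modified copy and does not change the data frame itself; to keep the result it has to be assigned. The sketch below shows this, together with two other common ways to drop a column: assigning NULL and excluding by name. The names trimmed and no_place are made up for this example.

```r
data <- data.frame(name        = c("zhangsan", "wangwu", "liuer"),
                   age         = c(23, 45, 56),
                   birth_place = c("tianjin", "wuhan", "hefei"),
                   stringsAsFactors = FALSE)

data[-2, ]             # a copy without row 2 ...
nrow(data)             # 3: ... but data itself is unchanged

# Assign the result to keep it
trimmed <- data[-2, ]

# Assigning NULL removes a column in place
trimmed$age <- NULL
names(trimmed)         # "name" "birth_place"

# Dropping a column by name instead of by position
no_place <- data[, setdiff(names(data), "birth_place"), drop = FALSE]
names(no_place)        # "name" "age"
```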

3.4.3 Deleting several rows

> # 5.3 Delete several rows
> data
      name age birth_place
1 zhangsan  23     tianjin
2   wangwu  45       wuhan
3    liuer  56       hefei
> a <- data[-c(2,3),]  # drop rows 2 and 3
> a
      name age birth_place
1 zhangsan  23     tianjin
> class(a)
[1] "data.frame"
> b <- data[c(-2,-3),]    # drop rows 2 and 3
> b
      name age birth_place
1 zhangsan  23     tianjin
> class(b)
[1] "data.frame"

3.4.4 Deleting several columns

> # 5.4 Delete several columns
> data
      name age birth_place
1 zhangsan  23     tianjin
2   wangwu  45       wuhan
3    liuer  56       hefei
> # Option 1:
> c <- data[,-c(2,3)]  # drop columns 2 and 3
> c
[1] "zhangsan" "wangwu"   "liuer"   
> class(c)
[1] "character"
> # Option 2:
> d <- data[,c(-2,-3)]  # drop columns 2 and 3
> d
[1] "zhangsan" "wangwu"   "liuer"   
> class(d)
[1] "character"
> # Option 3:
> f <- data[c(-2,-3)]   # drop columns 2 and 3
> f
      name
1 zhangsan
2   wangwu
3    liuer
> class(f)
[1] "data.frame"

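When only one column is left after deleting, the first two forms change the data structure: the result is simplified to a vector. Adding the argument drop = FALSE keeps the original data frame structure.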

> # Option 1:
> c <- data[,-c(2,3),drop=FALSE]  # drop columns 2 and 3
> c
      name
1 zhangsan
2   wangwu
3    liuer
> class(c)
[1] "data.frame"
> # Option 2:
> d <- data[,c(-2,-3),drop=FALSE]  # drop columns 2 and 3
> d
      name
1 zhangsan
2   wangwu
3    liuer
> class(d)
[1] "data.frame"