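1. Creating data frames
After vectors, the data frame is the most important data object in R, and it is the data structure R users work with most often. When a data set mixes several data types, as in the figure, it cannot be stored in a single matrix; a data frame is the best choice in that case.
1) A data frame organizes data in a structure similar to a matrix.
2) Different columns may have different data types.
3) Each column is a variable and each row is an observation.
4) All columns must have the same length.
1.1 The data.frame() function
Purpose: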
The function data.frame() creates data frames, tightly coupled collections of variables
which share many of the properties of matrices and of lists,
used as the fundamental data structure by most of R's modeling software.
Usage:
data.frame(..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, fix.empty.names = TRUE,
stringsAsFactors = default.stringsAsFactors())
Arguments:
...
these arguments are of either the form value or tag = value.
Component names are created based on the tag (if present) or the deparsed argument itself.
row.names
NULL or a single integer or character string specifying a column to be used as row names,
or a character or integer vector giving the row names for the data frame.
check.rows
if TRUE then the rows are checked for consistency of length and names.
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure
that they are syntactically valid variable names and are not duplicated.
If necessary they are adjusted (by the make.names function) so that they are.
fix.empty.names
logical indicating if arguments which are "unnamed" get an automatically constructed name or rather name "".
Needs to be set to FALSE even when check.names is false if "" names should be kept.
stringsAsFactors
logical: should character vectors be converted to factors? The 'factory-fresh' default is TRUE.
Note: this describes R before 4.0.0. Since R 4.0.0 the default is FALSE. The transcripts in this article were produced with the older default, which is why character columns appear as factors.
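The effect of check.names and row.names described above can be seen in a small sketch. The column names and values here are hypothetical and chosen only to trigger name checking; they are not part of the patient example.

```r
# Hypothetical columns with a space in the name and a duplicate name.
checked   <- data.frame(`patient id` = 1:2, `patient id` = 3:4)
unchecked <- data.frame(`patient id` = 1:2, `patient id` = 3:4,
                        check.names = FALSE)

names(checked)    # names repaired by make.names(): valid and unique
names(unchecked)  # names kept exactly as given

# A single string for row.names picks an existing column as the row names.
keyed <- data.frame(id = c("a", "b"), x = 1:2, row.names = "id")
rownames(keyed)
names(keyed)      # the id column is no longer a regular column
```

With check.names = TRUE the names can be used directly after $, which is usually what you want; switching it off is mainly useful when the original labels must be preserved for display.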
> # 1. Create a data frame
> # 1.1 Define the vectors that make up the data frame
> patientID <- seq(1,4)
> age <- c(25,34,28,52)
> diabetes <- c('Type1','Type2','Type1','Type1')
> status <- c('Poor','Improved','Excellent','Poor')
> # 1.2 Create the data frame
> patientdata <- data.frame(patientID,age,diabetes,status)
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
Adding row names with row.names
> patientdata <- data.frame(patientID,age,diabetes,status,
+ row.names=c('病号1','病号2','病号3','病号4'))
> patientdata
patientID age diabetes status
病号1 1 25 Type1 Poor
病号2 2 34 Type2 Improved
病号3 3 28 Type1 Excellent
病号4 4 52 Type1 Poor
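Because the default of stringsAsFactors differs between R versions, it may be safer to set it explicitly. The following minimal sketch reuses the diabetes vector from the example above and shows both settings; the object names as_chr and as_fct are illustrative only.

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")

# Keep the strings as plain character data.
as_chr <- data.frame(diabetes, stringsAsFactors = FALSE)
# Convert the strings to a factor (the default before R 4.0.0).
as_fct <- data.frame(diabetes, stringsAsFactors = TRUE)

class(as_chr$diabetes)   # "character"
class(as_fct$diabetes)   # "factor"
levels(as_fct$diabetes)  # "Type1" "Type2"
```

Setting the argument explicitly makes a script behave the same way on old and new R installations.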
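2. Indexing data frames
Indexing a data frame is similar to indexing a matrix: you can index by position, by row or column, or by individual element. In addition, the $ operator selects a column of a data frame by name.
2.1 Row and column indexing
> # 2. Indexing data frames
> # 2.1 One index only: selects the given column and returns a data frame
> # df[index]; df['column name']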
> ww <- patientdata[3]
> rr <- patientdata['diabetes']
> ww
diabetes
1 Type1
2 Type2
3 Type1
4 Type1
> rr
diabetes
1 Type1
2 Type2
3 Type1
4 Type1
> class(ww)
[1] "data.frame"
> class(rr)
[1] "data.frame"
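> # 2.2 Two indices: the first is the row index, the second the column index
>
> # 2.2.1 df[, column]: returns the column as a vector (here a factor or an integer vector)
>
> qq <- patientdata[,1] # column 1 by position
> ss <- patientdata[,'status'] # by column name
> ff <- patientdata$diabetes # by column name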
> qq
[1] 1 2 3 4
> ss
[1] Poor Improved Excellent Poor
Levels: Excellent Improved Poor
> ff
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
> class(qq)
[1] "integer"
> class(ss)
[1] "factor"
> class(ff)
[1] "factor"
>
> # 2.2.2 df[row, ]: returns the given row as a data frame
>
> gg <- patientdata[1,] # row 1
> gg
patientID age diabetes status
1 1 25 Type1 Poor
> class(gg)
[1] "data.frame"
>
> # 2.3 Selecting several rows
> jj <- patientdata[c(2,4),] # rows 2 and 4
> jj
patientID age diabetes status
2 2 34 Type2 Improved
4 4 52 Type1 Poor
> class(jj)
[1] "data.frame"
>
> # 2.4 Selecting several columns
> ll <- patientdata[,c(2,4)]
> bb <- patientdata[,c('age','status')]
> ll
age status
1 25 Poor
2 34 Improved
3 28 Excellent
4 52 Poor
> bb
age status
1 25 Poor
2 34 Improved
3 28 Excellent
4 52 Poor
> class(ll)
[1] "data.frame"
> class(bb)
[1] "data.frame"
> tt <- patientdata[c(1,3)]
> tt
patientID diabetes
1 1 Type1
2 2 Type2
3 3 Type1
4 4 Type1
> class(tt)
[1] "data.frame"
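The difference between the one-index and two-index forms above can be checked directly. This is a minimal sketch that rebuilds a reduced patientdata with stringsAsFactors = FALSE (an assumption, so that the result does not depend on the R version); it also uses the [[ ]] operator, which the original text does not cover.

```r
patientdata <- data.frame(
  patientID = 1:4,
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  stringsAsFactors = FALSE
)

one_col_df  <- patientdata["diabetes"]     # one index: still a data frame
one_col_vec <- patientdata[, "diabetes"]   # two indices: simplified to a vector

is.data.frame(one_col_df)
is.character(one_col_vec)

# $ and [[ ]] return the same vector as the two-index form.
identical(one_col_vec, patientdata$diabetes)
identical(one_col_vec, patientdata[["diabetes"]])
```

All four checks return TRUE. Choose the one-index form when later code expects a data frame, and $, [[ ]] or the two-index form when it expects a plain vector.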
2.2 Element indexing
> # 2.5 Selecting a single element
> aa <- patientdata[2,4] # the element in row 2, column 4
> aa
[1] Improved
Levels: Excellent Improved Poor
> class(aa)
[1] "factor"
>
> # 2.6 Selecting several elements
> mm <- patientdata[c(2,4),c(2,4)]
> nn <- patientdata[c(2,4),c('age','status')]
> mm
age status
2 34 Improved
4 52 Poor
> nn
age status
2 34 Improved
4 52 Poor
> class(mm)
[1] "data.frame"
> class(nn)
[1] "data.frame"
2.3 Indexing with functions
2.3.1 The subset() function
Purpose:
Return subsets of vectors, matrices or data frames
which meet conditions.
Usage:
subset(x, subset, select, drop = FALSE, ...)
Arguments:
x
object to be subsetted.
With no condition, the whole data frame is returned:
> subset(patientdata)
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
subset
logical expression indicating elements or rows to keep:
missing values are taken as false.
For data frames, the subset argument works on the rows.
Note that subset will be evaluated in the data frame,
so columns can be referred to (by name) as variables
in the expression.
> subset(patientdata,status=='Poor')
patientID age diabetes status
1 1 25 Type1 Poor
4 4 52 Type1 Poor
select
expression, indicating columns to select from a data frame.
> subset(patientdata,status=='Poor',select=c('age','diabetes'))
age diabetes
1 25 Type1
4 52 Type1
drop
passed on to [ indexing operator.
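With the default drop = FALSE the result keeps its data frame structure. With drop = TRUE a result that has only one row is simplified to a list: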
> a <- subset(patientdata,status=='Improved')
> a
patientID age diabetes status
2 2 34 Type2 Improved
> class(a)
[1] "data.frame"
> a <- subset(patientdata,status=='Improved',drop=T)
> a
$patientID
[1] 2
$age
[1] 34
$diabetes
[1] Type2
Levels: Type1 Type2
$status
[1] Improved
Levels: Excellent Improved Poor
> class(a)
[1] "list"
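The statement that "missing values are taken as false" is where subset() differs from plain bracket indexing. The sketch below uses a hypothetical scores data frame (not part of the original example) to show the difference.

```r
# Hypothetical data with one missing score.
scores <- data.frame(id = 1:4, score = c(90, NA, 55, 72))

# subset(): the NA comparison counts as FALSE, so that row is dropped.
# Column names in select can be written without quotes.
kept_subset <- subset(scores, score > 60, select = c(id, score))

# Bracket indexing: the NA comparison produces an extra row filled with NA.
kept_bracket <- scores[scores$score > 60, ]

# which() discards NA, so this matches the subset() result.
kept_which <- scores[which(scores$score > 60), ]

nrow(kept_subset)   # 2
nrow(kept_bracket)  # 3
nrow(kept_which)    # 2
```

subset() is convenient interactively; inside functions, bracket indexing with which() is a common way to get the same rows without relying on evaluation inside the data frame.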
3. Editing data frames
A data frame can be modified by hand with the edit and fix functions, and new observations or new variables can be added with rbind and cbind respectively. Note that the argument passed to rbind must have the same width (number of columns) as the original data frame, and the argument passed to cbind must have the same height (number of rows); otherwise R reports an error. In addition, the names function reads the column names of a data frame so that they can be modified.
3.1 Editing by hand with edit()
Purpose:
Invoke a text editor on an R object.
Usage:
edit(name = NULL, file = "", title = NULL,
editor = getOption("editor"), ...)
Arguments:
name
a named object that you want to edit.
If name is missing then the file specified by file is opened for editing.
file
a string naming the file to write the edited version to.
title
a display name for the object being edited.
3.2 Adding rows and columns
3.2.1 Adding rows
> # 3. Editing data frames
> # 3.1 Adding rows
> name <- c('zhangsan','wangwu','liuer')
> age <- c(23,45,56)
> birth_place <- c('tianjin','wuhan','hefei')
> df <- data.frame(name,age,birth_place)
> df
name age birth_place
1 zhangsan 23 tianjin
2 wangwu 45 wuhan
3 liuer 56 hefei
> df <- rbind(df,c('zhaosi',64,'liaoning'))
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "zhaosi") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "liaoning") :
invalid factor level, NA generated
> df
name age birth_place
1 zhangsan 23 tianjin
2 wangwu 45 wuhan
3 liuer 56 hefei
4 <NA> 64 <NA>
> df <- rbind(df,c('zhangsan',64,'wuhan'))
> df
name age birth_place
1 zhangsan 23 tianjin
2 wangwu 45 wuhan
3 liuer 56 hefei
4 <NA> 64 <NA>
5 zhangsan 64 wuhan
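The warnings appear because data.frame (by default before R 4.0.0) converts character vectors to factors. Once a column is a factor, a new row can only use values that are already among its levels, that is, the distinct values the column already contains. For example, birth_place has the levels tianjin, wuhan and hefei, so a new row must use one of these three values; any other value becomes NA. To add rows with new values, create the data frame with stringsAsFactors=FALSE so that character vectors are not converted to factors (the default was TRUE before R 4.0.0).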
> name <- c('zhangsan','wangwu','liuer')
> age <- c(23,45,56)
> birth_place <- c('tianjin','wuhan','hefei')
> df <- data.frame(name,age,birth_place,stringsAsFactors = F)
> df
name age birth_place
1 zhangsan 23 tianjin
2 wangwu 45 wuhan
3 liuer 56 hefei
> df <- rbind(df,c('zhaosi',64,'liaoning'))
> df
name age birth_place
1 zhangsan 23 tianjin
2 wangwu 45 wuhan
3 liuer 56 hefei
4 zhaosi 64 liaoning
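One detail the printed output above hides: c('zhaosi',64,'liaoning') is a character vector, so binding it as a row turns the numeric age column into character. A minimal sketch of an alternative, assuming you want age to stay numeric, is to bind a one-row data frame instead. The object names from_vector and from_frame are illustrative.

```r
df <- data.frame(
  name        = c("zhangsan", "wangwu", "liuer"),
  age         = c(23, 45, 56),
  birth_place = c("tianjin", "wuhan", "hefei"),
  stringsAsFactors = FALSE
)

# Binding a character vector coerces every value in the new row to text.
from_vector <- rbind(df, c("zhaosi", 64, "liaoning"))
class(from_vector$age)  # "character"

# Binding a one-row data frame keeps each column's own type.
new_row <- data.frame(name = "zhaosi", age = 64, birth_place = "liaoning",
                      stringsAsFactors = FALSE)
from_frame <- rbind(df, new_row)
class(from_frame$age)   # "numeric"
```

The one-row data frame must use the same column names as the original, which also protects against putting values in the wrong column.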
3.2.2 Adding columns
> # 3.2 Adding columns
> # 3.2.1 cbind()
> data <- cbind(df,gender=c('male','female','female','male'))
> data
name age birth_place gender
1 zhangsan 23 tianjin male
2 wangwu 45 wuhan female
3 liuer 56 hefei female
4 zhaosi 64 liaoning male
> # 3.2.2 data.frame
> data <- data.frame(data,Occupation=c('worker','cooker','artist','teacher'))
> data
name age birth_place gender Occupation
1 zhangsan 23 tianjin male worker
2 wangwu 45 wuhan female cooker
3 liuer 56 hefei female artist
4 zhaosi 64 liaoning male teacher
> # 3.2.3 Direct assignment
> data['income'] <- c(12,16,50,10)
> data
name age birth_place gender Occupation income
1 zhangsan 23 tianjin male worker 12
2 wangwu 45 wuhan female cooker 16
3 liuer 56 hefei female artist 50
4 zhaosi 64 liaoning male teacher 10
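The requirement that a column passed to cbind match the height of the data frame, and the use of names to rename a column, can be sketched as follows. The try() wrapper and the new column name age_years are illustrative choices, not part of the original example.

```r
df <- data.frame(
  name = c("zhangsan", "wangwu", "liuer"),
  age  = c(23, 45, 56),
  stringsAsFactors = FALSE
)

# Two values cannot fill three rows, so cbind() stops with an error.
too_short <- try(cbind(df, gender = c("male", "female")), silent = TRUE)
inherits(too_short, "try-error")  # TRUE

# A column of the right length is added without complaint.
ok <- cbind(df, gender = c("male", "female", "female"))
ncol(ok)                          # 3

# names() reads the column names; assigning into it renames a column.
names(df)[2] <- "age_years"
names(df)
```

A single value is the one exception to the length rule: it is recycled to every row, which is handy for adding a constant column.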
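3.3 Modifying elements
3.3.1 Modifying a row
Columns added later are still converted to factors by default (before R 4.0.0), which causes problems when they are modified. As the figure shows, Occupation was added without stringsAsFactors =F: assigning a value that is already one of its levels works, but assigning a new value such as actor triggers an invalid factor level warning and stores NA.
The fix is to set stringsAsFactors =F when adding the column, so that it is not converted to a factor.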
> # 3.2 Adding columns
> # 3.2.1 cbind()
> data <- cbind(df,gender=c('male','female','female','male'),stringsAsFactors =F)
> data
name age birth_place gender
1 zhangsan 23 tianjin male
2 wangwu 45 wuhan female
3 liuer 56 hefei female
4 zhaosi 64 liaoning male
> # 3.2.2 data.frame
> data <- data.frame(data,Occupation=c('worker','cooker','artist','teacher'),
+ stringsAsFactors =F )
> data
name age birth_place gender Occupation
1 zhangsan 23 tianjin male worker
2 wangwu 45 wuhan female cooker
3 liuer 56 hefei female artist
4 zhaosi 64 liaoning male teacher
> # 3.2.3 Direct assignment
> data['income'] <- c(12,16,50,10)
> data
name age birth_place gender Occupation income
1 zhangsan 23 tianjin male worker 12
2 wangwu 45 wuhan female cooker 16
3 liuer 56 hefei female artist 50
4 zhaosi 64 liaoning male teacher 10
Modifying a row now works as expected:
> data[2,] <- c('wanglaoqi',58,'shenyang','female','actor',100)
> data
name age birth_place gender Occupation income
1 zhangsan 23 tianjin male worker 12
2 wanglaoqi 58 shenyang female actor 100
3 liuer 56 hefei female artist 50
4 zhaosi 64 liaoning male teacher 10
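Assigning a row with c(...) that mixes strings and numbers has a side effect: the vector is all character, so numeric columns such as age become character as well. A minimal sketch of a type-preserving alternative, assuming a reduced two-column version of the data, is to assign a list.

```r
people <- data.frame(
  name = c("zhangsan", "wangwu"),
  age  = c(23, 45),
  stringsAsFactors = FALSE
)

# Row assigned from a character vector: age is coerced to character.
via_vector <- people
via_vector[2, ] <- c("wanglaoqi", 58)
class(via_vector$age)  # "character"

# Row assigned from a list: each element keeps its own type.
via_list <- people
via_list[2, ] <- list("wanglaoqi", 58)
class(via_list$age)    # "numeric"
via_list$age[2]        # 58
```

The list must supply one element per column, in column order.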
3.3.2 Modifying a column
> # 4.3 Modify one column
> data[,'income'] <- c(6,50,100,10)
> data
name age birth_place gender Occupation income
1 zhangsan 23 tianjin male worker 6
2 wanglaoqi 58 shenyang female actor 50
3 liuer 56 hefei female artist 100
4 zhaosi 64 liaoning male teacher 10
3.3.3 Modifying a single element
> data[2,3] <- 'jilin'
> data
name age birth_place gender Occupation income
1 zhangsan 23 tianjin male worker 6
2 wanglaoqi 58 jilin female actor 50
3 liuer 56 hefei female artist 100
4 zhaosi 64 liaoning male teacher 10
3.4 Deleting elements
3.4.1 Deleting a single row
> # 5. Deleting rows and columns
> # 5.1 Delete one row
> data
name age birth_place
1 zhangsan 23 tianjin
2 wangwu 45 wuhan
3 liuer 56 hefei
> data[-2,] # drop row 2
name age birth_place
1 zhangsan 23 tianjin
3 liuer 56 hefei
3.4.2 Deleting a single column
> # 5.2 Delete one column
> # Method 1:
> data[-2]
name birth_place
1 zhangsan tianjin
2 wangwu wuhan
3 liuer hefei
>
> # Method 2:
> data[,-2]
name birth_place
1 zhangsan tianjin
2 wangwu wuhan
3 liuer hefei
3.4.3 Deleting several rows
> # 5.3 Delete several rows
> data
name age birth_place
1 zhangsan 23 tianjin
2 wangwu 45 wuhan
3 liuer 56 hefei
> a <- data[-c(2,3),] # drop rows 2 and 3
> a
name age birth_place
1 zhangsan 23 tianjin
> class(a)
[1] "data.frame"
> b <- data[c(-2,-3),] # drop rows 2 and 3
> b
name age birth_place
1 zhangsan 23 tianjin
> class(b)
[1] "data.frame"
3.4.4 Deleting several columns
> # 5.4 Delete several columns
> data
name age birth_place
1 zhangsan 23 tianjin
2 wangwu 45 wuhan
3 liuer 56 hefei
> # Method 1
> c <- data[,-c(2,3)] # drop columns 2 and 3
> c
[1] "zhangsan" "wangwu" "liuer"
> class(c)
[1] "character"
> # Method 2:
> d <- data[,c(-2,-3)] # drop columns 2 and 3
> d
[1] "zhangsan" "wangwu" "liuer"
> class(d)
[1] "character"
> # Method 3:
> f <- data[c(-2,-3)] # drop columns 2 and 3
> f
name
1 zhangsan
2 wangwu
3 liuer
> class(f)
[1] "data.frame"
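When only one column is left after deletion, the first two methods change the data structure: the result is a vector, not a data frame. Adding the argument drop=FALSE keeps the original data frame structure:
> # Method 1
> c <- data[,-c(2,3),drop=F] # drop columns 2 and 3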
> c
name
1 zhangsan
2 wangwu
3 liuer
> class(c)
[1] "data.frame"
> # Method 2:
> d <- data[,c(-2,-3),drop=F] # drop columns 2 and 3
> d
name
1 zhangsan
2 wangwu
3 liuer
> class(d)
[1] "data.frame"
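The deletion examples above can be condensed into one sketch. It also shows two points the transcripts leave implicit: negative indexing returns a copy and leaves the original untouched, and columns can be dropped by name. The %in% approach and the object names are illustrative assumptions, not part of the original.

```r
data <- data.frame(
  name        = c("zhangsan", "wangwu", "liuer"),
  age         = c(23, 45, 56),
  birth_place = c("tianjin", "wuhan", "hefei"),
  stringsAsFactors = FALSE
)

as_vector <- data[, -c(2, 3)]                # simplified to a character vector
as_frame  <- data[, -c(2, 3), drop = FALSE]  # stays a one-column data frame

# Dropping by name avoids depending on column positions.
by_name <- data[, !(names(data) %in% c("age", "birth_place")), drop = FALSE]

class(as_vector)  # "character"
class(as_frame)   # "data.frame"
names(by_name)    # "name"

# The original is unchanged until the result is assigned back to it.
ncol(data)        # 3
```

To make a deletion permanent, assign the result back, for example data <- data[-2, ].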