
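This article introduces common ways to recode data in R: 1) logical expressions, including the ifelse function; 2) the cut function for grouping; 3) the recode function from the car package for more complex rules. These methods are suited to converting a continuous variable into a categorical one, or to applying a custom transformation to existing values.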
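#Recoding variables
#When analyzing data we often need to convert a variable's values into other values (for example, turning a continuous variable into a categorical one), which means recoding the original data. This article introduces four recoding methods commonly used in R:
> 1. Recoding with logical expressions
> 2. Using the ifelse function
> 3. Using the cut function
> 4. Using the recode function from the car package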
> 1 Recoding with logical expressions



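> Suppose we want to split the continuous variable x below into three groups at the cut points 10 and 20, with the new groups coded 1, 2, and 3:
> x <- c(4,12,50,18,50,22,23,46,8,46,36,18,10,14,35,48,23,17,29,30)
> x2 <- 1*(x <= 10) + 2*(x > 10 & x <= 20) + 3*(x > 20)   #each comparison gives TRUE/FALSE, which is treated as 1/0
> #The numeric codes can then be used as an index to attach text labels
> labels <- c("A","B","C")
> x3 <- labels[x2]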
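The arithmetic in the expression above works because R coerces TRUE to 1 and FALSE to 0 when a logical vector is multiplied by a number. The following is a minimal sketch on a shortened vector; the names `x_demo`, `in_low`, `in_mid`, `in_high`, `group`, and `group_labels` are illustrative and do not come from the original session.

```r
# Shortened illustrative vector (not the article's full data)
x_demo <- c(4, 12, 50, 10, 20, 21)

# Each comparison yields a logical vector; exactly one is TRUE per element
in_low  <- x_demo <= 10
in_mid  <- x_demo > 10 & x_demo <= 20
in_high <- x_demo > 20

# Multiplying coerces TRUE/FALSE to 1/0, so the sum picks the group number
group <- 1 * in_low + 2 * in_mid + 3 * in_high
print(group)          # 1 2 3 1 2 3

# The numeric codes can index a label vector directly
group_labels <- c("A", "B", "C")[group]
print(group_labels)   # "A" "B" "C" "A" "B" "C"
```

This approach only gives sensible codes when the conditions are mutually exclusive and together cover every value; overlapping conditions would add their codes together.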



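> 2 Using the ifelse function
> #Basic syntax: ifelse(test, yes, no), which returns yes where test is TRUE and no where it is FALSE
> #Recoding into two groups
> x <- c(4,12,50,18,50,22,23,46,8,46,36,18,10,14,35,48,23,17,29,30)
> x2=ifelse(x<=30,1,2)    #returns 1 where the condition holds and 2 where it does not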
> x2
 [1] 1 1 2 1 2 1 1 2 1 2 2 1 1 1 2 2 1 1 1 1
> x3=ifelse(x<=30,"A","B")

> y <- c("B","A","C","C","B","A","D","B","C","D")
> y2 <- ifelse(y %in% c("A","C"),"Group1","Group2") #with the %in% operator, "A" and "C" are recoded as "Group1" and "B" and "D" as "Group2"
> y2
 [1] "Group2" "Group1" "Group1" "Group1" "Group2" "Group1" "Group2" "Group2" "Group1"
[10] "Group2"
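For more than two groups, ifelse calls can be nested. The sketch below reuses the cut points 10 and 20 from section 1 on a small illustrative vector (`x_demo` and `y_demo` are made-up names and data) and also shows how missing values behave, which is easy to overlook.

```r
# Illustrative data, including a missing value
x_demo <- c(4, 12, 50, 10, 20, NA)

# The inner ifelse() decides the result wherever the outer test is FALSE
group <- ifelse(x_demo <= 10, 1,
                ifelse(x_demo <= 20, 2, 3))
print(group)     # 1 2 3 1 2 NA

# %in% returns FALSE (not NA) for a missing value,
# so the NA below silently ends up in "Group2"
y_demo <- c("A", "B", NA)
y_group <- ifelse(y_demo %in% c("A", "C"), "Group1", "Group2")
print(y_group)   # "Group1" "Group2" "Group2"
```

If missing values should stay missing in the second case, one option is to test `is.na(y_demo)` first in an outer ifelse.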




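> 3 Using the cut() function
> #Basic syntax: cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE)
> #where
> #x is a numeric vector
> #breaks gives the cut points. If breaks is a vector, x is split at those values; if breaks is a single integer k of 2 or more,
> #x is split into k intervals of equal width
> #labels gives the names of the resulting groups. With the default labels=NULL the result is a factor labeled by interval; with labels=FALSE the result is an integer vector of interval numbers; with a vector of names the result is a factor using those names.
> #include.lowest=FALSE means that a value equal to the smallest break (or the largest break when right=FALSE) is not included in any interval and becomes NA
> #right=TRUE means each interval is open on the left and closed on the right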
> ?cut
> #Note: R is case-sensitive. The vector of breaks must be built with lowercase c(); uppercase C() is a different function (it sets contrasts on a factor), so cut(x,breaks = C(0,10,20,max(x))) stops with the error "object not interpretable as a factor".
> x2 <- cut(x,breaks = c(0,10,20,max(x)),labels = c(1,2,3))
> x2
 [1] 1 2 3 2 3 3 3 3 1 3 3 2 1 2 3 3 3 2 3 3
Levels: 1 2 3
> x2 <- cut(x,breaks = c(0,10,20,max(x)))
> x2
 [1] (0,10]  (10,20] (20,50] (10,20] (20,50] (20,50] (20,50] (20,50] (0,10]  (20,50]
[11] (20,50] (10,20] (0,10]  (10,20] (20,50] (20,50] (20,50] (10,20] (20,50] (20,50]
Levels: (0,10] (10,20] (20,50]
> x2 <- cut(x,breaks = c(0,10,20,max(x)),labels = c(1,2,3),include.lowest = TRUE)
> x2
 [1] 1 2 3 2 3 3 3 3 1 3 3 2 1 2 3 3 3 2 3 3
Levels: 1 2 3
> x2 <- cut(x,breaks = c(0,10,20,max(x)),include.lowest = TRUE)
> x2
 [1] [0,10]  (10,20] (20,50] (10,20] (20,50] (20,50] (20,50] (20,50] [0,10]  (20,50]
[11] (20,50] (10,20] [0,10]  (10,20] (20,50] (20,50] (20,50] (10,20] (20,50] (20,50]
Levels: [0,10] (10,20] (20,50]
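In the data above the smallest value of x is 4, so include.lowest changes only the label of the first interval. Its effect on the result is easier to see when a value equals the lowest break. The sketch below uses a small made-up vector `v` and break vector `brk` for that purpose.

```r
# Illustrative values that sit exactly on the break points
v   <- c(0, 10, 20, 30)
brk <- c(0, 10, 20, 30)

# Default: intervals are (a, b], so 0 equals the lowest break and gets NA
f_default <- cut(v, breaks = brk)
print(as.character(f_default))   # NA "(0,10]" "(10,20]" "(20,30]"

# include.lowest = TRUE closes the first interval on the left: [0,10]
f_lowest <- cut(v, breaks = brk, include.lowest = TRUE)
print(as.character(f_lowest))    # "[0,10]" "[0,10]" "(10,20]" "(20,30]"

# right = FALSE flips the intervals to [a, b), so now 30 gets NA
f_left <- cut(v, breaks = brk, right = FALSE)
print(as.character(f_left))      # "[0,10)" "[10,20)" "[20,30)" NA
```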
> x2 <- cut(x,breaks = c(0,10,20,max(x)),labels = c(1,2,3),include.lowest = TRUE)
> x2
 [1] 1 2 3 2 3 3 3 3 1 3 3 2 1 2 3 3 3 2 3 3
Levels: 1 2 3
> as.vector(x2)
 [1] "1" "2" "3" "2" "3" "3" "3" "3" "1" "3" "3" "2" "1" "2" "3" "3" "3" "2" "3" "3"
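As the output above shows, as.vector() on a factor returns the labels as character strings. When the labels are meant to be used as numbers, a common pitfall is calling as.numeric() directly on the factor, which returns the internal level codes instead of the label values. A minimal sketch, using made-up labels 10, 20, 30 so that the two results differ visibly:

```r
# Factor whose labels (10, 20, 30) differ from its level codes (1, 2, 3)
f <- cut(c(4, 12, 50), breaks = c(0, 10, 20, 50), labels = c(10, 20, 30))

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(f)
print(codes)    # 1 2 3

# Converting through character recovers the label values as numbers
values <- as.numeric(as.character(f))
print(values)   # 10 20 30
```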
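> #Now simulate 10 random scores from a normal distribution with mean 60 and standard deviation 10, and use the breaks argument of cut to split them into 5 groups
> #Generate 10 normally distributed values with mean 60 and standard deviation 10, rounded to integers
> #(no seed is set, so the values differ from run to run)
> score = round(rnorm(10,60,10))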
> score
 [1] 41 58 67 75 73 64 81 75 66 61
> score <- cut(score,breaks = 5)
> score
 [1] (41,49] (57,65] (65,73] (73,81] (65,73] (57,65] (73,81] (73,81] (65,73] (57,65]
Levels: (41,49] (49,57] (57,65] (65,73] (73,81]
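> #As the result above shows, cut() returns a factor by default and automatically names the five groups (41,49] ... (73,81]
> The group labels returned by cut() can take three forms. First, the labels argument can set the names explicitly. Second, by default cut() uses the intervals themselves as labels, such as (41,49] (57,65] (65,73] (73,81]. Third, with labels=FALSE the result is the number of the interval each value falls in, such as 5 3 4 3 5 3 5 1 3 4, where 1 is the lowest interval and 5 the highest.
> #If labels=FALSE is set in cut(), the output is an integer code (the number of the interval each value falls in)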
> score = round(rnorm(10,60,10))
> score.cut <- cut(score,breaks = 5,labels = FALSE)
> score.cut
 [1] 5 3 4 3 5 3 5 1 3 4
> score.cut = cut(score,breaks = 5 )
> score.cut
 [1] (60.2,64]   (52.6,56.4] (56.4,60.2] (52.6,56.4] (60.2,64]   (52.6,56.4] (60.2,64]  
 [8] (45,48.8]   (52.6,56.4] (56.4,60.2]
Levels: (45,48.8] (48.8,52.6] (52.6,56.4] (56.4,60.2] (60.2,64]
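The relationship between the two outputs above can be checked directly: the integer codes from labels=FALSE are positions in the factor's level list. The sketch below types in the ten scores printed later in this session as a fixed vector (named `score_demo` here, an illustrative name) so that it does not depend on random numbers, and then shows one way to make a simulated sample repeatable. The seed value 123 is an arbitrary assumption.

```r
# Scores typed in by hand from the session, so the result is deterministic
score_demo <- c(62, 55, 58, 56, 64, 56, 61, 45, 55, 57)

f     <- cut(score_demo, breaks = 5)                   # factor with interval labels
codes <- cut(score_demo, breaks = 5, labels = FALSE)   # integer interval numbers

print(codes)              # 5 3 4 3 5 3 5 1 3 4
print(levels(f)[codes])   # the interval label belonging to each code

# The integer codes agree with the factor's internal codes
stopifnot(identical(codes, as.integer(f)))

# Setting a seed before rnorm() makes the simulated sample repeatable
set.seed(123)   # arbitrary seed, chosen only for illustration
a <- round(rnorm(10, 60, 10))
set.seed(123)
b <- round(rnorm(10, 60, 10))
stopifnot(identical(a, b))
```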



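> 4 Using the recode function from the car package
> #The recodes argument is a single character string containing recoding rules separated by semicolons:
> #recodes="rule1;rule2;..."
> #Each rule has the form old values=new value. In the old-values part, lo stands for the smallest old value and hi for the largest. Rules are written as follows:
> #(1) old value=new value: a single old value. For example, "0=NA" changes 0 to NA.
> #(2) vector of old values=new value: several old values mapped to one new value. For example, "c(7,8,9)='high'" changes 7, 8, and 9 to "high".
> #(3) start:end=new value: a range of numeric values. For example, "lo:19='C'".
> #(4) else=new value: all remaining cases. For example, "else=NA".
> recode is similar to ifelse, but when there are many groups it is easier to use (no nested ifelse calls), and it is more flexible than cut because it can handle more complex groupings, such as putting values equal to A or C into one group.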
> #Example
> library(carData)
> library(car)
> x <- c(1,2,3,1,2,3,1,2,3)
> recode(x,"c(1,2)='A';else='B'")
[1] "A" "A" "B" "A" "A" "B" "A" "A" "B"
> #Recode scores from 0 to 40 as 1, 41 to 60 as 2, 61 to 80 as 3, 81 and above as 4, and anything else as NA
> score
 [1] 62 55 58 56 64 56 61 45 55 57
> recode(score,"0:40=1;41:60=2;61:80=3;81:hi=4;else=NA")
 [1] 3 2 2 2 3 2 3 2 2 2
> #The same example with 'A' 'B' 'C' 'D' as the new codes
> recode(score,"0:40='A';41:60='B';61:80='C';81:hi='D';else=NA")
 [1] "C" "B" "B" "B" "C" "B" "C" "B" "B" "B"
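The four rule forms listed at the start of this section can be combined in one recodes string. The sketch below is a small illustration on a made-up vector `v`; it assumes the car package is installed and skips the call otherwise. The rules are written so that they do not overlap, which keeps the result independent of rule order.

```r
# Run only if the car package is available (assumption: it is installed)
if (requireNamespace("car", quietly = TRUE)) {
  v <- c(0, 7, 8, 9, 15, 99)   # illustrative data

  # single value; vector of values; range; everything else
  out <- car::recode(v, "0=NA; c(7,8,9)='high'; 10:19='mid'; else='other'")
  print(out)   # NA "high" "high" "high" "mid" "other"
}
```

Calling the function as `car::recode` is a design choice for this sketch: other packages (dplyr, for example) export a function with the same name, and the qualified call avoids picking up the wrong one.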
