长表和宽表的相互转化

使用 函数melt()和decast(),是reshape2包中函数的增强

先考虑传统长宽表转化函数,

 

1.默认函数

a.融合data.table(宽表到长表)

考虑如下data.table:

> DT = fread("melt_default.csv")
> DT

   family_id age_mother dob_child1 dob_child2 dob_child3
1:         1         30 1998/11/26  2000/1/29         NA
2:         2         27  1996/6/22         NA         NA
3:         3         26  2002/7/11   2004/4/5   2007/9/2
4:         4         32 2004/10/10  2009/8/27  2012/7/21
5:         5         29  2000/12/5  2005/2/28         NA
> str(DT)
Classes ‘data.table’ and 'data.frame':        5 obs. of  5 variables:
 $ family_id : int  1 2 3 4 5
 $ age_mother: int  30 27 26 32 29
 $ dob_child1: chr  "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
 $ dob_child2: chr  "2000/1/29" NA "2004/4/5" "2009/8/27" ...
 $ dob_child3: chr  NA NA "2007/9/2" "2012/7/21" ...
 - attr(*, ".internal.selfref")=<externalptr>

 

把数据宽表转化长表,dob列合并成一列。

> DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"),
+              measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
> DT.m1

    family_id age_mother   variable      value
 1:         1         30 dob_child1 1998/11/26
 2:         2         27 dob_child1  1996/6/22
 3:         3         26 dob_child1  2002/7/11
 4:         4         32 dob_child1 2004/10/10
 5:         5         29 dob_child1  2000/12/5
 6:         1         30 dob_child2  2000/1/29
 7:         2         27 dob_child2         NA
 8:         3         26 dob_child2   2004/4/5
 9:         4         32 dob_child2  2009/8/27
10:         5         29 dob_child2  2005/2/28
11:         1         30 dob_child3         NA
12:         2         27 dob_child3         NA
13:         3         26 dob_child3   2007/9/2
14:         4         32 dob_child3  2012/7/21
15:         5         29 dob_child3         NA

> str(DT.m1)
Classes ‘data.table’ and 'data.frame':        15 obs. of  4 variables:
 $ family_id : int  1 2 3 4 5 1 2 3 4 5 ...
 $ age_mother: int  30 27 26 32 29 30 27 26 32 29 ...
 $ variable  : Factor w/ 3 levels "dob_child1","dob_child2",..: 1 1 1 1 1 2 2 2 2 2 ...
 $ value     : chr  "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
 - attr(*, ".internal.selfref")=<externalptr>

 

生成两个新变量,默认列名是“variable”和“value”

生成列“variable”默认类型是“factor”,若是想生成“charactor”,可用variable.factor=FALSE来设置,这个只能在data.table的melt中设置,不能用在reshape2中设置。

 

为变量“variable”和“value”进行命名

> DT.m1 = melt(DT, measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
+              variable.name = "child", value.name = "dob")
> DT.m1

    family_id age_mother      child        dob
 1:         1         30 dob_child1 1998/11/26
 2:         2         27 dob_child1  1996/6/22
 3:         3         26 dob_child1  2002/7/11
 4:         4         32 dob_child1 2004/10/10
 5:         5         29 dob_child1  2000/12/5
 6:         1         30 dob_child2  2000/1/29
 7:         2         27 dob_child2         NA
 8:         3         26 dob_child2   2004/4/5
 9:         4         32 dob_child2  2009/8/27
10:         5         29 dob_child2  2005/2/28
11:         1         30 dob_child3         NA
12:         2         27 dob_child3         NA
13:         3         26 dob_child3   2007/9/2
14:         4         32 dob_child3  2012/7/21
15:         5         29 dob_child3         NA

 

b)长表转成宽表

> dcast(DT.m1, family_id + age_mother ~ child, value.var = "dob")
   family_id age_mother dob_child1 dob_child2 dob_child3
1:         1         30 1998/11/26  2000/1/29         NA
2:         2         27  1996/6/22         NA         NA
3:         3         26  2002/7/11   2004/4/5   2007/9/2
4:         4         32 2004/10/10  2009/8/27  2012/7/21
5:         5         29  2000/12/5  2005/2/28         NA

 

也可以通过使用dcast()中的函数fun.aggregate进行分组计算,计算表DT.m1中,family_id下dob不等于NA的个数:

> dcast(DT.m1,family_id ~.,fun.agg=function(x) sum(!is.na(x)),value.var = "dob")
   family_id .
1:         1 2
2:         2 1
3:         3 3
4:         4 3
5:         5 2

 

 

2.当前melt/decast的使用局限性

> DT = fread("melt_enhanced.csv")
> DT

   family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1:         1         30 1998/11/26  2000/1/29         NA             1             2            NA
2:         2         27  1996/6/22         NA         NA             2            NA            NA
3:         3         26  2002/7/11   2004/4/5   2007/9/2             2             2             1
4:         4         32 2004/10/10  2009/8/27  2012/7/21             1             1             1
5:         5         29  2000/12/5  2005/2/28         NA             2             1            NA

 

若想合并所有的dob列和gender列,按照目前的做法,

> DT.m1 = melt(DT, id = c("family_id", "age_mother"))
Warning message:
In melt.data.table(DT, id = c("family_id", "age_mother")) :
  'measure.vars' [dob_child1, dob_child2, dob_child3, gender_child1, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'character'. All measure variables not of type 'character' will be coerced to. Check DETAILS in ?melt.data.table for more on coercion.

> DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)]
> DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
> DT.c1

    family_id age_mother  child        dob gender
 1:         1         30 child1 1998/11/26      1
 2:         1         30 child2  2000/1/29      2
 3:         1         30 child3         NA     NA
 4:         2         27 child1  1996/6/22      2
 5:         2         27 child2         NA     NA
 6:         2         27 child3         NA     NA
 7:         3         26 child1  2002/7/11      2
 8:         3         26 child2   2004/4/5      2
 9:         3         26 child3   2007/9/2      1
10:         4         32 child1 2004/10/10      1
11:         4         32 child2  2009/8/27      1
12:         4         32 child3  2012/7/21      1
13:         5         29 child1  2000/12/5      2
14:         5         29 child2  2005/2/28      1
15:         5         29 child3         NA     NA

 

> str(DT.c1)
Classes ‘data.table’ and 'data.frame':        15 obs. of  5 variables:
 $ family_id : int  1 1 1 2 2 2 3 3 3 4 ...
 $ age_mother: int  30 30 30 27 27 27 26 26 26 32 ...
 $ child     : chr  "child1" "child2" "child3" "child1" ...
 $ dob       : chr  "1998/11/26" "2000/1/29" NA "1996/6/22" ...
 $ gender    : chr  "1" "2" NA "2" ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr  "family_id" "age_mother" "child"

 

 

3.增强新功能

a.增强melt

> colA = paste("dob_child", 1:3, sep = "")
> colB = paste("gender_child", 1:3, sep = "")
> DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender"))
> DT.m2

    family_id age_mother variable        dob gender
 1:         1         30        1 1998/11/26      1
 2:         2         27        1  1996/6/22      2
 3:         3         26        1  2002/7/11      2
 4:         4         32        1 2004/10/10      1
 5:         5         29        1  2000/12/5      2
 6:         1         30        2  2000/1/29      2
 7:         2         27        2         NA     NA
 8:         3         26        2   2004/4/5      2
 9:         4         32        2  2009/8/27      1
10:         5         29        2  2005/2/28      1
11:         1         30        3         NA     NA
12:         2         27        3         NA     NA
13:         3         26        3   2007/9/2      1
14:         4         32        3  2012/7/21      1
15:         5         29        3         NA     NA

 

> str(DT.m2)
Classes ‘data.table’ and 'data.frame':        15 obs. of  5 variables:
 $ family_id : int  1 2 3 4 5 1 2 3 4 5 ...
 $ age_mother: int  30 27 26 32 29 30 27 26 32 29 ...
 $ variable  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
 $ dob       : chr  "1998/11/26" "1996/6/22" "2002/7/11" "2004/10/10" ...
 $ gender    : int  1 2 2 1 2 2 NA 2 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

 

使用pattens

> DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
> DT.m2

    family_id age_mother variable        dob gender
 1:         1         30        1 1998/11/26      1
 2:         2         27        1  1996/6/22      2
 3:         3         26        1  2002/7/11      2
 4:         4         32        1 2004/10/10      1
 5:         5         29        1  2000/12/5      2
 6:         1         30        2  2000/1/29      2
 7:         2         27        2         NA     NA
 8:         3         26        2   2004/4/5      2
 9:         4         32        2  2009/8/27      1
10:         5         29        2  2005/2/28      1
11:         1         30        3         NA     NA
12:         2         27        3         NA     NA
13:         3         26        3   2007/9/2      1
14:         4         32        3  2012/7/21      1
15:         5         29        3         NA     NA

 

b.增强dcast

> DT.c2 = dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender"))
> DT.c2
   family_
id age_mother      dob_1     dob_2     dob_3 gender_1 gender_2 gender_3
1:         1         30 1998/11/26 2000/1/29        NA        1        2       NA
2:         2         27  1996/6/22        NA        NA        2       NA       NA
3:         3         26  2002/7/11  2004/4/5  2007/9/2        2        2        1
4:         4         32 2004/10/10 2009/8/27 2012/7/21        1        1        1
5:         5         29  2000/12/5 2005/2/28        NA        2        1       NA

 

多看书,多写文章,多读书(发出声音来的读),每天坚持1小时。可以学到很多东西。

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值