R 利用tidyr包处理数据实现交叉表

最新推荐文章于 2022-07-28 07:29:15 发布

梦想画家

最新推荐文章于 2022-07-28 07:29:15 发布

阅读量1.8k

点赞数 3

分类专栏： R语言文章标签： r语言开发语言

本文链接：https://blog.csdn.net/neweastsun/article/details/121287785

版权

R语言专栏收录该内容

69 篇文章 37 订阅

订阅专栏

本文使用 dplyr 和 tidyr包中函数实现较差表。

spread()函数

spread()函数来自tidyr包，可以把键值对变量展开为多列。语法如下：

spread(data, key value)

data：数据框名称
key：它的值将变成列变量
value: 它的值用于填充新产生变量

下面通过示例来的学习。

展开值为两列

#create data frame
df <- data.frame(player=rep(c('A', 'B'), each=4),
                 year=rep(c(1, 1, 2, 2), times=2),
                 stat=rep(c('points', 'assists'), times=4),
                 amount=c(14, 6, 18, 7, 22, 9, 38, 4))

#view data frame
df

#   player year    stat amount
# 1      A    1  points     14
# 2      A    1 assists      6
# 3      A    2  points     18
# 4      A    2 assists      7
# 5      B    1  points     22
# 6      B    1 assists      9
# 7      B    2  points     38
# 8      B    2 assists      4

library(tidyr)

#spread stat column across multiple columns
spread(df, key=stat, value=amount)

#   player year assists points
# 1      A    1       6     14
# 2      A    2       7     18
# 3      B    1       9     22
# 4      B    2       4     38

超过两列示例

#create data frame
df2 <- data.frame(player=rep(c('A'), times=8),
                 year=rep(c(1, 2), each=4),
                 stat=rep(c('points', 'assists', 'steals', 'blocks'), times=2),
                 amount=c(14, 6, 2, 1, 29, 9, 3, 4))

#view data frame
df2

#   player year    stat amount
# 1      A    1  points     14
# 2      A    1 assists      6
# 3      A    1  steals      2
# 4      A    1  blocks      1
# 5      A    2  points     29
# 6      A    2 assists      9
# 7      A    2  steals      3
# 8      A    2  blocks      4

library(tidyr)

#spread stat column across multiple columns
spread(df2, key=stat, value=amount)

#   player year assists blocks points steals
# 1      A    1       6      1     14      2
# 2      A    2       9      4     29      3

spreand() 分组场景

语法如下：


# df %>%
#   group_by(var1, var2) %>%
#   tally() %>%
#   spread(var1, n)

下面通过示例来学习。

创建基础交叉表

首先定义示例数据框：

#create data frame
df <- data.frame(team=c('T1', 'T1', 'T1', 'T1', 'T2', 'T2', 'T2', 'T2'),
                 position=c('G', 'G', 'F', 'C', 'G', 'F', 'F', 'C'),
                 points=c(7, 7, 8, 11, 13, 15, 19, 13))

#view data frame
df

#   team position points
# 1   T1        G      7
# 2   T1        G      7
# 3   T1        F      8
# 4   T1        C     11
# 5   T2        G     13
# 6   T2        F     15
# 7   T2        F     19
# 8   T2        C     13

下面代码使用team 和 position 变量创建交叉表：

library(dplyr)
library(tidyr)

#produce crosstab 
df %>%
  group_by(team, position) %>%
  tally() %>%
  spread(position, n)

# A tibble: 2 x 4
#   team      C     F     G
#   <chr> <int> <int> <int>
# 1 T1        1     1     2
# 2 T2        1     2     1

返回结果表示T1团队在C,F,G位置上的人数分别为1,1,2；T2团队在C,F,G位置上的人数分别为1,2,1。
如果我们展开团队变量呢，请看代码：


df %>%
  group_by(team, position) %>%
  tally() %>%
  spread(team, n)

# A tibble: 3 x 3
# # Groups:   position [3]
#   position    T1    T2
#   <chr>    <int> <int>
# 1 C            1     1
# 2 F            1     2
# 3 G            2     1

结果非常直接，关键是*spread(team, n)*实现方式非常简单且统一。

上面使用了tally函数。该函数是对summarise()函数的包装，同时依据第一次计算或重新计算而调用n() 、sum() 函数。相当于 count()函数，但之前要调用group_by() ，之后要调用ungroup().

tally()函数

我们看看tally()函数示例。

# tally() is short-hand for summarise()
mtcars %>% tally()
#>    n
#> 1 32
mtcars %>% group_by(cyl) %>% tally()
#> # A tibble: 3 x 2
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14
# count() is a short-hand for group_by() + tally()
mtcars %>% count(cyl)
#> # A tibble: 3 x 2
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14
# Note that if the data is already grouped, count() adds
# an additional group that is removed afterwards
mtcars %>% group_by(gear) %>% count(carb)
#> # A tibble: 11 x 3
#> # Groups:   gear [3]
#>     gear  carb     n
#>    <dbl> <dbl> <int>
#>  1     3     1     3
#>  2     3     2     4
#>  3     3     3     3
#>  4     3     4     5
#>  5     4     1     4
#>  6     4     2     4
#>  7     4     4     4
#>  8     5     2     2
#>  9     5     4     1
#> 10     5     6     1
#> 11     5     8     1

# add_tally() is short-hand for mutate()
mtcars %>% add_tally()
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     n
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4    32
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4    32
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1    32
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1    32
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2    32
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1    32
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4    32
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2    32
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2    32
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4    32
#> # … with 22 more rows
# add_count() is a short-hand for group_by() + add_tally()
mtcars %>% add_count(cyl)
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     n
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     7
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     7
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1    11
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     7
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2    14
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     7
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4    14
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2    11
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2    11
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     7
#> # … with 22 more rows

# count() and tally() are designed so that you can call
# them repeatedly, each time rolling up a level of detail

梦想画家

关注

3
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
R 利用tidyr包处理数据实现交叉表

本文使用 dplyr 和 tidyr包中函数实现较差表。spread()函数spread()函数来自tidyr包，可以把键值对变量展开为多列。语法如下：spread(data, key value)data：数据框名称key：它的值将变成列变量value: 它的值用于填充新产生变量下面通过示例来的学习。展开值为两列#create data framedf <- data.frame(player=rep(c('A', 'B'), each=4), .
复制链接

扫一扫

专栏目录