R语言dplyr入门到进阶

医学和生信笔记

已于 2022-11-12 10:25:33 修改

阅读量1.3k

点赞数 5

分类专栏：数据分析文章标签： r语言

于 2022-11-12 09:25:23 首次发布

本文链接：https://blog.csdn.net/Ayue0616/article/details/127816322

版权

数据分析专栏收录该内容

21 篇文章 13 订阅

订阅专栏

文章目录

dplyr介绍
grouped data
two-table verbs
column-wise operations
row-wide operations

dplyr介绍

tidyverse系列应该算是R语言数据分析中的瑞士军刀了，统一的格式，简洁的代码，管道符便于阅读的形式，都能让大家快速上手。R数据科学就是专门讲这个系列的，但是对于很多函数的用法和细节问题，都没有说，所以在使用时还是会经常遇到各种问题。

我根据R数据科学和tidyverse官网的教程，整理了几篇笔记，主要是对tidyverse的各种函数的用法进行详细的演示。

前面已经介绍过了forcats包处理因子型数据，lubridate包处理日期时间格式数据。

下面介绍dplyr包。

在处理数据时，要明确以下几个问题：

明确你的目的
用计算机程序的方式描述你的任务
执行程序

dplyr包可以帮你又快又简单地处理这些问题。tidyr包主要聚焦于把数据变成整洁数据，dplyr包主要功能在于对整洁数据进行各种操作，比如新增、筛选、汇总、合并等。

安装

install.packages("tidyverse")

数据集：starwars

下面使用*星战（starwars）*数据集演示基本的dplyr用法。

starwars数据集共有87行，14列，记录了星战里面的87个人物（机器人、外星人等等）的14个特点，比如姓名、身高、体重、头发颜色、眼睛颜色、种族等。

library(dplyr)
## 
## 载入程辑包：'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

dim(starwars)
## [1] 87 14
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

针对单个数据集的操作

今天介绍的都是主要针对单个数据集进行操作的各种函数，也是最常见的类型。

根据作用方式不同，大致可以分为以下3类：

作用于行
- filter()
- slice()
- arrange()
作用于列
- select()
- rename()
- mutate()
- relocate()
作用于一组数据
- summarise()

filter()根据条件筛选行

filter()函数用于筛选符合条件的行，可以用各种表达式进行筛选，比如筛选眼睛颜色是brown并且皮肤颜色是light的行，注意这里不需要使用 & 符号：

starwars %>% filter(skin_color == "light", eye_color == "brown")
## # A tibble: 7 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
## 1 Leia Or~    150    49 brown      light      brown             19 female femin~
## 2 Biggs D~    183    84 black      light      brown             24 male   mascu~
## 3 Cordé       157    NA brown      light      brown             NA female femin~
## 4 Dormé       165    NA brown      light      brown             NA female femin~
## 5 Raymus ~    188    79 brown      light      brown             NA male   mascu~
## 6 Poe Dam~     NA    NA brown      light      brown             NA male   mascu~
## 7 Padmé A~    165    45 brown      light      brown             46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

但是需要注意，filter()函数不支持直接使用行号进行筛选，比如说你想选择第1行到第3行，下面这种写法是错误的：

starwars %>% filter(1:3)

这种情况应该使用slice()函数：

starwars %>% slice(1:3)
## # A tibble: 3 x 14
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk~    172    77 blond      fair        blue              19 male  mascu~
## 2 C-3PO       167    75 <NA>       gold        yellow           112 none  mascu~
## 3 R2-D2        96    32 <NA>       white, blue red               33 none  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

arrange()进行排序

arrange()函数是用来排序的，根据某一列进行排序。

starwars %>% arrange(height, mass)
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yoda         66    17 white      green      brown            896 male  mascu~
##  2 Ratts T~     79    15 none       grey, blue unknown           NA male  mascu~
##  3 Wicket ~     88    20 brown      brown      brown              8 male  mascu~
##  4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu~
##  5 R2-D2        96    32 <NA>       white, bl~ red               33 none  mascu~
##  6 R4-P17       96    NA none       silver, r~ red, blue         NA none  femin~
##  7 R5-D4        97    32 <NA>       white, red red               NA none  mascu~
##  8 Sebulba     112    40 none       grey, red  orange            NA male  mascu~
##  9 Gasgano     122    NA none       white, bl~ black             NA male  mascu~
## 10 Watto       137    NA black      blue, grey yellow            NA male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

desc()函数可以进行倒序：

starwars %>% arrange(desc(height))
## # A tibble: 87 x 14
##    name    height  mass hair_color skin_color  eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
##  1 Yarael~    264    NA none       white       yellow          NA   male  mascu~
##  2 Tarfful    234   136 brown      brown       blue            NA   male  mascu~
##  3 Lama Su    229    88 none       grey        black           NA   male  mascu~
##  4 Chewba~    228   112 brown      unknown     blue           200   male  mascu~
##  5 Roos T~    224    82 none       grey        orange          NA   male  mascu~
##  6 Grievo~    216   159 none       brown, whi~ green, y~       NA   male  mascu~
##  7 Taun We    213    NA none       grey        black           NA   fema~ femin~
##  8 Rugor ~    206    NA none       green       orange          NA   male  mascu~
##  9 Tion M~    206    80 none       grey        black           NA   male  mascu~
## 10 Darth ~    202   136 none       white       yellow          41.9 male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

slice()根据位置选择行

选择第5-10行的数据：

starwars %>% slice(5:10)
## # A tibble: 6 x 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Leia Or~    150    49 brown       light      brown             19 fema~ femin~
## 2 Owen La~    178   120 brown, grey light      blue              52 male  mascu~
## 3 Beru Wh~    165    75 brown       light      blue              47 fema~ femin~
## 4 R5-D4        97    32 <NA>        white, red red               NA none  mascu~
## 5 Biggs D~    183    84 black       light      brown             24 male  mascu~
## 6 Obi-Wan~    182    77 auburn, wh~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

这其实是一组函数，还有各种变体，比如下面这个，选择前4行：

starwars %>% slice_head(n = 4)
## # A tibble: 4 x 14
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk~    172    77 blond      fair        blue            19   male  mascu~
## 2 C-3PO       167    75 <NA>       gold        yellow         112   none  mascu~
## 3 R2-D2        96    32 <NA>       white, blue red             33   none  mascu~
## 4 Darth V~    202   136 none       white       yellow          41.9 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

选择最后面的10%的行：

starwars %>% slice_tail(prop = 0.1)
## # A tibble: 8 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
## 1 Sly Moo~    178    48 none       pale       white             NA <NA>   <NA>  
## 2 Tion Me~    206    80 none       grey       black             NA male   mascu~
## 3 Finn         NA    NA black      dark       dark              NA male   mascu~
## 4 Rey          NA    NA brown      light      hazel             NA female femin~
## 5 Poe Dam~     NA    NA brown      light      brown             NA male   mascu~
## 6 BB8          NA    NA none       none       black             NA none   mascu~
## 7 Captain~     NA    NA unknown    unknown    unknown           NA <NA>   <NA>  
## 8 Padmé A~    165    45 brown      light      brown             46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

随机选择10%的行：

starwars %>% slice_sample(prop = 0.1) # n=2
## # A tibble: 8 x 14
##   name    height  mass hair_color skin_color   eye_color birth_year sex   gender
##   <chr>    <int> <dbl> <chr>      <chr>        <chr>          <dbl> <chr> <chr> 
## 1 Lobot      175    79 none       light        blue              37 male  mascu~
## 2 Zam We~    168    55 blonde     fair, green~ yellow            NA fema~ femin~
## 3 Ric Ol~    183    NA brown      fair         blue              NA <NA>  <NA>  
## 4 R4-P17      96    NA none       silver, red  red, blue         NA none  femin~
## 5 Lando ~    177    79 black      dark         brown             31 male  mascu~
## 6 Greedo     173    74 <NA>       green        black             44 male  mascu~
## 7 Ackbar     180    83 none       brown mottle orange            41 male  mascu~
## 8 Rugor ~    206    NA none       green        orange            NA male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

随机选择10%的行，可以重复：

starwars %>% slice_sample(n=10, replace = T)
## # A tibble: 10 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Jek To~    180   110 brown       fair       blue              NA male  mascu~
##  2 Quarsh~    183    NA black       dark       brown             62 <NA>  <NA>  
##  3 Arvel ~     NA    NA brown       fair       brown             NA male  mascu~
##  4 Darth ~    175    80 none        red        yellow            54 male  mascu~
##  5 Finn        NA    NA black       dark       dark              NA male  mascu~
##  6 Ric Ol~    183    NA brown       fair       blue              NA <NA>  <NA>  
##  7 Mace W~    188    84 none        dark       brown             72 male  mascu~
##  8 Jango ~    183    79 black       tan        brown             66 male  mascu~
##  9 San Hi~    191    NA none        grey       gold              NA male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

选择某一列中最大或者最小的几个值所在的行，注意不能有NA值：

starwars %>% filter(!is.na(height)) %>% 
  slice_max(height, n = 5) # 选择的这列不能有NA
## # A tibble: 5 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex   gender 
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  
## 1 Yarael ~    264    NA none       white      yellow            NA male  mascul~
## 2 Tarfful     234   136 brown      brown      blue              NA male  mascul~
## 3 Lama Su     229    88 none       grey       black             NA male  mascul~
## 4 Chewbac~    228   112 brown      unknown    blue             200 male  mascul~
## 5 Roos Ta~    224    82 none       grey       orange            NA male  mascul~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

select()选择列

直接根据列名选择列，列名不需要使用引号：

starwars %>% select(hair_color, skin_color, eye_color)
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

选择列名中以color结尾的列：

starwars %>% select(ends_with("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

选择列名中包含color字样的列：

starwars %>% select(contains("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

重命名列：

starwars %>% rename(home_world = homeworld)
## # A tibble: 87 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: home_world <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

mutate()新建列

starwars %>% mutate(height_m = height/100, .before = 1)
## # A tibble: 87 x 15
##    height_m name   height  mass hair_color skin_color eye_color birth_year sex  
##       <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1.72 Luke ~    172    77 blond      fair       blue            19   male 
##  2     1.67 C-3PO     167    75 <NA>       gold       yellow         112   none 
##  3     0.96 R2-D2      96    32 <NA>       white, bl~ red             33   none 
##  4     2.02 Darth~    202   136 none       white      yellow          41.9 male 
##  5     1.5  Leia ~    150    49 brown      light      brown           19   fema~
##  6     1.78 Owen ~    178   120 brown, gr~ light      blue            52   male 
##  7     1.65 Beru ~    165    75 brown      light      blue            47   fema~
##  8     0.97 R5-D4      97    32 <NA>       white, red red             NA   none 
##  9     1.83 Biggs~    183    84 black      light      brown           24   male 
## 10     1.82 Obi-W~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

这个效果和上面是一样的，都是把新建的列放在最前面：

starwars %>% 
  mutate(height_m = height/100) %>% 
  select(height_m, everything())
## # A tibble: 87 x 15
##    height_m name   height  mass hair_color skin_color eye_color birth_year sex  
##       <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1.72 Luke ~    172    77 blond      fair       blue            19   male 
##  2     1.67 C-3PO     167    75 <NA>       gold       yellow         112   none 
##  3     0.96 R2-D2      96    32 <NA>       white, bl~ red             33   none 
##  4     2.02 Darth~    202   136 none       white      yellow          41.9 male 
##  5     1.5  Leia ~    150    49 brown      light      brown           19   fema~
##  6     1.78 Owen ~    178   120 brown, gr~ light      blue            52   male 
##  7     1.65 Beru ~    165    75 brown      light      blue            47   fema~
##  8     0.97 R5-D4      97    32 <NA>       white, red red             NA   none 
##  9     1.83 Biggs~    183    84 black      light      brown           24   male 
## 10     1.82 Obi-W~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

新建的列可以直接被使用：

starwars %>%
  mutate(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  ) %>%
  select(BMI, everything())
## # A tibble: 87 x 16
##      BMI name     height  mass hair_color  skin_color eye_color birth_year sex  
##    <dbl> <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr>
##  1  26.0 Luke Sk~    172    77 blond       fair       blue            19   male 
##  2  26.9 C-3PO       167    75 <NA>        gold       yellow         112   none 
##  3  34.7 R2-D2        96    32 <NA>        white, bl~ red             33   none 
##  4  33.3 Darth V~    202   136 none        white      yellow          41.9 male 
##  5  21.8 Leia Or~    150    49 brown       light      brown           19   fema~
##  6  37.9 Owen La~    178   120 brown, grey light      blue            52   male 
##  7  27.5 Beru Wh~    165    75 brown       light      blue            47   fema~
##  8  34.0 R5-D4        97    32 <NA>        white, red red             NA   none 
##  9  25.1 Biggs D~    183    84 black       light      brown           24   male 
## 10  23.2 Obi-Wan~    182    77 auburn, wh~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 7 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>,
## #   height_m <dbl>

只保留新建的列，其他列不要了：

starwars %>%
  transmute(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  )
## # A tibble: 87 x 2
##    height_m   BMI
##       <dbl> <dbl>
##  1     1.72  26.0
##  2     1.67  26.9
##  3     0.96  34.7
##  4     2.02  33.3
##  5     1.5   21.8
##  6     1.78  37.9
##  7     1.65  27.5
##  8     0.97  34.0
##  9     1.83  25.1
## 10     1.82  23.2
## # ... with 77 more rows

relocate()重排列的位置

主要是使用.before和.after参数，控制位置：

starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 x 14
##    name     sex    gender homeworld height  mass hair_color skin_color eye_color
##    <chr>    <chr>  <chr>  <chr>      <int> <dbl> <chr>      <chr>      <chr>    
##  1 Luke Sk~ male   mascu~ Tatooine     172    77 blond      fair       blue     
##  2 C-3PO    none   mascu~ Tatooine     167    75 <NA>       gold       yellow   
##  3 R2-D2    none   mascu~ Naboo         96    32 <NA>       white, bl~ red      
##  4 Darth V~ male   mascu~ Tatooine     202   136 none       white      yellow   
##  5 Leia Or~ female femin~ Alderaan     150    49 brown      light      brown    
##  6 Owen La~ male   mascu~ Tatooine     178   120 brown, gr~ light      blue     
##  7 Beru Wh~ female femin~ Tatooine     165    75 brown      light      blue     
##  8 R5-D4    none   mascu~ Tatooine      97    32 <NA>       white, red red      
##  9 Biggs D~ male   mascu~ Tatooine     183    84 black      light      brown    
## 10 Obi-Wan~ male   mascu~ Stewjon      182    77 auburn, w~ fair       blue-gray
## # ... with 77 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

summarise()汇总

一般来说和group_by()连用才能发挥威力。

starwars %>% summarise(height = mean(height, na.rm = T))
## # A tibble: 1 x 1
##   height
##    <dbl>
## 1   174.

今天主要是对dplyr有一个大致的认识，熟悉下最常见的操作，后面会根据不同的应用场景继续介绍更多的内容。

grouped data

在现实生活中我们经常会遇到非常多需要分组汇总的情况，单个的汇总价值不大，只有分组之后，才能看出差异，才能表现出数据的价值。

dplyr为我们提供了group_by()函数，主要使用group_by()对数据进行分组，然后再进行各种计算，通过和其他操作进行连接，发挥更加强大的作用。

group_by()

先建立2个分组数据进行演示，还是使用星战数据集。

by_species <- starwars %>% group_by(species)
by_sex_gender <- starwars %>% group_by(sex, gender)

看看这两个对象有什么不同，可以看出和原数据集没什么不同，但是都被分组了！

by_species
## # A tibble: 87 x 14
## # Groups:   species [38]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
by_sex_gender
## # A tibble: 87 x 14
## # Groups:   sex, gender [6]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

使用tally()函数进行计数：

by_species %>% tally(sort = T)
## # A tibble: 38 x 2
##    species      n
##    <chr>    <int>
##  1 Human       35
##  2 Droid        6
##  3 <NA>         4
##  4 Gungan       3
##  5 Kaminoan     2
##  6 Mirialan     2
##  7 Twi'lek      2
##  8 Wookiee      2
##  9 Zabrak       2
## 10 Aleena       1
## # ... with 28 more rows

和下面这个操作是一样的效果：

by_species %>% summarise(n=n())
## # A tibble: 38 x 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # ... with 28 more rows

除了根据现有的变量进行分组外，还可以根据现有变量的函数进行分组，这样做类似于先mutate()再group_by()。

bmi_breaks <- c(0,18.5,25,30,Inf)

starwars %>% 
  group_by(bmi_cat = cut(mass/(height/100)^2,breaks = bmi_breaks)) %>% 
  tally(sort = T)
## # A tibble: 5 x 2
##   bmi_cat       n
##   <fct>     <int>
## 1 <NA>         28
## 2 (18.5,25]    24
## 3 (25,30]      13
## 4 (30,Inf]     12
## 5 (0,18.5]     10

是不是很神奇？

查看分组信息

group_keys()查看用于分组的组内有哪些类别，可以看到species有38种：

by_species %>% group_keys() 
## # A tibble: 38 x 1
##    species  
##    <chr>    
##  1 Aleena   
##  2 Besalisk 
##  3 Cerean   
##  4 Chagrian 
##  5 Clawdite 
##  6 Droid    
##  7 Dug      
##  8 Ewok     
##  9 Geonosian
## 10 Gungan   
## # ... with 28 more rows

by_sex_gender %>% group_keys()
## # A tibble: 6 x 2
##   sex            gender   
##   <chr>          <chr>    
## 1 female         feminine 
## 2 hermaphroditic masculine
## 3 male           masculine
## 4 none           feminine 
## 5 none           masculine
## 6 <NA>           <NA>

group_indices()查看每一行属于哪个组：

by_species %>% group_indices() # which group each row belongs to
##  [1] 11  6  6 11 11 11 11  6 11 11 11 11 34 11 24 12 11 11 36 11 11  6 31 11 11
## [26] 18 11 11  8 26 11 21 11 10 10 10 38 30  7 38 11 37 32 32 33 35 29 11  3 20
## [51] 37 27 13 23 16  4 11 11 11  9 17 17 11 11 11 11  5  2 15 15 11  1  6 25 19
## [76] 28 14 34 11 38 22 11 11 11  6 38 11

group_rows()查看每个组包括哪些行：

by_species %>% group_rows() # which rows each group contains with
## <list_of<integer>[38]>
## [[1]]
## [1] 72
## 
## [[2]]
## [1] 68
## 
## [[3]]
## [1] 49
## 
## [[4]]
## [1] 56
## 
## [[5]]
## [1] 67
## 
## [[6]]
## [1]  2  3  8 22 73 85
## 
## [[7]]
## [1] 39
## 
## [[8]]
## [1] 29
## 
## [[9]]
## [1] 60
## 
## [[10]]
## [1] 34 35 36
## 
## [[11]]
##  [1]  1  4  5  6  7  9 10 11 12 14 17 18 20 21 24 25 27 28 31 33 41 48 57 58 59
## [26] 63 64 65 66 71 79 82 83 84 87
## 
## [[12]]
## [1] 16
## 
## [[13]]
## [1] 53
## 
## [[14]]
## [1] 77
## 
## [[15]]
## [1] 69 70
## 
## [[16]]
## [1] 55
## 
## [[17]]
## [1] 61 62
## 
## [[18]]
## [1] 26
## 
## [[19]]
## [1] 75
## 
## [[20]]
## [1] 50
## 
## [[21]]
## [1] 32
## 
## [[22]]
## [1] 81
## 
## [[23]]
## [1] 54
## 
## [[24]]
## [1] 15
## 
## [[25]]
## [1] 74
## 
## [[26]]
## [1] 30
## 
## [[27]]
## [1] 52
## 
## [[28]]
## [1] 76
## 
## [[29]]
## [1] 47
## 
## [[30]]
## [1] 38
## 
## [[31]]
## [1] 23
## 
## [[32]]
## [1] 43 44
## 
## [[33]]
## [1] 45
## 
## [[34]]
## [1] 13 78
## 
## [[35]]
## [1] 46
## 
## [[36]]
## [1] 19
## 
## [[37]]
## [1] 42 51
## 
## [[38]]
## [1] 37 40 80 86

group_vars()查看用于聚合的变量名字：

by_sex_gender %>% group_vars() # the name of the grouping variable
## [1] "sex"    "gender"

增加或改变用于聚合的变量

如果把group_by()作用于已经聚合的变量，那数据会被覆盖，比如下面这个，by_species已经被species聚合了，再通过homeworld聚合，那结果只是homeworld的结果：

by_species %>% 
  group_by(homeworld) %>% 
  tally()
## # A tibble: 49 x 2
##    homeworld          n
##    <chr>          <int>
##  1 Alderaan           3
##  2 Aleen Minor        1
##  3 Bespin             1
##  4 Bestine IV         1
##  5 Cato Neimoidia     1
##  6 Cerea              1
##  7 Champala           1
##  8 Chandrila          1
##  9 Concord Dawn       1
## 10 Corellia           2
## # ... with 39 more rows

是不是之前没注意过这些小问题？

通过使用一个参数可以避免这个问题：

by_species %>% 
  group_by(homeworld, .add = T) %>% 
  tally()
## # A tibble: 58 x 3
## # Groups:   species [38]
##    species  homeworld       n
##    <chr>    <chr>       <int>
##  1 Aleena   Aleen Minor     1
##  2 Besalisk Ojom            1
##  3 Cerean   Cerea           1
##  4 Chagrian Champala        1
##  5 Clawdite Zolan           1
##  6 Droid    Naboo           1
##  7 Droid    Tatooine        2
##  8 Droid    <NA>            3
##  9 Dug      Malastare       1
## 10 Ewok     Endor           1
## # ... with 48 more rows

移除聚合的变量

一个被聚合的数据如果不解除聚合，那么后面的操作都会以聚合后的结果呈现出来，所以聚合之后一定要记得解除聚合！

by_species %>% 
  ungroup() %>% 
  tally()
## # A tibble: 1 x 1
##       n
##   <int>
## 1    87

by_sex_gender %>% 
  ungroup(sex) %>% 
  tally()
## # A tibble: 3 x 2
##   gender        n
##   <chr>     <int>
## 1 feminine     17
## 2 masculine    66
## 3 <NA>          4

联合使用

下面这部分主要介绍group_by和其他函数的联合使用：

summarise()

by_species %>%
  summarise(
    n = n(),
    height = mean(height, na.rm = TRUE)
  )
## # A tibble: 38 x 3
##    species       n height
##    <chr>     <int>  <dbl>
##  1 Aleena        1    79 
##  2 Besalisk      1   198 
##  3 Cerean        1   198 
##  4 Chagrian      1   196 
##  5 Clawdite      1   168 
##  6 Droid         6   131.
##  7 Dug           1   112 
##  8 Ewok          1    88 
##  9 Geonosian     1   183 
## 10 Gungan        3   209.
## # ... with 28 more rows

control the grouping variables

通过.groups参数控制聚合变量：

by_sex_gender %>% 
  summarise(n = n()) %>% 
  group_vars()
## `summarise()` has grouped output by 'sex'. You can override using the
## `.groups` argument.
## [1] "sex"

# 只通过sex进行聚合
by_sex_gender %>% 
  summarise(n = n(), .groups = "drop_last") %>% 
  group_vars()
## [1] "sex"

by_sex_gender %>% 
  summarise(n = n(), .groups = "keep") %>% 
  group_vars()
## [1] "sex"    "gender"

# 不聚合了
by_sex_gender %>% 
  summarise(n = n(), .groups = "drop") %>% 
  group_vars()
## character(0)

`select()`/`rename()`/`relocate()`

by_species %>% select(mass) # grouped by species
## Adding missing grouping variables: `species`
## # A tibble: 87 x 2
## # Groups:   species [38]
##    species  mass
##    <chr>   <dbl>
##  1 Human      77
##  2 Droid      75
##  3 Droid      32
##  4 Human     136
##  5 Human      49
##  6 Human     120
##  7 Human      75
##  8 Droid      32
##  9 Human      84
## 10 Human      77
## # ... with 77 more rows

by_species %>% 
  ungroup() %>% 
  select(mass)
## # A tibble: 87 x 1
##     mass
##    <dbl>
##  1    77
##  2    75
##  3    32
##  4   136
##  5    49
##  6   120
##  7    75
##  8    32
##  9    84
## 10    77
## # ... with 77 more rows

arrange()

by_species %>% 
  arrange(desc(mass)) %>% 
  relocate(species, mass)
## # A tibble: 87 x 14
## # Groups:   species [38]
##    species   mass name   height hair_color skin_color eye_color birth_year sex  
##    <chr>    <dbl> <chr>   <int> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Hutt      1358 Jabba~    175 <NA>       green-tan~ orange         600   herm~
##  2 Kaleesh    159 Griev~    216 none       brown, wh~ green, y~       NA   male 
##  3 Droid      140 IG-88     200 none       metal      red             15   none 
##  4 Human      136 Darth~    202 none       white      yellow          41.9 male 
##  5 Wookiee    136 Tarff~    234 brown      brown      blue            NA   male 
##  6 Human      120 Owen ~    178 brown, gr~ light      blue            52   male 
##  7 Trandos~   113 Bossk     190 none       green      red             53   male 
##  8 Wookiee    112 Chewb~    228 brown      unknown    blue           200   male 
##  9 Human      110 Jek T~    180 brown      fair       blue            NA   male 
## 10 Besalisk   102 Dexte~    198 none       brown      yellow          NA   male 
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## #   films <list>, vehicles <list>, starships <list>

通过.by_group参数控制进行排序的先后位置，下面这个例子就是先根据species进行排序，再根据mass进行排序，和上面的不一样哦！

by_species %>% 
  arrange(desc(mass), .by_group = T) %>% 
  relocate(species, mass)
## # A tibble: 87 x 14
## # Groups:   species [38]
##    species   mass name   height hair_color skin_color eye_color birth_year sex  
##    <chr>    <dbl> <chr>   <int> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Aleena      15 Ratts~     79 none       grey, blue unknown           NA male 
##  2 Besalisk   102 Dexte~    198 none       brown      yellow            NA male 
##  3 Cerean      82 Ki-Ad~    198 white      pale       yellow            92 male 
##  4 Chagrian    NA Mas A~    196 none       blue       blue              NA male 
##  5 Clawdite    55 Zam W~    168 blonde     fair, gre~ yellow            NA fema~
##  6 Droid      140 IG-88     200 none       metal      red               15 none 
##  7 Droid       75 C-3PO     167 <NA>       gold       yellow           112 none 
##  8 Droid       32 R2-D2      96 <NA>       white, bl~ red               33 none 
##  9 Droid       32 R5-D4      97 <NA>       white, red red               NA none 
## 10 Droid       NA R4-P17     96 none       silver, r~ red, blue         NA none 
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## #   films <list>, vehicles <list>, starships <list>

`muatate()` and `transmutate()`

starwars %>% 
  select(name, homeworld, mass) %>% 
  group_by(homeworld) %>% 
  mutate(means = mean(mass, na.rm = T), 
         standard_mass = mass - mean(mass, na.rm = T))
## # A tibble: 87 x 5
## # Groups:   homeworld [49]
##    name               homeworld  mass means standard_mass
##    <chr>              <chr>     <dbl> <dbl>         <dbl>
##  1 Luke Skywalker     Tatooine     77  85.4         -8.38
##  2 C-3PO              Tatooine     75  85.4        -10.4 
##  3 R2-D2              Naboo        32  64.2        -32.2 
##  4 Darth Vader        Tatooine    136  85.4         50.6 
##  5 Leia Organa        Alderaan     49  64          -15   
##  6 Owen Lars          Tatooine    120  85.4         34.6 
##  7 Beru Whitesun lars Tatooine     75  85.4        -10.4 
##  8 R5-D4              Tatooine     32  85.4        -53.4 
##  9 Biggs Darklighter  Tatooine     84  85.4         -1.38
## 10 Obi-Wan Kenobi     Stewjon      77  77            0   
## # ... with 77 more rows

min_rank()函数返回顺序（秩次）：

# Overall rank
starwars %>% 
  select(name, homeworld, height) %>% 
  mutate(rank = min_rank(height))
## # A tibble: 87 x 4
##    name               homeworld height  rank
##    <chr>              <chr>      <int> <int>
##  1 Luke Skywalker     Tatooine     172    29
##  2 C-3PO              Tatooine     167    21
##  3 R2-D2              Naboo         96     5
##  4 Darth Vader        Tatooine     202    72
##  5 Leia Organa        Alderaan     150    11
##  6 Owen Lars          Tatooine     178    35
##  7 Beru Whitesun lars Tatooine     165    17
##  8 R5-D4              Tatooine      97     7
##  9 Biggs Darklighter  Tatooine     183    45
## 10 Obi-Wan Kenobi     Stewjon      182    44
## # ... with 77 more rows

先根据homeworld进行分组，再新建列：

# Rank per homeworld
starwars %>% 
  select(name, homeworld, height) %>% 
  group_by(homeworld) %>% 
  mutate(rank = min_rank(height))
## # A tibble: 87 x 4
## # Groups:   homeworld [49]
##    name               homeworld height  rank
##    <chr>              <chr>      <int> <int>
##  1 Luke Skywalker     Tatooine     172     5
##  2 C-3PO              Tatooine     167     4
##  3 R2-D2              Naboo         96     1
##  4 Darth Vader        Tatooine     202    10
##  5 Leia Organa        Alderaan     150     1
##  6 Owen Lars          Tatooine     178     6
##  7 Beru Whitesun lars Tatooine     165     3
##  8 R5-D4              Tatooine      97     1
##  9 Biggs Darklighter  Tatooine     183     7
## 10 Obi-Wan Kenobi     Stewjon      182     1
## # ... with 77 more rows

filter()

筛选每个物种（species）中最高（height）的那一个：

by_species %>%
  select(name, species, height) %>% 
  filter(height == max(height))
## # A tibble: 35 x 3
## # Groups:   species [35]
##    name                  species        height
##    <chr>                 <chr>           <int>
##  1 Greedo                Rodian            173
##  2 Jabba Desilijic Tiure Hutt              175
##  3 Yoda                  Yoda's species     66
##  4 Bossk                 Trandoshan        190
##  5 Ackbar                Mon Calamari      180
##  6 Wicket Systri Warrick Ewok               88
##  7 Nien Nunb             Sullustan         160
##  8 Nute Gunray           Neimodian         191
##  9 Roos Tarpals          Gungan            224
## 10 Watto                 Toydarian         137
## # ... with 25 more rows

去掉只有1个成员的物种：

by_species %>%
  filter(n() != 1) %>% 
  tally()
## # A tibble: 9 x 2
##   species      n
##   <chr>    <int>
## 1 Droid        6
## 2 Gungan       3
## 3 Human       35
## 4 Kaminoan     2
## 5 Mirialan     2
## 6 Twi'lek      2
## 7 Wookiee      2
## 8 Zabrak       2
## 9 <NA>         4

Computing on grouping information

在dplyr verbs内部，可以使用带有cur前缀的函数族访问当前组的各种属性。

cur_data()

cur_data() returns the current group, excluding grouping variables. It’s useful to feed to functions that take a whole data frame. For example, the following code fits a linear model of mass ~ height to each species:

by_species %>%
  filter(n() > 1) %>% 
  mutate(mod = list(lm(mass ~ height, data = cur_data())))
## # A tibble: 58 x 15
## # Groups:   species [9]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 48 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, mod <list>

cur_group() and cur_group_id()

by_species %>%
  arrange(species) %>% 
  select(name, species, homeworld) %>% 
  mutate(id = cur_group_id())
## # A tibble: 87 x 4
## # Groups:   species [38]
##    name            species  homeworld      id
##    <chr>           <chr>    <chr>       <int>
##  1 Ratts Tyerell   Aleena   Aleen Minor     1
##  2 Dexter Jettster Besalisk Ojom            2
##  3 Ki-Adi-Mundi    Cerean   Cerea           3
##  4 Mas Amedda      Chagrian Champala        4
##  5 Zam Wesell      Clawdite Zolan           5
##  6 C-3PO           Droid    Tatooine        6
##  7 R2-D2           Droid    Naboo           6
##  8 R5-D4           Droid    Tatooine        6
##  9 IG-88           Droid    <NA>            6
## 10 R4-P17          Droid    <NA>            6
## # ... with 77 more rows

two-table verbs

可以参考R数据科学这本书中的介绍，非常详细。

处理两个数据集的函数。

根据另一个表的变量新建数据
根据另一个表筛选

合并连接

内连接
- inner_join()
外连接
- 左连接left_join()：保留 x 中的所有观测。
- 右连接right_join()：保留 y 中的所有观测
- 全连接full_join()：保留 x 和 y 中的所有观测。

筛选连接

semi_join(x, y)：保留x表中与y表中的观测相匹配的所有观测
anti_join(x, y)：丢弃x表中与y表中的观测相匹配的所有观测

集合操作

intersect(x, y)：返回既在 x 表，又在 y 表中的观测
union(x, y)：返回 x 表或 y 表中的唯一观测
setdiff(x, y)：返回在 x 表，但不在 y 表中的观测

合并连接

library(nycflights13)
library(dplyr)
# 选择部分数据方便演示
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
glimpse(flights2)
## Rows: 336,776
## Columns: 8
## $ year    <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20~
## $ month   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ day     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ hour    <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6,~
## $ origin  <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK",~
## $ dest    <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO",~
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N~
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B~

glimpse(airlines)
## Rows: 16
## Columns: 2
## $ carrier <chr> "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "MQ", "O~
## $ name    <chr> "Endeavor Air Inc.", "American Airlines Inc.", "Alaska Airline~

有一列共有的，carrier

flights2 %>% 
  left_join(airlines)
## Joining, by = "carrier"
## # A tibble: 336,776 x 9
##     year month   day  hour origin dest  tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

by = NULL是默认。

glimpse(weather)
## Rows: 26,115
## Columns: 15
## $ origin     <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EW~
## $ year       <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,~
## $ month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ day        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ hour       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, ~
## $ temp       <dbl> 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39.~
## $ dewp       <dbl> 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28.~
## $ humid      <dbl> 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62.~
## $ wind_dir   <dbl> 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330,~
## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780, ~
## $ wind_gust  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.~
## $ precip     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ pressure   <dbl> 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 101~
## $ visib      <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,~
## $ time_hour  <dttm> 2013-01-01 01:00:00, 2013-01-01 02:00:00, 2013-01-01 03:00~

flights2 %>% left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

glimpse(planes)
## Rows: 3,322
## Columns: 9
## $ tailnum      <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW~
## $ year         <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199~
## $ type         <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi~
## $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU~
## $ model        <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145~
## $ engines      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ seats        <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5~
## $ speed        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ engine       <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb~

flights2 %>% left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type            
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult~
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult~
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult~
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult~
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult~
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult~
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult~
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult~
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult~
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

glimpse(airports)
## Rows: 1,458
## Columns: 8
## $ faa   <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", "~
## $ name  <chr> "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbur~
## $ lat   <dbl> 41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.37122, 41.4~
## $ lon   <dbl> -80.61958, -85.68003, -88.10124, -74.39156, -81.42778, -82.17342~
## $ alt   <dbl> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 10~
## $ tz    <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5, ~
## $ dst   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A",~
## $ tzone <chr> "America/New_York", "America/Chicago", "America/Chicago", "Ameri~

如果两个数据集中列名不一样也可以进行合并：

flights2 %>% left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg~  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg~  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami~  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts~  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica~  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort ~  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi~  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan~  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica~  42.0 -87.9   668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

下面是一个简单的左连接的例子：

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")

df1
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     2
## 2     2     1
df2
## # A tibble: 2 x 3
##       x     a b    
##   <dbl> <dbl> <chr>
## 1     3    10 a    
## 2     1    10 a

df1 %>% inner_join(df2)
## Joining, by = "x"
## # A tibble: 1 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a

df1 %>% left_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA>

右连接：

df1 %>% right_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     3    NA    10 a

df2 %>% left_join(df1)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     a b         y
##   <dbl> <dbl> <chr> <int>
## 1     3    10 a        NA
## 2     1    10 a         2

全连接：

df1 %>% full_join(df2)
## Joining, by = "x"
## # A tibble: 3 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA> 
## 3     3    NA    10 a

筛选连接

df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))

df1
## # A tibble: 4 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     3     3
## 4     4     4
df2
## # A tibble: 3 x 2
##       x z    
##   <dbl> <chr>
## 1     1 a    
## 2     1 b    
## 3     2 a

df1 %>% nrow()
## [1] 4

df1 %>% inner_join(df2, by = "x")
## # A tibble: 4 x 3
##       x     y z    
##   <dbl> <int> <chr>
## 1     1     1 a    
## 2     1     1 b    
## 3     1     2 a    
## 4     1     2 b

df1 %>% semi_join(df2, by = "x")
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2

df1 %>% anti_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     3     3
## 2     4     4

集合操作

(df1 <- tibble(x = 1:2, y = c(1L, 1L)))
## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     1
(df2 <- tibble(x = 1:2, y = 1:2))
## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     2

intersect(df1, df2) # 取交集
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     1     1

union(df1, df2) # 并集
## # A tibble: 3 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     1
## 3     2     2

setdiff(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     2     1
setdiff(df2, df1)
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     2     2

下面是一些集合操作的示意图：

column-wise operations

主要是介绍across函数的用法，这是dplyr1.0才出来的一个函数，大大简化了代码

可用于对多列做同一个操作。

library(dplyr, warn.conflicts = FALSE)

across()有两个基本参数：

.cols：选择你想操作的列
.fn：你想进行的操作，可以使一个函数或者多个函数组成的列表

可以替代_if()，at_()，all_()

starwars %>% 
  summarise(across(where(is.character), n_distinct))
## # A tibble: 1 x 8
##    name hair_color skin_color eye_color   sex gender homeworld species
##   <int>      <int>      <int>     <int> <int>  <int>     <int>   <int>
## 1    87         13         31        15     5      3        49      38

可以直接写列名：

starwars %>% 
  group_by(species) %>% 
  filter(n() > 1) %>% 
  summarise(across(c(sex, gender, homeworld), n_distinct))
## # A tibble: 9 x 4
##   species    sex gender homeworld
##   <chr>    <int>  <int>     <int>
## 1 Droid        1      2         3
## 2 Gungan       1      1         1
## 3 Human        2      2        16
## 4 Kaminoan     2      2         1
## 5 Mirialan     1      1         1
## 6 Twi'lek      2      2         1
## 7 Wookiee      1      1         1
## 8 Zabrak       1      1         2
## 9 <NA>         1      1         3

也可以和where函数连用，省时省力：

starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
## # A tibble: 10 x 4
##    homeworld height  mass birth_year
##    <chr>      <dbl> <dbl>      <dbl>
##  1 Alderaan    176.  64         43  
##  2 Corellia    175   78.5       25  
##  3 Coruscant   174.  50         91  
##  4 Kamino      208.  83.1       31.5
##  5 Kashyyyk    231  124        200  
##  6 Mirial      168   53.1       49  
##  7 Naboo       175.  64.2       55  
##  8 Ryloth      179   55         48  
##  9 Tatooine    170.  85.4       54.6
## 10 <NA>        139.  82        334.

如果没有缺失值，可以直接写mean，

library(tidyr)
starwars %>% drop_na() %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), mean))
## # A tibble: 4 x 4
##   homeworld height  mass birth_year
##   <chr>      <dbl> <dbl>      <dbl>
## 1 Corellia    175   78.5       25  
## 2 Mirial      168   53.1       49  
## 3 Naboo       177   62         60  
## 4 Tatooine    181.  96         37.6

acorss支持多个函数同时使用，只要放入列表中即可：

min_max <- list(
  min = ~min(.x, na.rm = TRUE), 
  max = ~max(.x, na.rm = TRUE)
)
starwars %>% summarise(across(where(is.numeric), min_max))
## # A tibble: 1 x 6
##   height_min height_max mass_min mass_max birth_year_min birth_year_max
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896
starwars %>% summarise(across(c(height, mass, birth_year), min_max))
## # A tibble: 1 x 6
##   height_min height_max mass_min mass_max birth_year_min birth_year_max
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

当然也是支持glue的：

starwars %>% summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))
## # A tibble: 1 x 6
##   min.height max.height min.mass max.mass min.birth_year max.birth_year
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

starwars %>% summarise(across(c(height, mass, birth_year), min_max, .names = "{.fn}.{.col}"))
## # A tibble: 1 x 6
##   min.height max.height min.mass max.mass min.birth_year max.birth_year
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

分开写也是可以的：

starwars %>% summarise(
  across(c(height, mass, birth_year), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
  across(c(height, mass, birth_year), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
)
## # A tibble: 1 x 6
##   min_height min_mass min_birth_year max_height max_mass max_birth_year
##        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
## 1         66       15              8        264     1358            896

这种情况不能使用where(is.numeric)，因为第2个across会使用新创建的列（“min_height”, “min_mass” and “min_birth_year”）。

可以放在tibble里解决：

starwars %>% summarise(
  tibble(
    across(where(is.numeric), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
    across(where(is.numeric), ~max(.x, na.rm = TRUE), .names = "max_{.col}")  
  )
)
## # A tibble: 1 x 6
##   min_height min_mass min_birth_year max_height max_mass max_birth_year
##        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
## 1         66       15              8        264     1358            896

陷阱

在使用where(is.numeric)，要注意下面这种情况：

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

df %>% 
  summarise(n = n(), across(where(is.numeric), sd))
##    n x        y
## 1 NA 1 4.041452

n这里是3，是一个常数，所以它的sd变成了NA，可以通过换一下顺序解决：

df %>% summarise(across(where(is.numeric), sd),
                 n = n()
                 )
##   x        y n
## 1 1 4.041452 3

或者通过下面两种方法解决：

df %>% 
  summarise(n = n(), across(where(is.numeric) & !n, sd))
##   n x        y
## 1 3 1 4.041452

df %>% 
  summarise(
    tibble(n = n(), across(where(is.numeric), sd))
  )
##   n x        y
## 1 3 1 4.041452

across其他连用

还可以和group_by()/count()/distinct()连用。

和filter()连用

across()不能直接和filter()连用，和filter()连用的是if_any()和if_all()。

if_any()：任何一列满足条件即可
if_all()：所有列都要满足条件

starwars %>% 
  filter(if_any(everything(), ~ !is.na(.x)))
## # A tibble: 87 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

starwars %>% 
  filter(if_all(everything(), ~ !is.na(.x)))
## # A tibble: 29 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  3 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  4 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  5 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  6 Biggs ~    183    84 black       light      brown           24   male  mascu~
##  7 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
##  8 Anakin~    188    84 blond       fair       blue            41.9 male  mascu~
##  9 Chewba~    228   112 brown       unknown    blue           200   male  mascu~
## 10 Han So~    180    80 brown       fair       brown           29   male  mascu~
## # ... with 19 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

row-wide operations

在tidyverse中，整洁数据一般都是每一行是一个观测，每一列是一个变量，基本上所有操作都是基于整洁的数据进行的，都是对某列做什么操作。但有时候我们也需要对某行做一些操作，dplyr中现在提供了rowwise()函数快速执行对行的操作。

简介

library(dplyr, warn.conflicts = FALSE)

rowwise()和group_by()很像，本身不做任何操作，但是使用了rowwise之后，再和mutate()等函数连用时，就会变成按照行进行操作！

df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% rowwise()
## # A tibble: 2 x 3
## # Rowwise: 
##       x     y     z
##   <int> <int> <int>
## 1     1     3     5
## 2     2     4     6

假如你想分别计算每行的均值（只是一个例子），不使用rowwise()函数，得到的结果是所有数据的均值，很明显不是想要的：

df %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5   3.5
## 2     2     4     6   3.5

使用rowwise()之后，神奇的事情发生了，变成了按行操作！

df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
## # Rowwise: 
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5     3
## 2     2     4     6     4

df <- tibble(name = c("Mara", "Hadley"), x = 1:2, y = 3:4, z = 5:6)
df
## # A tibble: 2 x 4
##   name       x     y     z
##   <chr>  <int> <int> <int>
## 1 Mara       1     3     5
## 2 Hadley     2     4     6

按照行计算均值：

df %>% 
  rowwise() %>% 
  summarise(m = mean(c(x, y, z)))
## # A tibble: 2 x 1
##       m
##   <dbl>
## 1     3
## 2     4

根据name这一列按照行计算均值：

df %>% 
  rowwise(name) %>% 
  summarise(m = mean(c(x, y, z)))
## `summarise()` has grouped output by 'name'. You can override using
## the `.groups` argument.
## # A tibble: 2 x 2
## # Groups:   name [2]
##   name       m
##   <chr>  <dbl>
## 1 Mara       3
## 2 Hadley     4

rowwise()可以看做是group_by()的特殊形式，本身也是对数据先进行聚合操作，所以如果要解除聚合，也要使用ungroup()函数。

对行进行汇总统计

df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
## # A tibble: 6 x 5
##      id     w     x     y     z
##   <int> <int> <int> <int> <int>
## 1     1    10    20    30    40
## 2     2    11    21    31    41
## 3     3    12    22    32    42
## 4     4    13    23    33    43
## 5     5    14    24    34    44
## 6     6    15    25    35    45

接下来要进行按行操作了！

rf <- df %>% rowwise(id)

计算加和：

rf %>% mutate(total = sum(c(w, x, y, z)))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120

rf %>% summarise(total = sum(c(w, x, y, z)))
## `summarise()` has grouped output by 'id'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 2
## # Groups:   id [6]
##      id total
##   <int> <int>
## 1     1   100
## 2     2   104
## 3     3   108
## 4     4   112
## 5     5   116
## 6     6   120

rf %>% mutate(total = sum(c_across(w:z)))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120

rf %>% mutate(total = sum(c_across(where(is.numeric))))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120

可以和列操作联合使用：

rf %>% 
  mutate(total = sum(c_across(w:z))) %>% 
  ungroup() %>% 
  mutate(across(w:z, ~ . / total))
## # A tibble: 6 x 6
##      id     w     x     y     z total
##   <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1     1 0.1   0.2   0.3   0.4     100
## 2     2 0.106 0.202 0.298 0.394   104
## 3     3 0.111 0.204 0.296 0.389   108
## 4     4 0.116 0.205 0.295 0.384   112
## 5     5 0.121 0.207 0.293 0.379   116
## 6     6 0.125 0.208 0.292 0.375   120

可以和``rowSums()函数和rowMeans()`等函数联合使用。

list columns

motivation

df <- tibble(
  x = list(1, 2:3, 4:6)
)

df
## # A tibble: 3 x 1
##   x        
##   <list>   
## 1 <dbl [1]>
## 2 <int [2]>
## 3 <int [3]>

df %>% mutate(l = length(x))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     3
## 2 <int [2]>     3
## 3 <int [3]>     3

df %>% mutate(l = lengths(x))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

df %>% mutate(l = sapply(x, length))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3
df %>% mutate(l = purrr::map_int(x, length))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

df %>% 
  rowwise() %>% 
  mutate(l = length(x))
## # A tibble: 3 x 2
## # Rowwise: 
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

subsetting

df <- tibble(g = 1:2, y = list(1:3, "a"))
gf <- df %>% group_by(g)
rf <- df %>% rowwise(g)

gf %>% mutate(type = typeof(y), length = length(y))
## # A tibble: 2 x 4
## # Groups:   g [2]
##       g y         type  length
##   <int> <list>    <chr>  <int>
## 1     1 <int [3]> list       1
## 2     2 <chr [1]> list       1
rf %>% mutate(type = typeof(y), length = length(y))
## # A tibble: 2 x 4
## # Rowwise:  g
##       g y         type      length
##   <int> <list>    <chr>      <int>
## 1     1 <int [3]> integer        3
## 2     2 <chr [1]> character      1

# grouped
out1 <- integer(2)
for (i in 1:2) {
  out1[[i]] <- length(df$y[i])
}
out1
## [1] 1 1

# rowwise
out2 <- integer(2)
for (i in 1:2) {
  out2[[i]] <- length(df$y[[i]])
}
out2
## [1] 3 1

gf %>% mutate(y2 = y)
## # A tibble: 2 x 3
## # Groups:   g [2]
##       g y         y2       
##   <int> <list>    <list>   
## 1     1 <int [3]> <int [3]>
## 2     2 <chr [1]> <chr [1]>
rf %>% mutate(y2 = y)
## Error in `mutate()`:
## ! Problem while computing `y2 = y`.
## x `y2` must be size 1, not 3.
## i Did you mean: `y2 = list(y)` ?
## i The error occurred in row 1.
rf %>% mutate(y2 = list(y))
## # A tibble: 2 x 3
## # Rowwise:  g
##       g y         y2       
##   <int> <list>    <list>   
## 1     1 <int [3]> <int [3]>
## 2     2 <chr [1]> <chr [1]>

modeling

by_cyl <- mtcars %>% nest_by(cyl)
by_cyl
## # A tibble: 3 x 2
## # Rowwise:  cyl
##     cyl                data
##   <dbl> <list<tibble[,10]>>
## 1     4           [11 x 10]
## 2     6            [7 x 10]
## 3     8           [14 x 10]

mods <- by_cyl %>% mutate(mod = list(lm(mpg ~ wt, data = data)))
mods
## # A tibble: 3 x 3
## # Rowwise:  cyl
##     cyl                data mod   
##   <dbl> <list<tibble[,10]>> <list>
## 1     4           [11 x 10] <lm>  
## 2     6            [7 x 10] <lm>  
## 3     8           [14 x 10] <lm>

mods <- mods %>% mutate(pred = list(predict(mod, data)))
mods
## # A tibble: 3 x 4
## # Rowwise:  cyl
##     cyl                data mod    pred      
##   <dbl> <list<tibble[,10]>> <list> <list>    
## 1     4           [11 x 10] <lm>   <dbl [11]>
## 2     6            [7 x 10] <lm>   <dbl [7]> 
## 3     8           [14 x 10] <lm>   <dbl [14]>

mods %>% summarise(rmse = sqrt(mean((pred - data$mpg) ^ 2)))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl  rmse
##   <dbl> <dbl>
## 1     4 3.01 
## 2     6 0.985
## 3     8 1.87
mods %>% summarise(rsq = summary(mod)$r.squared)
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl   rsq
##   <dbl> <dbl>
## 1     4 0.509
## 2     6 0.465
## 3     8 0.423
mods %>% summarise(broom::glance(mod))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 13
## # Groups:   cyl [3]
##     cyl r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##   <dbl>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     4     0.509         0.454  3.33      9.32  0.0137     1 -27.7   61.5  62.7
## 2     6     0.465         0.357  1.17      4.34  0.0918     1  -9.83  25.7  25.5
## 3     8     0.423         0.375  2.02      8.80  0.0118     1 -28.7   63.3  65.2
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

mods %>% summarise(broom::tidy(mod))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic    p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1     4 (Intercept)    39.6      4.35       9.10 0.00000777
## 2     4 wt             -5.65     1.85      -3.05 0.0137    
## 3     6 (Intercept)    28.4      4.18       6.79 0.00105   
## 4     6 wt             -2.78     1.33      -2.08 0.0918    
## 5     8 (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 wt             -2.19     0.739     -2.97 0.0118

repeated function calls

simulations

df <- tribble(
  ~ n, ~ min, ~ max,
    1,     0,     1,
    2,    10,   100,
    3,   100,  1000,
)

df %>% 
  rowwise() %>% 
  mutate(data = list(runif(n, min, max)))
## # A tibble: 3 x 4
## # Rowwise: 
##       n   min   max data     
##   <dbl> <dbl> <dbl> <list>   
## 1     1     0     1 <dbl [1]>
## 2     2    10   100 <dbl [2]>
## 3     3   100  1000 <dbl [3]>

df %>% 
  rowwise() %>% 
  mutate(data = runif(n, min, max))
## Error in `mutate()`:
## ! Problem while computing `data = runif(n, min, max)`.
## x `data` must be size 1, not 2.
## i Did you mean: `data = list(runif(n, min, max))` ?
## i The error occurred in row 2.

multiple combinations

df <- expand.grid(mean = c(-1, 0, 1), sd = c(1, 10, 100))

df %>% 
  rowwise() %>% 
  mutate(data = list(rnorm(10, mean, sd)))
## # A tibble: 9 x 3
## # Rowwise: 
##    mean    sd data      
##   <dbl> <dbl> <list>    
## 1    -1     1 <dbl [10]>
## 2     0     1 <dbl [10]>
## 3     1     1 <dbl [10]>
## 4    -1    10 <dbl [10]>
## 5     0    10 <dbl [10]>
## 6     1    10 <dbl [10]>
## 7    -1   100 <dbl [10]>
## 8     0   100 <dbl [10]>
## 9     1   100 <dbl [10]>

varying functions

df <- tribble(
   ~rng,     ~params,
   "runif",  list(n = 10), 
   "rnorm",  list(n = 20),
   "rpois",  list(n = 10, lambda = 5),
) %>%
  rowwise()

df %>% 
  mutate(data = list(do.call(rng, params)))
## # A tibble: 3 x 3
## # Rowwise: 
##   rng   params           data      
##   <chr> <list>           <list>    
## 1 runif <named list [1]> <dbl [10]>
## 2 rnorm <named list [1]> <dbl [20]>
## 3 rpois <named list [2]> <int [10]>