Article link: https://blog.csdn.net/weixin_72071150/article/details/136221043

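R Language Data Analysis (Part 4)

Preface

In the previous section we covered data visualization tools. In practice, however, data rarely arrives in exactly the form you need; you usually have to create new variables or summarize existing ones before you can use it. This section introduces methods for this kind of data transformation. We use functions from the dplyr package, which is a member of the tidyverse. The demonstration data comes from nycflights13, so please load the relevant packages beforehand.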

library(nycflights13)
library(tidyverse)

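1. `dplyr` Basics

This section covers the main dplyr functions, which handle the majority of data manipulation tasks. They have a few things in common:

The first argument is always a data frame.
The subsequent arguments typically use variable names (without quotes) to describe which columns to operate on.
The output is always a new data frame.

Because each function does one thing well, solving a complex problem often means combining several functions. We do that with the pipe.

R has two pipe operators: %>% and |>. %>% comes from the magrittr package (and is also available once the tidyverse is loaded), while |> is the native pipe added in R 4.1. In RStudio, Ctrl+Shift+M inserts a pipe, and you can choose in the options which of the two it inserts by default.

dplyr functions are organized into four groups according to what they operate on: rows, columns, groups, or tables.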
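
To make the idea of the pipe concrete, here is a minimal sketch using base R only. The vector `x` and the result names `nested` and `piped` are made up for illustration. It shows that `x |> f(y)` is simply another way of writing `f(x, y)`.

```r
# Base R only; the native pipe |> needs R 4.1 or later.
x <- c(1, 4, 9, 16)  # illustrative data, not from the article

# Nested form: has to be read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: reads left to right, one step per line.
# Each step receives the previous result as its first argument.
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

Both forms compute the same value; the piped version just makes the order of the steps easier to follow.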

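2. Rows

The most important row functions are filter(), which changes which rows are present without changing their order, and arrange(), which changes the order of the rows without changing which rows are present. There is also distinct(), which finds rows with unique values; unlike the other two, it can optionally operate on columns as well.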

2.1 `filter()`

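filter() keeps rows based on the values of the columns. The first argument is the data frame; the second and subsequent arguments are logical conditions, and only the rows for which they evaluate to TRUE are kept.

nycflights13::flights is a tibble, a special kind of data frame used by the tidyverse. It contains information on all 336,776 flights that departed from New York City in 2013.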

flights
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

We can use filter() to find all flights that departed more than 120 minutes (two hours) late:

flights |> 
  filter(dep_delay > 120)
#> # A tibble: 9,723 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      848           1835       853     1001           1950
#>  2  2013     1     1      957            733       144     1056            853
#>  3  2013     1     1     1114            900       134     1447           1222
#>  4  2013     1     1     1540           1338       122     2020           1825
#>  5  2013     1     1     1815           1325       290     2120           1542
#>  6  2013     1     1     1842           1422       260     1958           1535
#>  7  2013     1     1     1856           1645       131     2212           2005
#>  8  2013     1     1     1934           1725       129     2126           1855
#>  9  2013     1     1     1938           1703       155     2109           1823
#> 10  2013     1     1     1942           1705       157     2124           1830
#> # ℹ 9,713 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

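All of the logical tests introduced earlier can be used as conditions here (see R Language Introductory Study Notes (Part 3)).

Note that when you run filter(), dplyr performs the filtering, creates a new data frame, and prints it. It does not modify the existing flights dataset. To save the result, combine it with the assignment operator.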
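
The following sketch illustrates these points on a small made-up tibble (`toy` and its values are assumptions standing in for flights). It combines conditions, saves a result with `<-`, and checks that the original data is left untouched.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(
  carrier   = c("AA", "UA", "DL", "AA", "UA"),
  month     = c(1, 1, 2, 2, 3),
  dep_delay = c(5, 130, NA, 200, -3)
)

# Conditions separated by commas are combined with "and"
late_jan <- toy |>
  filter(dep_delay > 120, month == 1)

# %in% keeps rows whose value matches any element of the vector
jan_or_feb <- toy |>
  filter(month %in% c(1, 2))

# Rows where the condition evaluates to NA are dropped, not kept
late <- toy |>
  filter(dep_delay > 120)

# The original tibble still has all of its rows
nrow(toy)
```

Only the objects created with `<-` hold the filtered results; `toy` itself is unchanged.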

2.2 `arrange()`

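arrange() changes the order of the rows based on the values of the columns. It takes a data frame and a set of column names (or more complicated expressions) to sort by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns.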

flights |> 
  arrange(year, month, day, dep_time)
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

You can also wrap a column in desc() to sort by it in descending order:

flights |> 
  arrange(desc(dep_delay))
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     9      641            900      1301     1242           1530
#>  2  2013     6    15     1432           1935      1137     1607           2120
#>  3  2013     1    10     1121           1635      1126     1239           1810
#>  4  2013     9    20     1139           1845      1014     1457           2210
#>  5  2013     7    22      845           1600      1005     1044           1815
#>  6  2013     4    10     1100           1900       960     1342           2211
#>  7  2013     3    17     2321            810       911      135           1020
#>  8  2013     6    27      959           1900       899     1236           2226
#>  9  2013     7    22     2257            759       898      121           1026
#> 10  2013    12     5      756           1700       896     1058           2020
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
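
As a small illustration of tie-breaking and desc() together, here is a sketch on made-up data (`toy` is an assumption, not part of nycflights13). It also shows that arrange() places missing values at the end, even for a descending sort.

```r
library(dplyr)

# Hypothetical toy data with a tie in month and one missing delay
toy <- tibble(
  month     = c(2, 1, 1, 2),
  dep_delay = c(5, 40, 10, NA)
)

# Sort by month ascending; within each month, largest delay first
sorted <- toy |>
  arrange(month, desc(dep_delay))

sorted
```

Within month 1 the 40-minute delay comes before the 10-minute one, and within month 2 the NA is sorted after the observed value.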

2.3 `distinct()`

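distinct() finds all the unique rows in a dataset, so it primarily operates on rows, but you can optionally supply column names as well:

# Remove duplicate rows, if any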
flights |> 
  distinct()
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> 
  distinct(origin, dest)
#> # A tibble: 224 × 2
#>    origin dest 
#>    <chr>  <chr>
#>  1 EWR    IAH  
#>  2 LGA    IAH  
#>  3 JFK    MIA  
#>  4 JFK    BQN  
#>  5 LGA    ATL  
#>  6 EWR    ORD  
#>  7 EWR    FLL  
#>  8 LGA    IAD  
#>  9 JFK    MCO  
#> 10 LGA    ORD  
#> # ℹ 214 more rows

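If you want to keep the other columns when filtering for unique rows, use the .keep_all = TRUE option.

If you instead want to know how many times each combination occurs, use count(); with the sort = TRUE argument the results are arranged in descending order of the count.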
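
A minimal sketch of the difference, using a made-up `routes` tibble (the data and object names are illustrative assumptions). With .keep_all = TRUE, distinct() keeps the first row it finds for each unique combination and retains all columns.

```r
library(dplyr)

# Hypothetical toy data: the EWR-IAH pair appears twice
routes <- tibble(
  origin = c("EWR", "EWR", "JFK", "JFK", "EWR"),
  dest   = c("IAH", "IAH", "MIA", "BQN", "ORD"),
  flight = c(1545, 1714, 1141, 725, 1696)
)

# Only the columns named in distinct() are returned
pairs_only <- routes |>
  distinct(origin, dest)

# All columns are returned, using the first row of each unique pair
first_rows <- routes |>
  distinct(origin, dest, .keep_all = TRUE)

first_rows
```

Here the second EWR-IAH row (flight 1714) is discarded because flight 1545 was the first occurrence of that pair.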

flights |> 
  count(origin, dest, sort = TRUE)
#> # A tibble: 224 × 3
#>    origin dest      n
#>    <chr>  <chr> <int>
#>  1 JFK    LAX   11262
#>  2 LGA    ATL   10263
#>  3 LGA    ORD    8857
#>  4 JFK    SFO    8204
#>  5 LGA    CLT    6168
#>  6 EWR    ORD    6100
#>  7 JFK    BOS    5898
#>  8 LGA    MIA    5781
#>  9 JFK    MCO    5464
#> 10 EWR    BOS    5327
#> # ℹ 214 more rows

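3. Columns

Four important functions affect the columns without changing the rows: mutate() creates new columns derived from existing ones, select() changes which columns are present, rename() changes the names of existing columns, and relocate() changes the positions of the columns.

3.1 `mutate()`

mutate() adds new columns to a data frame that are calculated from existing columns.

# Compute how much time each flight made up in the air (gain) and its speed in miles per hour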
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
#> # A tibble: 336,776 × 21
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>

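By default, new columns are added on the right-hand side. You can use .before = 1 to add them on the left-hand side instead. The leading . indicates that .before is an argument to the function, not the name of a new variable being created. You can also use .after = day to place the new columns right after the day column.

You can also control which columns are kept with the .keep argument. The most commonly used value is .keep = "used", which keeps only the columns that were involved in or created by the mutate() step.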
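
The placement arguments can be sketched as follows on a small made-up tibble (`toy` is an assumption standing in for flights):

```r
library(dplyr)

# Hypothetical toy data with a few of the flights columns
toy <- tibble(
  year = 2013, month = 1, day = 1:3,
  dep_delay = c(2, 4, 2),
  arr_delay = c(11, 20, 33)
)

# Put the new column in the first position
left <- toy |>
  mutate(gain = dep_delay - arr_delay, .before = 1)

# Put the new column right after the day column
after_day <- toy |>
  mutate(gain = dep_delay - arr_delay, .after = day)

names(left)
names(after_day)
```

In both cases the computed values are identical; only the position of `gain` in the result differs.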

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
#> # A tibble: 336,776 × 6
#>    dep_delay arr_delay air_time  gain hours gain_per_hour
#>        <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
#>  1         2        11      227    -9 3.78          -2.38
#>  2         4        20      227   -16 3.78          -4.23
#>  3         2        33      160   -31 2.67         -11.6 
#>  4        -1       -18      183    17 3.05           5.57
#>  5        -6       -25      116    19 1.93           9.83
#>  6        -4        12      150   -16 2.5           -6.4 
#>  7        -5        19      158   -24 2.63          -9.11
#>  8        -3       -14       53    11 0.883         12.5 
#>  9        -3        -8      140     5 2.33           2.14
#> 10        -2         8      138   -10 2.3           -4.35
#> # ℹ 336,766 more rows

3.2 `select()`

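A dataset can easily contain hundreds or even thousands of variables. In that situation, select() lets you quickly zoom in on the subset of columns you need, based on the variable names:

# Select columns by name
flights |> 
  select(year, month, day)

# Select all columns between year and day (inclusive)
flights |> 
  select(year:day)

# Select all columns except those from year to day (inclusive)
flights |> 
  select(!year:day)

# Select all columns that are character vectors
flights |> 
  select(where(is.character))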

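A number of helper functions can be used within select():

starts_with("ab"): matches names that begin with "ab"
ends_with("yz"): matches names that end with "yz"
contains("lm"): matches names that contain "lm"
num_range("x", 1:3): matches x1, x2, and x3

Once you have learned regular expressions later on, you can also use matches() to select variables whose names match a pattern.

You can also rename variables as you select them by using = (new name = old name):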
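
Here is a short sketch of these helpers on a one-row made-up tibble. The column names in `toy` are chosen only to exercise each helper, and the regular expression passed to matches() is an illustrative assumption.

```r
library(dplyr)

# Hypothetical toy data; only the column names matter here
toy <- tibble(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  x1 = 1, x2 = 2, x3 = 3
)

dep_cols   <- toy |> select(starts_with("dep")) |> names()
delay_cols <- toy |> select(ends_with("delay")) |> names()
time_cols  <- toy |> select(contains("_t")) |> names()
x_cols     <- toy |> select(num_range("x", 1:2)) |> names()

# matches() takes a regular expression instead of a fixed string
regex_cols <- toy |> select(matches("^(dep|arr)_time$")) |> names()

dep_cols
regex_cols
```

The helpers only decide which columns are chosen; the columns come back in the order they appear in the data.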

flights |> 
  select(tail_num = tailnum)
#> # A tibble: 336,776 × 1
#>    tail_num
#>    <chr>   
#>  1 N14228  
#>  2 N24211  
#>  3 N619AA  
#>  4 N804JB  
#>  5 N668DN  
#>  6 N39463  
#>  7 N516JB  
#>  8 N829AS  
#>  9 N593JB  
#> 10 N3ALAA  
#> # ℹ 336,766 more rows

3.3 `rename()`

If you want to keep all of the variables and only rename a few of them, use rename():

flights |> 
  rename(tail_num = tailnum)
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

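If you have a batch of inconsistently named columns, janitor::clean_names() can clean them up automatically; see its help page for details.

3.4 `relocate()`

Use relocate() to move variables around. By default, it moves them to the front.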

flights |> 
  relocate(time_hour, air_time)
#> # A tibble: 336,776 × 19
#>    time_hour           air_time  year month   day dep_time sched_dep_time
#>    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
#>  1 2013-01-01 05:00:00      227  2013     1     1      517            515
#>  2 2013-01-01 05:00:00      227  2013     1     1      533            529
#>  3 2013-01-01 05:00:00      160  2013     1     1      542            540
#>  4 2013-01-01 05:00:00      183  2013     1     1      544            545
#>  5 2013-01-01 06:00:00      116  2013     1     1      554            600
#>  6 2013-01-01 05:00:00      150  2013     1     1      554            558
#>  7 2013-01-01 06:00:00      158  2013     1     1      555            600
#>  8 2013-01-01 06:00:00       53  2013     1     1      557            600
#>  9 2013-01-01 06:00:00      140  2013     1     1      557            600
#> 10 2013-01-01 06:00:00      138  2013     1     1      558            600
#> # ℹ 336,766 more rows
#> # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>

You can also specify where to put them using the .before and .after arguments mentioned earlier:

flights |> 
  relocate(year:dep_time, .after = time_hour)
flights |> 
  relocate(starts_with("arr"), .before = dep_time)

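4. Groups

So far we have worked with rows and columns. dplyr becomes even more powerful once you add the ability to work with groups. This section introduces group_by(), summarize(), and the slice family of functions.

4.1 `group_by()`

This function divides a dataset into groups that are meaningful for your analysis: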

flights |> 
  group_by(month)
#> # A tibble: 336,776 × 19
#> # Groups:   month [12]
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

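group_by() does not change the data, but the output shows that it is now grouped by month (Groups: month [12]). This means that subsequent operations will work month by month. group_by() adds this grouping feature (as a class) to the data frame.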
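
A minimal sketch of what group_by() adds, using a made-up tibble (`toy` and `by_month` are illustrative names). It inspects the class and grouping metadata and checks that the column values themselves are untouched.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(month = c(1, 1, 2), dep_delay = c(2, 4, -1))
by_month <- toy |> group_by(month)

class(by_month)       # the class vector now includes "grouped_df"
group_vars(by_month)  # names of the grouping variables
n_groups(by_month)    # number of groups

# The values in the columns are exactly the same as before
identical(toy$dep_delay, by_month$dep_delay)
```

Only the metadata changes; later verbs such as summarize() read it to decide how to split the rows.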

4.2 `summarize()`

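The most important grouped operation is a summary. When it computes a single summary statistic per group, it reduces the data frame to one row for each group:

# Compute the average departure delay by month
flights |> 
  group_by(month) |> 
  summarize(
    avg_delay = mean(dep_delay)
  )
#> # A tibble: 12 × 2
#>    month avg_delay
#>    <int>     <dbl>
#>  1     1        NA
#>  2     2        NA
#>  3     3        NA
#>  4     4        NA
#>  5     5        NA
#>  6     6        NA
#>  7     7        NA
#>  8     8        NA
#>  9     9        NA
#> 10    10        NA
#> 11    11        NA
#> 12    12        NA

Every result is NA, which is not useful. Looking at the data, some flights have missing values in dep_delay, and the mean of any set of values that includes a missing value is itself NA. We can tell mean() to ignore missing values by adding na.rm = TRUE: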

flights |> 
  group_by(month) |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE)
  )
#> # A tibble: 12 × 2
#>    month avg_delay
#>    <int>     <dbl>
#>  1     1     10.0 
#>  2     2     10.8 
#>  3     3     13.2 
#>  4     4     13.9 
#>  5     5     13.0 
#>  6     6     20.8 
#>  7     7     21.7 
#>  8     8     12.6 
#>  9     9      6.72
#> 10    10      6.24
#> 11    11      5.44
#> 12    12     16.6
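
To see the effect of na.rm on a scale small enough to check by hand, here is a sketch on made-up data (`toy` and the column names in the summary are assumptions). It also counts how many values were ignored, which is worth checking before trusting an average.

```r
library(dplyr)

# Hypothetical toy data: month 1 has one missing delay
toy <- tibble(
  month     = c(1, 1, 1, 2, 2),
  dep_delay = c(10, NA, 20, 5, 7)
)

res <- toy |>
  group_by(month) |>
  summarize(
    avg_naive = mean(dep_delay),                # NA if any value is missing
    avg_delay = mean(dep_delay, na.rm = TRUE),  # missing values ignored
    n_missing = sum(is.na(dep_delay))           # how many were ignored
  )

res
```

For month 1 the naive mean is NA while the na.rm version averages the two observed values; for month 2 both versions agree because nothing is missing.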

You can create any number of summaries in a single call to summarize(). One very useful summary is n(), which returns the number of rows in each group:

flights |> 
  group_by(month) |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  )
#> # A tibble: 12 × 3
#>    month avg_delay     n
#>    <int>     <dbl> <int>
#>  1     1     10.0  27004
#>  2     2     10.8  24951
#>  3     3     13.2  28834
#>  4     4     13.9  28330
#>  5     5     13.0  28796
#>  6     6     20.8  28243
#>  7     7     21.7  29425
#>  8     8     12.6  29327
#>  9     9      6.72 27574
#> 10    10      6.24 28889
#> 11    11      5.44 27268
#> 12    12     16.6  28135

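4.3 The `slice_` functions

Five handy functions let you extract specific rows within each group:

df |> slice_head(n = 1): takes the first row from each group
df |> slice_tail(n = 1): takes the last row from each group
df |> slice_min(x, n = 1): takes the row with the smallest value of column x (if several rows tie for the minimum, all of them are returned; set with_ties = FALSE to get exactly one row)
df |> slice_max(x, n = 1): takes the row with the largest value of column x
df |> slice_sample(n = 1): takes one random row

Note: you can vary n to select more than one row, or use prop = 0.1 instead of n to select 10% of the rows in each group.

# Find the flight with the longest arrival delay at each destination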
flights |> 
  group_by(dest) |> 
  slice_max(arr_delay, n = 1) |> 
  relocate(dest)
#> # A tibble: 108 × 19
#> # Groups:   dest [105]
#>    dest   year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1 ABQ    2013     7    22     2145           2007        98      132
#>  2 ACK    2013     7    23     1139            800       219     1250
#>  3 ALB    2013     1    25      123           2000       323      229
#>  4 ANC    2013     8    17     1740           1625        75     2042
#>  5 ATL    2013     7    22     2257            759       898      121
#>  6 AUS    2013     7    10     2056           1505       351     2347
#>  7 AVL    2013     8    13     1156            832       204     1417
#>  8 BDL    2013     2    21     1728           1316       252     1839
#>  9 BGR    2013    12     1     1504           1056       248     1628
#> 10 BHM    2013     4    10       25           1900       325      136
#> # ℹ 98 more rows
#> # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
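
Ties explain why a result can have more rows than groups, as in the output above (108 rows for 105 destinations). The sketch below reproduces this on made-up data (`toy` and the result names are assumptions) and also shows prop in place of n.

```r
library(dplyr)

# Hypothetical toy data: ATL has two flights tied for the largest delay
toy <- tibble(
  dest      = c("ATL", "ATL", "ATL", "ATL", "BOS", "BOS"),
  arr_delay = c(30, 90, 90, 10, 15, 40)
)

# Default: every row tied for the maximum is returned
with_ties <- toy |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1)

# with_ties = FALSE returns exactly one row per group
no_ties <- toy |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1, with_ties = FALSE)

# prop = 0.5 takes the first half of the rows of each group
half <- toy |>
  group_by(dest) |>
  slice_head(prop = 0.5)

nrow(with_ties)
nrow(no_ties)
```

With ties kept, ATL contributes two rows and BOS one; with with_ties = FALSE each destination contributes exactly one.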

4.4 Grouping by multiple variables

You can also create groups using more than one variable:

daily <- flights |> 
  group_by(year, month, day)
daily
#> # A tibble: 336,776 × 19
#> # Groups:   year, month, day [365]
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

When you summarize a tibble that is grouped by more than one variable, each summary peels off the last grouping variable:

daily |> 
  summarize(n = n())
#> `summarise()` has grouped output by 'year', 'month'. You can override using the
#> `.groups` argument.
#> # A tibble: 365 × 4
#> # Groups:   year, month [12]
#>     year month   day     n
#>    <int> <int> <int> <int>
#>  1  2013     1     1   842
#>  2  2013     1     2   943
#>  3  2013     1     3   914
#>  4  2013     1     4   915
#>  5  2013     1     5   720
#>  6  2013     1     6   832
#>  7  2013     1     7   933
#>  8  2013     1     8   899
#>  9  2013     1     9   902
#> 10  2013     1    10   932
#> # ℹ 355 more rows

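As the message indicates, you can use the .groups argument to change this behavior. For example, .groups = "drop_last" drops the last grouping variable (the default behavior shown above), .groups = "drop" drops all grouping, and .groups = "keep" preserves the original groups.

4.5 Ungrouping

You can also remove the grouping from a data frame without using summarize():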
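
The three options can be compared by checking which grouping variables remain on the result. This is a sketch on made-up data (`toy`, `daily_toy`, and the result names are illustrative assumptions).

```r
library(dplyr)

# Hypothetical toy data covering three distinct days
toy <- tibble(
  year  = 2013,
  month = c(1, 1, 1, 2),
  day   = c(1, 1, 2, 1)
)
daily_toy <- toy |> group_by(year, month, day)

drop_last <- daily_toy |> summarize(n = n(), .groups = "drop_last")
dropped   <- daily_toy |> summarize(n = n(), .groups = "drop")
kept      <- daily_toy |> summarize(n = n(), .groups = "keep")

group_vars(drop_last)  # year and month remain
group_vars(dropped)    # no grouping left
group_vars(kept)       # year, month, and day all remain
```

The summarized values are the same in all three results; only the grouping carried forward to the next step differs. Setting .groups explicitly also suppresses the informational message.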

daily |> 
  ungroup()
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

Now see what happens when you summarize an ungrouped data frame:

daily |> 
  ungroup() |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    flights = n()
  )
#> # A tibble: 1 × 2
#>   avg_delay flights
#>       <dbl>   <int>
#> 1      12.6  336776

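All of the rows in an ungrouped data frame are treated as belonging to a single group, so you get back a single row.

4.6 `.by`

dplyr 1.1.0 introduced a new syntax for grouping within a single operation, the .by argument: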

flights |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .by = month
  )
#> # A tibble: 12 × 3
#>    month delay     n
#>    <int> <dbl> <int>
#>  1     1 10.0  27004
#>  2    10  6.24 28889
#>  3    11  5.44 27268
#>  4    12 16.6  28135
#>  5     2 10.8  24951
#>  6     3 13.2  28834
#>  7     4 13.9  28330
#>  8     5 13.0  28796
#>  9     6 20.8  28243
#> 10     7 21.7  29425
#> 11     8 12.6  29327
#> 12     9  6.72 27574

You can also group by multiple variables:

flights |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE), 
    n = n(),
    .by = c(origin, dest)
  )
#> # A tibble: 224 × 4
#>    origin dest  delay     n
#>    <chr>  <chr> <dbl> <int>
#>  1 EWR    IAH   11.8   3973
#>  2 LGA    IAH    9.06  2951
#>  3 JFK    MIA    9.34  3314
#>  4 JFK    BQN    6.67   599
#>  5 LGA    ATL   11.4  10263
#>  6 EWR    ORD   14.6   6100
#>  7 EWR    FLL   13.5   3793
#>  8 LGA    IAD   16.7   1803
#>  9 JFK    MCO   10.6   5464
#> 10 LGA    ORD   10.7   8857
#> # ℹ 214 more rows

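.by is available in the dplyr verbs that operate by group, such as summarize(), mutate(), and filter(). Its advantage is that you do not need the .groups argument to control the grouping of the result, and you do not need to call ungroup() when you are done.

Summary

This section covered the basic methods of data transformation, including common operations on rows and columns and ways to analyze data by group. We also introduced the pipe, which lets us chain many operations together concisely and efficiently to carry out complex manipulations. These methods will come up constantly in future data analysis, so pay attention to choosing the right function for each need. Practice as you learn and apply what you have learned often, and it will all come together in the end.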
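
As a sketch of .by outside summarize(), the example below uses it with mutate() and filter() on made-up data (`toy` and the result names are assumptions; dplyr 1.1.0 or later is required). The result comes back ungrouped, so no ungroup() call is needed.

```r
library(dplyr)  # .by requires dplyr 1.1.0 or later

# Hypothetical toy data
toy <- tibble(
  origin    = c("EWR", "EWR", "JFK", "JFK"),
  dep_delay = c(10, 30, 5, 15)
)

# Per-origin centering: the mean is computed within each origin
centered <- toy |>
  mutate(delay_vs_origin = dep_delay - mean(dep_delay), .by = origin)

# Keep the most delayed row of each origin
worst <- toy |>
  filter(dep_delay == max(dep_delay), .by = origin)

is_grouped_df(centered)  # the grouping applies to this one call only
```

Because the grouping lasts only for the single call, later steps in a pipeline start from an ordinary ungrouped tibble.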