R for Beginners, Study Notes 7: Efficient Group Manipulation with dplyr

LL_2048

Category: R Language Study Notes. Tags: R, big data

Article link: https://blog.csdn.net/LL_2048/article/details/113758914

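Links to earlier notes

Study Notes 1: R Basics
Study Notes 2: Advanced Data Structures
Study Notes 3: Reading Data into R
Study Notes 4: Statistical Graphics
Study Notes 5: Writing R Functions and Simple Control Statements
Study Notes 6: Group Manipulation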

Study Notes 7: Efficient Group Manipulation with dplyr

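The "d" in dplyr stands for data frames, and the package processes them faster than plyr does.

When using plyr and dplyr together, load plyr first and dplyr second. The two packages share many function names, and the package loaded later takes precedence, masking the same-named functions of the earlier one.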
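Besides controlling the load order, a function can be called with an explicit package prefix, which sidesteps masking altogether. The following is a minimal sketch under that assumption; the `sales` data frame and its columns are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, not from the original notes
sales <- data.frame(region = c("east", "east", "west"),
                    amount = c(10, 20, 5))

# Namespace-qualified calls always resolve to dplyr's versions,
# regardless of which other packages are attached or in what order.
totals <- sales %>%
  dplyr::group_by(region) %>%
  dplyr::summarise(total = sum(amount))

totals
```

The prefix is more typing, but it makes the intended package explicit in scripts that attach both plyr and dplyr.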

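7.1 Pipes and the tbl data type

dplyr is not only fast, it also works with the pipe operator from the magrittr package. (Pipe syntax is covered in Study Notes 1: R Basics.)

Example: pass the diamonds data to head, then pass the result to dim (number of rows and columns).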

> library(magrittr)
> data(diamonds, package = "ggplot2")
> dim(head(diamonds, n=4))
[1]  4 10
> diamonds %>% head(4) %>% dim
[1]  4 10
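To make the equivalence between nested calls and the pipe concrete, here is a small sketch. It uses the built-in mtcars dataset instead of diamonds so that it runs without ggplot2; that substitution is an assumption made for the example only.

```r
library(magrittr)

# Nested call: read from the inside out
nested <- dim(head(mtcars, n = 4))

# Piped call: read from left to right; the left-hand side
# becomes the first argument of the function on the right
piped <- mtcars %>% head(4) %>% dim()

# The dot placeholder puts the left-hand side somewhere other
# than the first argument
placed <- 4 %>% head(mtcars, n = .) %>% dim()

identical(nested, piped)
identical(nested, placed)
```

All three forms compute the same thing; the pipe only changes how the code reads.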

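When a tbl is printed, only the first few rows are shown by default, and only as many columns as fit the width of the screen. In addition, the data type of each column is shown beneath the column name.

In recent versions of ggplot2, the diamonds dataset is stored as a tbl_df (an extension of tbl).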

> class(diamonds)
[1] "tbl_df"     "tbl"        "data.frame"
> head(diamonds)
  carat       cut color clarity depth table price    x
1  0.23     Ideal     E     SI2  61.5    55   326 3.95
2  0.21   Premium     E     SI1  59.8    61   326 3.89
3  0.23      Good     E     VS1  56.9    65   327 4.05
4  0.29   Premium     I     VS2  62.4    58   334 4.20
5  0.31      Good     J     SI2  63.3    58   335 4.34
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94
     y    z
1 3.98 2.43
2 3.84 2.31
3 4.07 2.31
4 4.23 2.63
5 4.35 2.75
6 3.96 2.48

After dplyr is loaded, diamonds is printed as a tbl.

Because printing a tbl shows only the first few rows, there is no need to call head:

> library(dplyr)
> diamonds
# A tibble: 53,940 x 10
   carat cut   color clarity depth table price     x
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1 0.23  Ideal E     SI2      61.5    55   326  3.95
 2 0.21  Prem~ E     SI1      59.8    61   326  3.89
 3 0.23  Good  E     VS1      56.9    65   327  4.05
 4 0.290 Prem~ I     VS2      62.4    58   334  4.2 
 5 0.31  Good  J     SI2      63.3    58   335  4.34
 6 0.24  Very~ J     VVS2     62.8    57   336  3.94
 7 0.24  Very~ I     VVS1     62.3    57   336  3.95
 8 0.26  Very~ H     SI1      61.9    55   337  4.07
 9 0.22  Fair  E     VS2      65.1    61   337  3.87
10 0.23  Very~ H     VS1      59.4    61   338  4   
# ... with 53,930 more rows, and 2 more variables:
#   y <dbl>, z <dbl>
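An ordinary data frame can be turned into a tbl to get the same compact printing. The sketch below assumes the tibble package (installed alongside dplyr) and again uses the built-in mtcars data for illustration.

```r
library(tibble)

cars_df  <- mtcars             # plain data.frame
cars_tbl <- as_tibble(mtcars)  # same data as a tbl_df

class(cars_df)
class(cars_tbl)

# Printing the tbl shows only the first rows plus column types
cars_tbl
```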

7.2 The select function

Purpose: select columns.

The first argument of select is a data frame or tbl; the remaining arguments are the columns to keep.

> select(diamonds, carat, price)
# A tibble: 53,940 x 2
   carat price
   <dbl> <int>
 1 0.23    326
 2 0.21    326
 3 0.23    327
 4 0.290   334
 5 0.31    335
 6 0.24    336
 7 0.24    336
 8 0.26    337
 9 0.22    337
10 0.23    338
# ... with 53,930 more rows

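Column names can also be passed in as a vector. (The book covers the select_ function, but when I tried it R reported that this approach is deprecated.)

If the column names are stored in a variable, use the one_of function: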

> theCols <- c('carat', 'price')
> diamonds %>% select(one_of(theCols))
# A tibble: 53,940 x 2
   carat price
   <dbl> <int>
 1 0.23    326
 2 0.21    326
 3 0.23    327
 4 0.290   334
 5 0.31    335
 6 0.24    336
 7 0.24    336
 8 0.26    337
 9 0.22    337
10 0.23    338
# ... with 53,930 more rows
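In dplyr 1.0.0 and later, one_of is itself superseded by all_of and any_of. A minimal sketch of the newer helpers, assuming such a version is installed and using a made-up two-row data frame:

```r
library(dplyr)

# Hypothetical toy data with a few diamonds-like columns
gems <- data.frame(carat = c(0.23, 0.21),
                   cut   = c("Ideal", "Premium"),
                   price = c(326L, 326L))

theCols <- c("carat", "price")

# all_of() is strict: it errors if a requested column is missing
strict <- gems %>% select(all_of(theCols))

# any_of() is lenient: names that do not exist are silently skipped
lenient <- gems %>% select(any_of(c(theCols, "no_such_column")))

names(strict)
names(lenient)
```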

As with square-bracket indexing, columns can also be specified by position:

> select(diamonds, 1, 7)
# A tibble: 53,940 x 2
   carat price
   <dbl> <int>
 1 0.23    326
 2 0.21    326
 3 0.23    327
 4 0.290   334
 5 0.31    335
 6 0.24    336
 7 0.24    336
 8 0.26    337
 9 0.22    337
10 0.23    338
# ... with 53,930 more rows

The starts_with, ends_with and contains functions in dplyr select columns by partial name match:

> diamonds %>% select(starts_with('c'))
# A tibble: 53,940 x 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1 0.23  Ideal     E     SI2    
 2 0.21  Premium   E     SI1    
 3 0.23  Good      E     VS1    
 4 0.290 Premium   I     VS2    
 5 0.31  Good      J     SI2    
 6 0.24  Very Good J     VVS2   
 7 0.24  Very Good I     VVS1   
 8 0.26  Very Good H     SI1    
 9 0.22  Fair      E     VS2    
10 0.23  Very Good H     VS1    
# ... with 53,930 more rows
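The other two helpers work the same way. Here is a small sketch on a one-row data frame whose column names mirror those of diamonds; the values are placeholders chosen only so the example is self-contained.

```r
library(dplyr)

# One-row stand-in with the same column names as diamonds
cols <- data.frame(carat = 0.23, cut = "Ideal", color = "E",
                   clarity = "SI2", depth = 61.5, table = 55,
                   price = 326L, x = 3.95, y = 3.98, z = 2.43)

# Columns whose names end in "e"
ends_e <- names(select(cols, ends_with("e")))

# Columns whose names contain "l" anywhere
has_l <- names(select(cols, contains("l")))

ends_e
has_l
```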

select uses the minus sign (-) to exclude columns:

> diamonds %>% select(-carat, -price)
# A tibble: 53,940 x 8
   cut       color clarity depth table     x     y     z
   <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Ideal     E     SI2      61.5    55  3.95  3.98  2.43
 2 Premium   E     SI1      59.8    61  3.89  3.84  2.31
 3 Good      E     VS1      56.9    65  4.05  4.07  2.31
 4 Premium   I     VS2      62.4    58  4.2   4.23  2.63
 5 Good      J     SI2      63.3    58  4.34  4.35  2.75
 6 Very Good J     VVS2     62.8    57  3.94  3.96  2.48
 7 Very Good I     VVS1     62.3    57  3.95  3.98  2.47
 8 Very Good H     SI1      61.9    55  4.07  4.11  2.53
 9 Fair      E     VS2      65.1    61  3.87  3.78  2.49
10 Very Good H     VS1      59.4    61  4     4.05  2.39
# ... with 53,930 more rows

When one_of is used, the minus sign must go in front of one_of:

> diamonds %>% select(-one_of('carat', 'price'))
# A tibble: 53,940 x 8
   cut       color clarity depth table     x     y     z
   <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Ideal     E     SI2      61.5    55  3.95  3.98  2.43
 2 Premium   E     SI1      59.8    61  3.89  3.84  2.31
 3 Good      E     VS1      56.9    65  4.05  4.07  2.31
 4 Premium   I     VS2      62.4    58  4.2   4.23  2.63
 5 Good      J     SI2      63.3    58  4.34  4.35  2.75
 6 Very Good J     VVS2     62.8    57  3.94  3.96  2.48
 7 Very Good I     VVS1     62.3    57  3.95  3.98  2.47
 8 Very Good H     SI1      61.9    55  4.07  4.11  2.53
 9 Fair      E     VS2      65.1    61  3.87  3.78  2.49
10 Very Good H     VS1      59.4    61  4     4.05  2.39
# ... with 53,930 more rows

7.3 The filter function

The filter function selects rows with a logical expression.

> diamonds %>% filter(cut == 'Ideal')
# A tibble: 21,551 x 10
   carat cut   color clarity depth table price     x
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95
 2  0.23 Ideal J     VS1      62.8    56   340  3.93
 3  0.31 Ideal J     SI2      62.2    54   344  4.35
 4  0.3  Ideal I     SI2      62      54   348  4.31
 5  0.33 Ideal I     SI2      61.8    55   403  4.49
 6  0.33 Ideal I     SI2      61.2    56   403  4.49
 7  0.33 Ideal J     SI1      61.1    56   403  4.49
 8  0.23 Ideal G     VS1      61.9    54   404  3.93
 9  0.32 Ideal I     SI1      60.9    55   404  4.45
10  0.3  Ideal I     SI2      61      59   405  4.3 
# ... with 21,541 more rows, and 2 more variables:
#   y <dbl>, z <dbl>

filter can use the %in% operator to keep rows whose value matches any one of several candidates.

> diamonds %>% filter(cut %in% c('Ideal', 'Good'))
# A tibble: 26,457 x 10
   carat cut   color clarity depth table price     x
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95
 2  0.23 Good  E     VS1      56.9    65   327  4.05
 3  0.31 Good  J     SI2      63.3    58   335  4.34
 4  0.3  Good  J     SI1      64      55   339  4.25
 5  0.23 Ideal J     VS1      62.8    56   340  3.93
 6  0.31 Ideal J     SI2      62.2    54   344  4.35
 7  0.3  Ideal I     SI2      62      54   348  4.31
 8  0.3  Good  J     SI1      63.4    54   351  4.23
 9  0.3  Good  J     SI1      63.8    56   351  4.23
10  0.3  Good  I     SI2      63.3    56   351  4.26
# ... with 26,447 more rows, and 2 more variables:
#   y <dbl>, z <dbl>

Several conditions can be combined with a comma (,) or an ampersand (&); both mean that all conditions must hold:

> diamonds %>% filter(carat > 2, price < 14000)
# A tibble: 644 x 10
   carat cut   color clarity depth table price     x
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1  2.06 Prem~ J     I1       61.2    58  5203  8.1 
 2  2.14 Fair  J     I1       69.4    57  5405  7.74
 3  2.15 Fair  J     I1       65.5    57  5430  8.01
 4  2.22 Fair  J     I1       66.7    56  5607  8.04
 5  2.01 Fair  I     I1       67.4    58  5696  7.71
 6  2.01 Fair  I     I1       55.9    64  5696  8.48
 7  2.27 Fair  J     I1       67.6    55  5733  8.05
 8  2.03 Fair  H     I1       64.4    59  6002  7.91
 9  2.03 Fair  H     I1       66.6    57  6002  7.81
10  2.06 Good  H     I1       64.3    58  6091  8.03
# ... with 634 more rows, and 2 more variables: y <dbl>,
#   z <dbl>
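A short sketch comparing the comma and & forms with the | (or) operator, using a four-row toy data frame that is not part of the original notes:

```r
library(dplyr)

# Hypothetical toy data
gems <- data.frame(carat = c(2.5, 2.1, 0.5, 3.0),
                   price = c(12000, 15000, 900, 13000))

# Comma and & both require every condition to be TRUE
both_comma <- gems %>% filter(carat > 2, price < 14000)
both_amp   <- gems %>% filter(carat > 2 & price < 14000)

# | keeps rows where at least one condition is TRUE
either <- gems %>% filter(carat > 2 | price < 14000)

nrow(both_comma)
nrow(either)
```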

7.4 The slice function

The slice function selects rows by row number.

Pass the desired indices to slice as a vector.

> diamonds %>% slice(c(1:5, 8, 15:20))
# A tibble: 12 x 10
   carat cut   color clarity depth table price     x
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1 0.23  Ideal E     SI2      61.5    55   326  3.95
 2 0.21  Prem~ E     SI1      59.8    61   326  3.89
 3 0.23  Good  E     VS1      56.9    65   327  4.05
 4 0.290 Prem~ I     VS2      62.4    58   334  4.2 
 5 0.31  Good  J     SI2      63.3    58   335  4.34
 6 0.26  Very~ H     SI1      61.9    55   337  4.07
 7 0.2   Prem~ E     SI2      60.2    62   345  3.79
 8 0.32  Prem~ E     I1       60.9    58   345  4.38
 9 0.3   Ideal I     SI2      62      54   348  4.31
10 0.3   Good  J     SI1      63.4    54   351  4.23
11 0.3   Good  J     SI1      63.8    56   351  4.23
12 0.3   Very~ J     SI1      62.7    59   351  4.21
# ... with 2 more variables: y <dbl>, z <dbl>

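Note: the row numbers printed on the left are positions within the returned result, not the original row indices passed to slice.

Negative indices drop the specified rows and return all the others: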

> diamonds %>% slice(-1)
# A tibble: 53,939 x 10
   carat cut   color clarity depth table price     x
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1 0.21  Prem~ E     SI1      59.8    61   326  3.89
 2 0.23  Good  E     VS1      56.9    65   327  4.05
 3 0.290 Prem~ I     VS2      62.4    58   334  4.2 
 4 0.31  Good  J     SI2      63.3    58   335  4.34
 5 0.24  Very~ J     VVS2     62.8    57   336  3.94
 6 0.24  Very~ I     VVS1     62.3    57   336  3.95
 7 0.26  Very~ H     SI1      61.9    55   337  4.07
 8 0.22  Fair  E     VS2      65.1    61   337  3.87
 9 0.23  Very~ H     VS1      59.4    61   338  4   
10 0.3   Good  J     SI1      64      55   339  4.25
# ... with 53,929 more rows, and 2 more variables:
#   y <dbl>, z <dbl>
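The indexing rules can be checked on a tiny data frame where each row carries its own id. This is a sketch with made-up data; n() is dplyr's helper for the number of rows in the current group.

```r
library(dplyr)

# Hypothetical toy data: id records the original row position
rows <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

picked  <- rows %>% slice(c(1:2, 5))  # rows 1, 2 and 5
dropped <- rows %>% slice(-1)         # everything except row 1
last    <- rows %>% slice(n())        # the last row

picked$id
dropped$id
last$id
```

The id column makes it easy to see which original rows came back, since the printed row numbers restart at 1.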

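7.5 The mutate function

The mutate function adds new columns or modifies existing ones.

Example: add a new column computed as the ratio of two variables, and give it a name.

(select is applied first because, with all ten columns, the new column might not fit in the printed output.)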

> diamonds %>% select(carat, price) %>% mutate(Ratio=price/carat)
# A tibble: 53,940 x 3
   carat price Ratio
   <dbl> <int> <dbl>
 1 0.23    326 1417.
 2 0.21    326 1552.
 3 0.23    327 1422.
 4 0.290   334 1152.
 5 0.31    335 1081.
 6 0.24    336 1400 
 7 0.24    336 1400 
 8 0.26    337 1296.
 9 0.22    337 1532.
10 0.23    338 1470.
# ... with 53,930 more rows

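Note: this does not change the original dataset.

To save the result, use magrittr's assignment pipe (%<>%), which pipes the object on its left into the function on its right and then assigns the returned result back to the object on the left: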

> diamonds2 <- diamonds
> diamonds2
# A tibble: 53,940 x 10
   carat cut   color clarity depth table price     x
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1 0.23  Ideal E     SI2      61.5    55   326  3.95
 2 0.21  Prem~ E     SI1      59.8    61   326  3.89
 3 0.23  Good  E     VS1      56.9    65   327  4.05
 4 0.290 Prem~ I     VS2      62.4    58   334  4.2 
 5 0.31  Good  J     SI2      63.3    58   335  4.34
 6 0.24  Very~ J     VVS2     62.8    57   336  3.94
 7 0.24  Very~ I     VVS1     62.3    57   336  3.95
 8 0.26  Very~ H     SI1      61.9    55   337  4.07
 9 0.22  Fair  E     VS2      65.1    61   337  3.87
10 0.23  Very~ H     VS1      59.4    61   338  4   
# ... with 53,930 more rows, and 2 more variables:
#   y <dbl>, z <dbl>
> diamonds2 %<>% select(carat, price) %>% 
+     mutate(Ratio=price/carat, Double=Ratio*2)
> diamonds2
# A tibble: 53,940 x 4
   carat price Ratio Double
   <dbl> <int> <dbl>  <dbl>
 1 0.23    326 1417.  2835.
 2 0.21    326 1552.  3105.
 3 0.23    327 1422.  2843.
 4 0.290   334 1152.  2303.
 5 0.31    335 1081.  2161.
 6 0.24    336 1400   2800 
 7 0.24    336 1400   2800 
 8 0.26    337 1296.  2592.
 9 0.22    337 1532.  3064.
10 0.23    338 1470.  2939.
# ... with 53,930 more rows
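The same effect can be had with ordinary assignment, which some readers find easier to follow. A minimal sketch on made-up data, also showing that a column created inside mutate can be used later in the same call:

```r
library(dplyr)

# Hypothetical toy data
gems  <- data.frame(carat = c(0.5, 2.0), price = c(1000, 9000))
gems2 <- gems

# Equivalent to: gems2 %<>% mutate(...)
gems2 <- gems2 %>%
  mutate(Ratio  = price / carat,
         Double = Ratio * 2)   # Ratio is already available here

names(gems)   # original is untouched
gems2
```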

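7.6 The summarize function

The summarize function is meant for functions that return a result of length 1, such as mean, max and median.

Example: compute the mean of one column of the dataset.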

> diamonds %>% summarize(mean(price))
# A tibble: 1 x 1
  `mean(price)`
          <dbl>
1         3933.

A major feature of summarize is that results can be named and several calculations can be done in a single call:

> diamonds %>% 
+     summarize(AvgPrice=mean(price),
+               MedianPrice=median(price),
+               AvgCarat=mean(carat))
# A tibble: 1 x 3
  AvgPrice MedianPrice AvgCarat
     <dbl>       <dbl>    <dbl>
1    3933.        2401    0.798

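7.7 The group_by function

The group_by function splits the data into groups so that functions are applied to each group separately.

Example: group by one variable, then apply summarize to each group.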

> diamonds %>% 
+     group_by(cut) %>%
+     summarize(AvgPrice=mean(price))
# A tibble: 5 x 2
  cut       AvgPrice
* <ord>        <dbl>
1 Fair         4359.
2 Good         3929.
3 Very Good    3982.
4 Premium      4584.
5 Ideal        3458.
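To see the per-group arithmetic on numbers small enough to check by hand, here is a sketch with a five-row toy data frame. The n() helper counts rows per group, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data
gems <- data.frame(cut   = c("Fair", "Fair", "Ideal", "Ideal", "Ideal"),
                   price = c(100, 300, 200, 400, 600))

by_cut <- gems %>%
  group_by(cut) %>%
  summarize(AvgPrice = mean(price),
            Count    = n(),
            .groups  = "drop")   # return an ungrouped result

by_cut
```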

7.8 The arrange function

The arrange function sorts rows, and is easier to understand than the order and sort functions in base R.

> diamonds %>% 
+     group_by(cut) %>%
+     summarize(AvgPrice=mean(price), SumCarat=sum(carat)) %>%
+     arrange(desc(AvgPrice))
# A tibble: 5 x 3
  cut       AvgPrice SumCarat
  <ord>        <dbl>    <dbl>
1 Premium      4584.   12301.
2 Fair         4359.    1684.
3 Very Good    3982.    9743.
4 Good         3929.    4166.
5 Ideal        3458.   15147.
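arrange also accepts several sort keys, with later keys breaking ties in earlier ones. A small sketch on made-up data:

```r
library(dplyr)

# Hypothetical toy data
gems <- data.frame(cut   = c("Good", "Fair", "Good", "Fair"),
                   price = c(300, 100, 500, 200))

# Sort by cut ascending, then by price descending within each cut
sorted <- gems %>% arrange(cut, desc(price))

sorted
```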

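7.9 The do function

For general-purpose computations that dplyr's dedicated verbs (such as summarize) do not cover, use the do function, which applies an arbitrary function to the data.

Example:

# Create a function that sorts the diamonds data by price and returns the top N rows
> topN <- function(x, N=5)
+ {
+     x %>% arrange(desc(price)) %>% head(N)
+ }
# Combine group_by and do to return, for each group, the top N rows sorted by price
> diamonds %>% group_by(cut) %>% do(topN(., N=3))
# Inside do(), the dot (.) stands for the current group's data, so it marks where that data is passed to topN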
# A tibble: 15 x 10
# Groups:   cut [5]
   carat cut   color clarity depth table price     x
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1  2.01 Fair  G     SI1      70.6    64 18574  7.43
 2  2.02 Fair  H     VS2      64.5    57 18565  8   
 3  4.5  Fair  J     I1       65.8    58 18531 10.2 
 4  2.8  Good  G     SI2      63.8    58 18788  8.9 
 5  2.07 Good  I     VS2      61.8    61 18707  8.12
 6  2.67 Good  F     SI2      63.8    58 18686  8.69
 7  2    Very~ G     SI1      63.5    56 18818  7.9 
 8  2    Very~ H     SI1      62.8    57 18803  7.95
 9  2.03 Very~ H     SI1      63      60 18781  8   
10  2.29 Prem~ I     VS2      60.8    60 18823  8.5 
11  2.29 Prem~ I     SI1      61.8    59 18797  8.52
12  2.04 Prem~ H     SI1      58.1    60 18795  8.37
13  1.51 Ideal G     IF       61.7    55 18806  7.37
14  2.07 Ideal G     SI2      62.5    55 18804  8.2 
15  2.15 Ideal G     SI2      62.6    54 18791  8.29
# ... with 2 more variables: y <dbl>, z <dbl>

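The result here is a data frame. If the argument to do is named, the result changes: each group's output is stored as an element of a list column.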

> diamonds %>% group_by(cut) %>% do(Top=(topN(., N=3)))
# A tibble: 5 x 2
# Rowwise: 
  cut       Top              
  <ord>     <list>           
1 Fair      <tibble [3 x 10]>
2 Good      <tibble [3 x 10]>
3 Very Good <tibble [3 x 10]>
4 Premium   <tibble [3 x 10]>
5 Ideal     <tibble [3 x 10]>
> topByCut <- diamonds %>% group_by(cut) %>% do(Top=(topN(., N=3)))
> class(topByCut)
[1] "rowwise_df" "tbl_df"     "tbl"        "data.frame"
> topByCut$Top[[1]]
# A tibble: 3 x 10
  carat cut   color clarity depth table price     x     y
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl>
1  2.01 Fair  G     SI1      70.6    64 18574  7.43  6.64
2  2.02 Fair  H     VS2      64.5    57 18565  8     7.95
3  4.5  Fair  J     I1       65.8    58 18531 10.2  10.2 
# ... with 1 more variable: z <dbl>
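Since dplyr 1.0.0, do is marked as superseded, and the "top N per group" task can also be written with slice_max. The sketch below compares the two on made-up data, assuming dplyr 1.0.0 or later; it is one possible replacement, not the approach used in the book.

```r
library(dplyr)

# Hypothetical toy data
gems <- data.frame(cut   = c("Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"),
                   price = c(100, 300, 200, 400, 600, 500))

# Same helper as in the notes
topN <- function(x, N = 5) {
  x %>% arrange(desc(price)) %>% head(N)
}

# Original approach: do() applied to each group
via_do <- gems %>% group_by(cut) %>% do(topN(., N = 2))

# Alternative: slice_max() keeps the n largest rows per group
via_slice <- gems %>%
  group_by(cut) %>%
  slice_max(price, n = 2, with_ties = FALSE)

via_do
via_slice
```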

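7.10 Using dplyr with databases

dplyr can work with data stored in a database.

A SQLite database is used as the example here.

Example:

# First download the database
> download.file("http://www.jaredlander.com/data/diamonds.db", destfile = "E:/B/R/diamonds.db", mode='wb')
# Start by creating a connection to the SQLite database.
# src_sqlite() triggers a deprecation warning here (the book attributes this to version differences), so the connection is created with DBI instead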
> diaDBSource <- src_sqlite("E:/B/R/diamonds.db")
Warning message:
`src_sqlite()` is deprecated as of dplyr 1.0.0.
Please use `tbl()` directly with a database connection
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
> diaDBSource2 <- DBI::dbConnect(RSQLite::SQLite(), "E:/B/R/diamonds.db")
> diaDBSource2
<SQLiteConnection>
  Path: E:\B\R\diamonds.db
  Extensions: TRUE
# Next, point to a specific table; here it is the diamonds table
> diaTab <- tbl(diaDBSource2, "diamonds")
> diaTab
# Source:   table<diamonds> [?? x 10]
# Database: sqlite 3.34.1 [E:\B\R\diamonds.db]
   carat cut   color clarity depth table price     x
   <dbl> <chr> <chr> <chr>   <dbl> <dbl> <int> <dbl>
 1 0.23  Ideal E     SI2      61.5    55   326  3.95
 2 0.21  Prem~ E     SI1      59.8    61   326  3.89
 3 0.23  Good  E     VS1      56.9    65   327  4.05
 4 0.290 Prem~ I     VS2      62.4    58   334  4.2 
 5 0.31  Good  J     SI2      63.3    58   335  4.34
 6 0.24  Very~ J     VVS2     62.8    57   336  3.94
 7 0.24  Very~ I     VVS1     62.3    57   336  3.95
 8 0.26  Very~ H     SI1      61.9    55   337  4.07
 9 0.22  Fair  E     VS2      65.1    61   337  3.87
10 0.23  Very~ H     VS1      59.4    61   338  4   
# ... with more rows, and 2 more variables: y <dbl>,
#   z <dbl>
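The same workflow can be tried without downloading anything by using an in-memory SQLite database. This is a sketch under several assumptions: the DBI, RSQLite and dbplyr packages are installed, and the table name gems and its contents are invented for the example.

```r
library(dplyr)  # tbl() on a connection also needs dbplyr installed

# Open a throwaway in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table
DBI::dbWriteTable(con, "gems",
                  data.frame(cut   = c("Fair", "Ideal", "Ideal"),
                             price = c(100, 400, 600)))

# Point at the table; dplyr verbs are translated to SQL
gems_tab <- tbl(con, "gems")

# collect() runs the query and pulls the result into R
ideal <- gems_tab %>%
  filter(cut == "Ideal") %>%
  collect()

DBI::dbDisconnect(con)

ideal
```

Until collect is called, the data stays in the database, which is why the printed table above shows ?? for the row count.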