R for Data Science (Notes) --- Data Transformation (Basic Usage of select)

生信小鹏

Article link: https://blog.csdn.net/lijianpeng0302/article/details/119852705

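This article introduces the select function from dplyr (part of the R tidyverse), which picks the columns of a data frame by name. select can pick specific columns, a range of consecutive columns, or everything except certain columns, and it can be combined with Boolean operators for more complex selections. Worked examples show how to use select together with helper functions: choosing one or more columns, choosing a range, excluding columns, and taking intersections and unions with Boolean operators.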

R for Data Science

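The tidy workflow is widely used for data processing in scientific research. I think this has a lot to do with the pipe operator %>% and with data-manipulation steps being expressed as verbs.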
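To make the point about the pipe concrete, here is a minimal sketch comparing a nested call with the piped form. The built-in mtcars dataset and the variable names `nested` and `piped` are assumptions chosen for illustration; they do not come from the original notes.

```r
library(dplyr)

# Nested form: read from the inside out (filter first, then select)
nested <- select(filter(mtcars, cyl == 4), mpg, hp)

# Piped form: read left to right, one verb per step
piped <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp)

# Both forms should describe the same computation
all.equal(nested, piped)
```

The piped version reads as a sequence of verbs, which is probably why this style feels natural for step-by-step data cleaning.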

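Solving the most important and most common problems in the least time is what I call efficiency; the remaining hard parts are what I call improvement.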

Using the select verb

First, to be clear:

filter operates on rows; select operates on columns.

The previous note covered filter; this one covers select.
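The row/column distinction can be checked directly. The sketch below uses the built-in mtcars dataset (an assumption for illustration, chosen because it is small) and the hypothetical names `rows_kept` and `cols_kept`.

```r
library(dplyr)

# filter() keeps the rows that satisfy a condition; every column is retained
rows_kept <- filter(mtcars, cyl == 4)

# select() keeps the named columns; every row is retained
cols_kept <- select(mtcars, mpg, cyl)

dim(mtcars)     # original shape: rows, columns
dim(rows_kept)  # fewer rows, same number of columns
dim(cols_kept)  # same number of rows, only 2 columns
```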

### Hands-on

To stress it again: select picks columns by name, and the column names do not need quotes.
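When the column names are already stored as strings in a variable, the tidyselect helper all_of() can be used instead of bare names. This is a small sketch on the built-in iris dataset; the variable name `wanted` is hypothetical.

```r
library(dplyr)

# Bare (unquoted) column names, the style used throughout these notes
a <- select(iris, Species, Sepal.Length)

# Column names held as strings in a variable, wrapped in all_of()
wanted <- c("Species", "Sepal.Length")
b <- select(iris, all_of(wanted))

# Both calls should return the same two columns in the same order
names(a)
names(b)
```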
### 1. What the data looks like
As before, the flights data from the nycflights13 package is used for the demonstration.

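library(nycflights13)  # provides the flights dataset
library(dplyr)         # provides select() and the %>% pipe
flights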
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

### 2. Selecting columns

select can pick columns by individual name, by a range written with “:”, or by exclusion with “-”.

# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#>   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#>      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
#> 1      517            515         2      830            819        11 UA     
#> 2      533            529         4      850            830        20 UA     
#> 3      542            540         2      923            850        33 AA     
#> 4      544            545        -1     1004           1022       -18 B6     
#> 5      554            600        -6      812            837       -25 DL     
#> 6      554            558        -4      740            728        12 UA     
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

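### 3. Extension 1 (Boolean operators)

“:” selects a range of consecutive variables.
“!” takes the complement of a set of variables.
“&” and “|” select the intersection or union of two sets of variables.
“c()” combines selections.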
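Of the operators listed above, c() is the only one not demonstrated on its own in these notes, so here is a minimal sketch. It assumes the built-in iris dataset; the name `picked` is hypothetical.

```r
library(dplyr)

# c() combines several selections into one;
# columns are returned in the order in which they are listed
picked <- iris %>% select(c(Species, Sepal.Length:Sepal.Width))

names(picked)
```

Because the order follows the selection, c() can also be used to move a column such as Species to the front.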

Here the starwars and iris datasets are used for the demonstration.

starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#>   name           height  mass
#>   <chr>           <int> <dbl>
#> 1 Luke Skywalker    172    77
#> 2 C-3PO             167    75
#> 3 R2-D2              96    32
#> 4 Darth Vader       202   136
#> # ... with 83 more rows

The “!” operator negates a selection:

starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#>   hair_color skin_color  eye_color birth_year sex   gender    homeworld species films     vehicles  starships
#>   <chr>      <chr>       <chr>          <dbl> <chr> <chr>     <chr>     <chr>   <list>    <list>    <list>   
#> 1 blond      fair        blue            19   male  masculine Tatooine  Human   <chr [5]> <chr [2]> <chr [2]>
#> 2 <NA>       gold        yellow         112   none  masculine Tatooine  Droid   <chr [6]> <chr [0]> <chr [0]>
#> 3 <NA>       white, blue red             33   none  masculine Naboo     Droid   <chr [7]> <chr [0]> <chr [0]>
#> 4 none       white       yellow          41.9 male  masculine Tatooine  Human   <chr [4]> <chr [0]> <chr [1]>
#> # ... with 83 more rows

iris <- as_tibble(iris)  # convert once so the results below print as tibbles
iris %>% select(!c(Sepal.Length, Petal.Length))
#> # A tibble: 150 x 3
#>   Sepal.Width Petal.Width Species
#>         <dbl>       <dbl> <fct>  
#> 1         3.5         0.2 setosa 
#> 2         3           0.2 setosa 
#> 3         3.2         0.2 setosa 
#> 4         3.1         0.2 setosa 
#> # ... with 146 more rows


iris %>% select(!ends_with("Width"))
#> # A tibble: 150 x 3
#>   Sepal.Length Petal.Length Species
#>          <dbl>        <dbl> <fct>  
#> 1          5.1          1.4 setosa 
#> 2          4.9          1.4 setosa 
#> 3          4.7          1.3 setosa 
#> 4          4.6          1.5 setosa 
#> # ... with 146 more rows

“&” and “|” take the intersection or union of two selections:

iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#>   Petal.Width
#>         <dbl>
#> 1         0.2
#> 2         0.2
#> 3         0.2
#> 4         0.2
#> # ... with 146 more rows

iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#>   Petal.Length Petal.Width Sepal.Width
#>          <dbl>       <dbl>       <dbl>
#> 1          1.4         0.2         3.5
#> 2          1.4         0.2         3  
#> 3          1.3         0.2         3.2
#> 4          1.5         0.2         3.1
#> # ... with 146 more rows

Combining operators:

iris %>% select(starts_with("Petal") & !ends_with("Width"))
#> # A tibble: 150 x 1
#>   Petal.Length
#>          <dbl>
#> 1          1.4
#> 2          1.4
#> 3          1.3
#> 4          1.5
#> # ... with 146 more rows
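One way to read these operators is as set operations on column names. The sketch below checks that reading against base R's intersect(), union() and setdiff() on the iris dataset. All variable names are hypothetical, and this illustrates the behavior seen in the examples above rather than how tidyselect works internally.

```r
library(dplyr)

# The two basic selections, as vectors of column names
petal <- names(select(iris, starts_with("Petal")))
width <- names(select(iris, ends_with("Width")))

# The same selections combined with Boolean operators inside select()
both   <- names(select(iris, starts_with("Petal") & ends_with("Width")))
either <- names(select(iris, starts_with("Petal") | ends_with("Width")))
rest   <- names(select(iris, !ends_with("Width")))

# Compare with the corresponding set operations on the name vectors
setequal(both,   intersect(petal, width))       # "&" behaves like intersection
setequal(either, union(petal, width))           # "|" behaves like union
setequal(rest,   setdiff(names(iris), width))   # "!" behaves like complement
```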