R for Data Science Summary: dplyr



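dplyr is a very popular R package for data manipulation. It covers the familiar SQL-style operations (inserting, deleting, updating, and querying data) as well as aggregate computation. The test dataset used in this article is nycflights13::flights, loaded with: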

library(dplyr)
library(nycflights13)

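It is also worth loading Hadley Wickham's tidyverse directly. It is a collection of data-processing packages that includes dplyr, purrr, tidyr, tibble, ggplot2, and other commonly used packages: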

library(tidyverse)

filter()

filter() selects the subset of rows in a dataset that satisfy one or more conditions. Usage:

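# Keep only the flights that departed on January 1
filter(flights, month == 1, day == 1)
# Save the result to a variable (nothing is printed)
jan1 <- filter(flights, month == 1, day == 1)
# Wrapping the assignment in parentheses saves the result and prints it
(dec25 <- filter(flights, month == 12, day == 25))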

Take care when comparing numbers in logical expressions:

sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE

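This happens because computers store numbers with finite precision, so "==" is unreliable for floating-point comparisons. Use near() instead: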

near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
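
To make the reason concrete, the following minimal sketch looks at the size of the rounding error and compares it with the tolerance that near() uses. It assumes dplyr is installed; the variable names err and tol are illustrative only.

```r
library(dplyr)

# The rounding error left over after squaring sqrt(2)
err <- abs(sqrt(2)^2 - 2)
err

# The error is tiny but not exactly zero, which is why == reports FALSE
err == 0

# near() treats two numbers as equal when they differ by less than `tol`;
# its default tolerance is .Machine$double.eps^0.5
tol <- .Machine$double.eps^0.5
err < tol
near(sqrt(2)^2, 2, tol = tol)

# Base R offers a comparable tolerance-based check via all.equal()
isTRUE(all.equal(1 / 49 * 49, 1))
```

A tolerance-based comparison is usually the safer choice whenever either side of the comparison is the result of floating-point arithmetic.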

The %in% operator tests whether a value belongs to a set. The following two calls select the same rows:

filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
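
The equivalence can be checked on a small table. The sketch below uses a hypothetical toy tibble in place of flights (the column values are made up for illustration) and also shows a shortcut that looks plausible but does not do what it seems to.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(
  month  = c(1, 11, 12, 6, 11),
  flight = c(101, 202, 303, 404, 505)
)

by_or <- filter(toy, month == 11 | month == 12)
by_in <- filter(toy, month %in% c(11, 12))

# Both approaches keep the same flights
identical(by_or$flight, by_in$flight)

# Pitfall: this parses as (month == 11) | 12, and a non-zero number
# counts as TRUE, so the condition is TRUE for every row
all_rows <- filter(toy, month == 11 | 12)
nrow(all_rows) == nrow(toy)
```

When the set of allowed values grows, %in% stays readable while a chain of | comparisons does not.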
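
The "!" operator negates a condition. By De Morgan's law, !(x | y) is the same as !x & !y, so the following two calls return the same result: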

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
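
As a quick check of that equivalence, here is a minimal sketch on a hypothetical toy tibble (the delay values are invented for illustration):

```r
library(dplyr)

# Hypothetical delays in minutes
toy <- tibble(
  id        = c("a", "b", "c", "d"),
  arr_delay = c(10, 130, 50, 121),
  dep_delay = c(5, 20, 150, 125)
)

# Negate "either delay exceeds two hours"
negated <- filter(toy, !(arr_delay > 120 | dep_delay > 120))

# Require both delays to be at most two hours
# (comma-separated conditions in filter() are combined with "and")
expanded <- filter(toy, arr_delay <= 120, dep_delay <= 120)

# Only flight "a" survives in both cases
negated$id
identical(negated$id, expanded$id)
```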

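For missing values (NA), use is.na() to test for them. Logical comparisons such as "==" cannot be used, because any comparison involving NA returns NA rather than TRUE or FALSE.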
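
The behavior of NA in comparisons, and its effect on filter(), can be sketched as follows. The tibble df is a hypothetical example, not part of the flights data.

```r
library(dplyr)

# Any comparison involving NA yields NA, even NA == NA
NA == NA
NA > 5

# is.na() is the reliable way to detect missing values
x <- c(1, NA, 3)
is.na(x)

df <- tibble(x = c(1, NA, 3))

# filter() keeps only rows where the condition is TRUE,
# so rows where the condition is NA are dropped
kept <- filter(df, x > 1)
kept

# To keep the missing values as well, ask for them explicitly
kept_with_na <- filter(df, is.na(x) | x > 1)
kept_with_na
```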

arrange()

arrange() sorts the rows of a dataset by one or more columns, for example:

arrange(flights, year, month, day)
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     1      517            515         2      830
#> 2  2013     1     1      533            529         4      850
#> 3  2013     1     1      542            540         2      923
#> 4  2013     1     1      544            545        -1     1004
#> 5  2013     1     1      554            600        -6      812
#> 6  2013     1     1      554            558        -4      740
#> # ... with 3.368e+05 more rows, and 12 more variables:
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
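
How several sort columns interact is easier to see on a small table. The sketch below uses a hypothetical toy tibble (values invented for illustration): rows are ordered by the first column, and later columns only break ties. It also shows that arrange() places missing values at the end.

```r
library(dplyr)

# Hypothetical toy data; the id column just labels the rows
toy <- tibble(
  month = c(2, 1, 1, 2),
  day   = c(1, 15, 3, NA),
  id    = c("a", "b", "c", "d")
)

# Sort by month first; ties within a month are broken by day
sorted <- arrange(toy, month, day)
sorted

# Row order after sorting; the row with a missing day comes last in its month
sorted$id
```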