i2ds: tidyverse notes

零级伪码农

Category: Notes. Tags: R, data analysis

Article link: https://blog.csdn.net/weixin_46585008/article/details/109667901

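This post covers why the tidyverse packages matter for data handling in R and how to use them: converting data to tidy format, manipulating data frames with dplyr (adding columns, subsetting, sorting), simplifying workflows with the pipe operator, and summarizing data with group_by and summarize. It also introduces tibbles, including their friendlier display and support for complex entries, and shows how functions in the purrr package extend data processing.

(Summary auto-generated by CSDN.)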

Tidy data

#>       country year fertility
#> 1     Germany 1960      2.41
#> 2 South Korea 1960      6.16
#> 3     Germany 1961      2.44
#> 4 South Korea 1961      5.99
#> 5     Germany 1962      2.47
#> 6 South Korea 1962      5.79

This is a tidy dataset because each row presents one observation with the three variables being country, year, and fertility rate.

#>       country 1960 1961 1962
#> 1     Germany 2.41 2.44 2.47
#> 2 South Korea 6.16 5.99 5.79

The same information is provided, but there are two important differences in the format: 1) each row includes several observations and 2) one of the variables, year, is stored in the header.

For the tidyverse packages to be optimally used, data need to be reshaped into tidy format.
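As a minimal sketch of such a reshaping, the wide table above can be converted to tidy format with pivot_longer from the tidyr package. The object names wide and tidy are chosen here for illustration; the values are copied from the tables above.

```r
library(tidyr)

# Wide table from the example above. check.names = FALSE keeps the year
# column names ("1960", ...) from being rewritten to "X1960", ...
wide <- data.frame(
  country = c("Germany", "South Korea"),
  "1960" = c(2.41, 6.16),
  "1961" = c(2.44, 5.99),
  "1962" = c(2.47, 5.79),
  check.names = FALSE
)

# Move the year from the header into its own column: one row per
# country-year observation.
tidy <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "fertility")

# Column names are strings, so convert the year back to a number.
tidy$year <- as.integer(tidy$year)
tidy
```

The result has the same six observations as the tidy table shown earlier, although the rows come out grouped by country rather than by year.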

Exercises

1. Examine the built-in dataset co2, which is not tidy: to be tidy, it would have to be wrangled into three columns (year, month, and value) so that each co2 observation has its own row.
2. Examine the built-in dataset ChickWeight, which is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.
3. Examine the built-in dataset BOD, which is tidy: each row is an observation with two values (time and demand).
4. Which of the following built-in datasets are tidy (you can pick more than one):

a. BJsales
b. EuStockMarkets
c. DNase
d. Formaldehyde
e. Orange
f. UCBAdmissions

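Answer: c, d, e (DNase, Formaldehyde, and Orange). BJsales and EuStockMarkets are time series objects and UCBAdmissions is a three-dimensional table, so none of them stores one observation per row with each variable in its own column.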
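One quick way to start examining these datasets is to check how each one is stored. The sketch below uses only base R; the names candidates and is_df are illustrative. Being a data frame is only a hint, not proof of tidiness, so the rows and columns still need to be inspected with head or str.

```r
candidates <- c("BJsales", "EuStockMarkets", "DNase",
                "Formaldehyde", "Orange", "UCBAdmissions")

# get() looks each dataset up by name in the attached datasets package.
is_df <- vapply(candidates, function(nm) is.data.frame(get(nm)), logical(1))
is_df

# Inspect the storage class of the ones that are not data frames.
class(BJsales)
class(UCBAdmissions)
```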

Manipulating data frames

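The dplyr package from the tidyverse provides functions for the most common data frame operations, with names that are easy to remember. For instance, to change the data table by adding a new column, we use mutate. To filter the data table to a subset of rows, we use filter. Finally, to subset the data by selecting specific columns, we use select.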

Adding a column with `mutate`

The function mutate takes the data frame as a first argument and the name and values of the variable as a second argument using the convention name = values.

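library(dplyr)   # provides mutate, filter, and select
library(dslabs)  # provides the murders dataset
data("murders")
# Murder rate per 100,000 people
murders <- mutate(murders, rate = total / population * 100000)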

Subsetting with `filter`

Suppose we want to keep only the rows in which the murder rate is no higher than 0.71. To do this we use the filter function, which takes the data table as the first argument and the conditional statement as the second.

filter(murders, rate <= 0.71)

Selecting columns with `select`

If we want to view just a few columns, we can use the dplyr select function.

new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)

Unlike select, which picks columns, filter picks rows.

Exercises

Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with filter. Here is an example in which we filter to keep only small states in the Northeast region.

filter(murders, population < 5000000 & region == "Northeast")

Make sure murders has been defined with rate and rank and still has all states. Create a table called my_states that contains rows for states satisfying both the conditions: it is in the Northeast or West and the murder rate is less than 1. Use select to show only the state name, the rate, and the rank.
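The steps in this exercise can be sketched on a small stand-in table. Everything in toy is invented for illustration (hypothetical states A to D, not real figures), and defining rank as rank(-rate), so that the highest rate gets rank 1, is an assumption about how the rank column was created.

```r
library(dplyr)

# Toy stand-in for murders: all values are invented for illustration.
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("Northeast", "West", "South", "West"),
  population = c(1000000, 2000000, 4000000, 5000000),
  total = c(5, 40, 16, 30)
)

# Add the rate per 100,000 and a rank (assumed: highest rate = rank 1).
toy <- mutate(toy, rate = total / population * 100000, rank = rank(-rate))

# Keep rows in the Northeast or West with a rate below 1.
my_states <- filter(toy, region %in% c("Northeast", "West") & rate < 1)

# Show only the requested columns.
select(my_states, state, rate, rank)
```

Using %in% keeps the region condition short; it is equivalent to combining two == comparisons with |.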

The pipe: `%>%`

With the pipe we can chain operations without creating intermediate objects such as new_table:

original data $\rightarrow$ select $\rightarrow$ filter

In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe, so we write any remaining arguments as if the first argument were already supplied.
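A minimal sketch of this behavior, using a small invented table (toy, with hypothetical states and rates) in place of murders, compares the piped form with the equivalent nested calls:

```r
library(dplyr)

# Numeric warm-up: the value on the left becomes the first argument.
x <- 16 %>% sqrt() %>% log2()   # same as log2(sqrt(16))

# Toy table with invented values, for illustration only.
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  region = c("Northeast", "West", "South", "West"),
  rate = c(0.5, 2.0, 0.4, 0.6)
)

# Piped: original data -> select -> filter.
piped <- toy %>%
  select(state, region, rate) %>%
  filter(rate <= 0.71)

# The same computation written as nested calls.
nested <- filter(select(toy, state, region, rate), rate <= 0.71)

identical(piped, nested)
```

The piped version reads in the order the operations happen, which is the main reason to prefer it for longer chains.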

Summarizing data

`summarize`

The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code.

library(dplyr)
library(dslabs)
data(heights)

s <- heights %>% 
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
s
#>   average standard_deviation
#> 1    64.9               3.76
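summarize can compute any number of statistics in one call, as long as each returns a single value, and the result is itself a data frame. The sketch below illustrates this on a toy table; toy_heights and its values are invented for illustration and are not the dslabs heights data.

```r
library(dplyr)

# Hypothetical heights in inches, invented for illustration.
toy_heights <- data.frame(
  sex = c("Female", "Female", "Female", "Male", "Male"),
  height = c(62, 65, 68, 70, 72)
)

s <- toy_heights %>%
  filter(sex == "Female") %>%
  summarize(median = median(height),
            minimum = min(height),
            maximum = max(height))

class(s)    # the summary is a data frame with one row
s$median    # individual values are accessed like any other column
```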

us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum
