Tidy data
#>       country year fertility
#> 1     Germany 1960      2.41
#> 2 South Korea 1960      6.16
#> 3     Germany 1961      2.44
#> 4 South Korea 1961      5.99
#> 5     Germany 1962      2.47
#> 6 South Korea 1962      5.79
This is a tidy dataset because each row presents one observation with the three variables being country, year, and fertility rate.
#>       country 1960 1961 1962
#> 1     Germany 2.41 2.44 2.47
#> 2 South Korea 6.16 5.99 5.79
The same information is provided, but there are two important differences in the format: 1) each row includes several observations and 2) one of the variables, year, is stored in the header.
For the tidyverse packages to be optimally used, data need to be reshaped into tidy format.
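As a rough illustration of that reshaping, the sketch below rebuilds the wide table shown above and converts it to the tidy layout. The text does not name a reshaping function, so the use of pivot_longer from the tidyr package is an assumption made here, as are the object names wide and tidy.

```r
library(dplyr)
library(tidyr)

# Wide table from the text: one row per country, one column per year.
# check.names = FALSE keeps the year column names as "1960", "1961", "1962".
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE
)

# Move the year out of the header and into its own column,
# so that each row holds exactly one observation.
tidy <- wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "fertility") %>%
  mutate(year = as.integer(year))

tidy
```

The result has six rows and the three columns country, year, and fertility, matching the tidy table at the start of this section (row order may differ).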
Exercises
- Examine the built-in dataset co2, which is not tidy: to be tidy, we would have to wrangle it into three columns (year, month, and value), so that each co2 observation has its own row.
- Examine the built-in dataset ChickWeight, which is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.
- Examine the built-in dataset BOD, which is tidy: each row is an observation with two values (time and demand).
- Which of the following built-in datasets are tidy (you can pick more than one):
a. BJsales
b. EuStockMarkets
c. DNase
d. Formaldehyde
e. Orange
f. UCBAdmissions
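Answer: c, d, and e (DNase, Formaldehyde, and Orange). BJsales and EuStockMarkets are time series objects, and UCBAdmissions is a three-dimensional table.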
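When working through the last exercise, it can help to check how each object is stored before looking at its contents. The sketch below is only a structural check, not a test of tidiness: a data frame can still be untidy, but an object that is not a data frame at all (a time series or a contingency table, say) cannot be in the one-row-per-observation layout. The names candidates and is_df are introduced here for illustration.

```r
# All six objects ship with base R's datasets package
candidates <- list(
  BJsales = BJsales,
  EuStockMarkets = EuStockMarkets,
  DNase = DNase,
  Formaldehyde = Formaldehyde,
  Orange = Orange,
  UCBAdmissions = UCBAdmissions
)

# TRUE where the object is stored as a data frame
is_df <- vapply(candidates, is.data.frame, logical(1))
is_df

# For the data frames, head() shows whether each row is one observation
head(DNase)
```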
Manipulating data frames
The dplyr package provides a function for each common operation on data frames. For instance, to change the data table by adding a new column, we use mutate. To filter the data table to a subset of rows, we use filter. Finally, to subset the data by selecting specific columns, we use select.
Adding a column with mutate
The function mutate takes the data frame as a first argument and the name and values of the variable as a second argument using the convention name = values.
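library(dplyr)   # provides mutate, filter, and select
library(dslabs)  # provides the murders dataset
data("murders")
# Add the murder rate per 100,000 people as a new column
murders <- mutate(murders, rate = total / population * 100000)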
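Note that total and population are not objects in the workspace: mutate looks them up as columns of the data frame passed as its first argument. A minimal sketch of the same computation on a made-up table, where the arithmetic is easy to check by hand (the data frame toy and its values are hypothetical and are not taken from murders):

```r
library(dplyr)

# Hypothetical toy data, chosen so the rates are easy to verify
toy <- data.frame(
  state = c("A", "B"),
  total = c(10, 30),
  population = c(1e6, 2e6)
)

# total and population are found inside toy, not in the workspace
toy <- mutate(toy, rate = total / population * 100000)

toy$rate
```

Here 10 / 1,000,000 * 100,000 is 1 and 30 / 2,000,000 * 100,000 is 1.5, so the new rate column holds 1 and 1.5.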
Subsetting with filter
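To keep only the rows that satisfy a condition, for example the states with a murder rate of at most 0.71 per 100,000, we use the filter function, which takes the data table as the first argument and then the conditional statement as the second.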
filter(murders, rate <= 0.71)
Selecting columns with select
If we want to view just a few columns, we can use the dplyr select function.
new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
Unlike select, which is for columns, filter is for rows.
Exercises
- Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with filter. Here is an example in which we filter to keep only small states in the Northeast region.
filter(murders, population < 5000000 & region == "Northeast")
Make sure murders has been defined with rate and rank and still has all states. Create a table called my_states that contains rows for states satisfying both conditions: the state is in the Northeast or West, and its murder rate is less than 1. Use select to show only the state name, the rate, and the rank.
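One possible approach to this exercise is sketched below. It assumes that rank is meant to order states from the highest murder rate (rank 1) to the lowest, computed as rank(-rate); that definition is not given in this section, so treat it as an assumption.

```r
library(dplyr)
library(dslabs)
data("murders")

# Start from the full dataset and add both columns.
# Assumption: rank 1 is the highest murder rate.
murders <- mutate(murders,
                  rate = total / population * 100000,
                  rank = rank(-rate))

# Keep states in either region whose rate is below 1
my_states <- filter(murders,
                    region %in% c("Northeast", "West") & rate < 1)

# Show only the requested columns
select(my_states, state, rate, rank)
```

Using %in% keeps the region condition compact; writing (region == "Northeast" | region == "West") & rate < 1 would select the same rows.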
The pipe: %>%
original data → select → filter
In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. So we can define other arguments as if the first argument is already defined.
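A small numeric sketch of this behaviour, using a plain number instead of a data frame so the result can be checked by hand. The object names root, result, and nested are introduced here for illustration.

```r
library(dplyr)  # re-exports the %>% pipe from magrittr

# 16 becomes the first argument of sqrt(), so this is sqrt(16)
root <- 16 %>% sqrt()

# The piped value fills the first argument of log();
# base is supplied as if that first argument were already there
result <- 16 %>% sqrt() %>% log(base = 2)

# Equivalent nested call, which has to be read inside-out
nested <- log(sqrt(16), base = 2)
```

Both result and nested equal 2, since the square root of 16 is 4 and the base-2 logarithm of 4 is 2. The piped version reads left to right in the order the operations are applied.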
Summarizing data
summarize
The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code.
library(dplyr)
library(dslabs)
data(heights)
s <- heights %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
s
#>   average standard_deviation
#> 1    64.9               3.76
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum