- Manipulating Data with dplyr Package
- Grouping and Chaining with dplyr package
- Tidying Data with tidyr package
- Type one: column headers are values, not variable names
- Type two: multiple variables are stored in one column
- Type three: variables are stored in both rows and columns
- Type four: multiple types of observational units are stored in the same table
- Type five: a single observational unit is stored in multiple tables
- Bringing the five types together to deal with real data
- Dates and Times with lubridate
More details can be found in the html file here.
Manipulating Data with `dplyr` Package
`dplyr` is a fast and powerful R package written by Hadley Wickham and Romain Francois. The `dplyr` philosophy is to have small functions that each do one thing well.
One unique aspect of `dplyr` is that the same set of tools allows you to work with tabular data from a variety of sources, including:
- data frames
- data tables
- databases
- multidimensional arrays
- Step 1: download the data and load it into R.
if(!file.exists("data")) dir.create("data")
fileUrl <- "http://cran-logs.rstudio.com/2014/2014-07-08.csv.gz"
download.file(fileUrl, "./data/path2csv.csv")
mydf <- read.csv("./data/path2csv.csv", stringsAsFactors = FALSE)
- Step 2: check the dimensions of the data and preview the first rows.
dim(mydf)
head(mydf)
- Step 3: load the `dplyr` package.
library(dplyr)
# check version: you need to have version 0.4.0 or later
packageVersion("dplyr")
From now on, we will focus on how to manipulate data with the `dplyr` package.
- Step 4: The first step of working with data in `dplyr` is to load the data into what the package authors call a “data frame tbl” or “tbl_df”.
cran <- tbl_df(mydf)
- Step 5: The main advantage of using a `tbl_df` over a regular data frame is the printing. The output of a `tbl_df` is much more informative and compact than what we would get if we printed the original data frame (`mydf`) to the console. (`dplyr` shows us the first 10 rows of data and only as many columns as fit neatly in our console. At the bottom, we see the names and classes of any variables that didn’t fit on our screen.)
cran
head(mydf) # use `head()` so the full data frame is not printed to the console
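The printing difference can be tried without downloading anything. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the CRAN log). It assumes a recent `dplyr` release, where `tbl_df()` is deprecated and `as_tibble()` does the same job.

```r
library(dplyr)

# Hypothetical toy data standing in for the CRAN log (not the real file)
toy <- data.frame(
  ip_id   = 1:25,
  package = rep(c("swirl", "dplyr", "tidyr", "ggplot2", "knitr"), 5),
  size    = seq(1000, 25000, by = 1000),
  stringsAsFactors = FALSE
)

# as_tibble() plays the role that tbl_df() plays in Step 4
toy_tbl <- as_tibble(toy)

class(toy)      # a plain data frame
class(toy_tbl)  # adds the "tbl_df" and "tbl" classes on top of "data.frame"

# Printing the tbl shows the dimensions, the column types,
# and only the first 10 of the 25 rows
print(toy_tbl)
```

Printing `toy` itself would list all 25 rows, which is why the original data frame is wrapped in `head()` above.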
Next, we will focus on the five manipulation tasks:
- `select()`
- `filter()`
- `arrange()`
- `mutate()`
- `summarize()`
- Step 6: select three variables of cran. (We don’t need to write `cran$ip_id` in `dplyr`, so the `$` can be omitted.)
select(cran, ip_id, package, country)
- Step 7: select a sequence of columns. (The `:` operator selects every column from `r_arch` through `country`.)
select(cran, r_arch:country)
- Step 8: throw away one column. (The negative sign in front of `time` tells us we don’t want the time column.)
select(cran, -time)
- Step 9: throw away multiple columns.
select(cran, -(date:size))
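To see exactly which columns each form of `select()` keeps, here is a small sketch on a hypothetical two-row data frame. The column names mimic part of the CRAN log, but the values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in with a subset of the log's column names
toy <- data.frame(
  date    = "2014-07-08",
  time    = c("00:54:41", "00:59:53"),
  size    = c(80589L, 321767L),
  package = c("htmltools", "tseries"),
  country = c("US", "US"),
  ip_id   = 1:2,
  stringsAsFactors = FALSE
)

# Named columns come back in the order they were requested
names(select(toy, ip_id, package, country))  # "ip_id" "package" "country"

# A range keeps every column from size through country
names(select(toy, size:country))             # "size" "package" "country"

# A minus sign drops a single column
names(select(toy, -time))                    # "date" "size" "package" "country" "ip_id"

# A minus sign in front of a parenthesized range drops the whole range
names(select(toy, -(date:size)))             # "package" "country" "ip_id"
```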
- Step 10: use the `filter()` function to select all rows for which the package variable is equal to “swirl”.
filter(cran, package == "swirl")
- Step 11: filter on multiple conditions. (Note that the conditions are separated by commas, and a row is kept only if it satisfies all of them.)
filter(cran, r_version == "3.1.1", country == "US")
- Step 12: filter rows corresponding to users in “IN” running an R version that is less than or equal to “3.0.2”
filter(cran, country == "IN", r_version <= "3.0.2")
- Step 13: filter rows corresponding to users in “US” or “IN”.
filter(cran, country == "US" | country == "IN")
- Step 14: filter rows for which size is strictly greater than 100500 and r_os equals “linux-gnu”
filter(cran, size > 100500, r_os == "linux-gnu")
- Step 15: filter the rows for which the r_version is not missing.
filter(cran, !is.na(r_version))
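One caveat about Step 12: `r_version` is stored as text, so `<=` compares the strings character by character rather than as version numbers. The sketch below uses a made-up data frame, including a hypothetical version string "3.0.10" that is not a real R release, to show where the two orderings disagree. It also shows `%in%` as a shorter spelling of the "or" condition from Step 13. The toy data has no missing values; the real log does (see Step 15), so those rows would need to be handled before parsing versions this way.

```r
library(dplyr)

# Hypothetical data; "3.0.10" is an invented version used only for illustration
toy <- data.frame(
  country   = c("US", "IN", "IN", "DE"),
  r_version = c("3.1.1", "3.0.2", "3.0.10", "3.0.1"),
  stringsAsFactors = FALSE
)

# %in% is equivalent to country == "US" | country == "IN"
us_or_in <- filter(toy, country %in% c("US", "IN"))

# As strings, "2.15.3" sorts before "2.9.2" because "1" comes before "9"
"2.15.3" <= "2.9.2"                                     # TRUE
# Parsed as versions, 2.15.3 is the later release
numeric_version("2.15.3") <= numeric_version("2.9.2")   # FALSE

# String comparison keeps the "3.0.10" row ...
by_string  <- filter(toy, country == "IN", r_version <= "3.0.2")
# ... while comparing parsed versions does not
by_version <- filter(toy, country == "IN", numeric_version(r_version) <= "3.0.2")

nrow(by_string)   # 2
nrow(by_version)  # 1
```

For the versions in the 2014 log the string comparison is likely to give the intended answer, but parsing with `numeric_version()` is the safer habit when two-digit components can appear.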
- Step 16: Sometimes we want to reorder the rows of a dataset according to the value of a particular variable. Reorder cran2 such that ip_id is in ascending order.
cran2 <- select(cran, size:ip_id)
arrange(cran2, ip_id)
- Step 17: reorder cran2 such that ip_id is in descending order.
arrange(cran2, desc(ip_id))
- Step 18: reorder cran2 using multiple variables.
arrange(cran2, package, ip_id)
- Step 19: reorder cran2 using this order: `country` (ascending), `r_version` (descending), `ip_id` (ascending).
arrange(cran2, country, desc(r_version), ip_id)
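With several sort keys, each later key only breaks ties left by the earlier ones. A minimal sketch on hypothetical data (five made-up rows, not the CRAN log) makes the resulting order easy to verify by hand:

```r
library(dplyr)

# Hypothetical rows, deliberately out of order
toy <- data.frame(
  country   = c("US", "IN", "US", "IN", "IN"),
  r_version = c("3.0.2", "3.1.0", "3.1.1", "3.0.2", "3.1.0"),
  ip_id     = c(4L, 2L, 3L, 1L, 5L),
  stringsAsFactors = FALSE
)

# country ascending, then r_version descending, then ip_id ascending
sorted <- arrange(toy, country, desc(r_version), ip_id)

sorted$country  # "IN" "IN" "IN" "US" "US"
sorted$ip_id    # 2 5 1 3 4
```

The two "IN" rows with version "3.1.0" are tied on the first two keys, so `ip_id` decides that 2 comes before 5.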
- Step 20: add a column called `size_mb` that contains the download size in megabytes.
cran3 <- select(cran, ip_id, package, size)
cran3
mutate(cran3, size_mb = size / 2^20)
- Step 21: One very nice feature of `mutate()` is that you can use the value computed for your second column (`size_mb`) to create a third column (`size_gb`).
mutate(cran3, size_mb = size/2^20, size_gb = size_mb/2^10)
- Step 22: add a new variable “correct_size = size+1000”
mutate(cran3, correct_size = size + 1000)
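The arithmetic in Steps 20 and 21 is easy to check on sizes chosen to divide evenly. This sketch uses a hypothetical two-row data frame, and also shows that `mutate()` returns a new table rather than changing its input.

```r
library(dplyr)

# Hypothetical sizes: exactly 1 MB and exactly 3 GB, in bytes
toy <- data.frame(
  package = c("a", "b"),
  size    = c(2^20, 3 * 2^30),
  stringsAsFactors = FALSE
)

# size_gb is computed from size_mb, which is defined in the same call
res <- mutate(toy, size_mb = size / 2^20, size_gb = size_mb / 2^10)

res$size_mb  # 1 3072
res$size_gb  # 0.0009765625 3

# The input is untouched; assign the result to keep the new columns
"size_mb" %in% names(toy)  # FALSE
```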
- Step 23: `summarize()` collapses the dataset to a single row. Calculate the average download size.
summarize(cran, ave_bytes = mean(size))
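`summarize()` can compute several summaries in one call, each becoming a column of the single output row. A small sketch on hypothetical sizes (the column names `max_bytes` and `n_rows` are chosen here for illustration):

```r
library(dplyr)

# Hypothetical download sizes in bytes
toy <- data.frame(size = c(100, 200, 600))

# Each name = expression pair produces one column in the one-row result
res <- summarize(toy,
                 ave_bytes = mean(size),
                 max_bytes = max(size),
                 n_rows    = n())

nrow(res)      # 1
res$ave_bytes  # 300
res$max_bytes  # 600
res$n_rows     # 3
```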
Grouping and Chaining with `dplyr` package
The main idea behind grouping data is that you want to break up your dataset into groups of rows based on the values of one or more variables. The `group_by()` function is responsible for doing this.
- Step 1: group cran by the package variable and store the result in a new variable called by_package. (At the top of the output, you’ll see ‘Groups: package’. Everything else looks the same, but now any operation we apply to the grouped data will take place on a per-package basis.)
by_package <- group_by(cran, package)
by_package
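To make the "per package" behavior concrete, here is a minimal sketch on a hypothetical three-row data frame. It inspects the grouping and then uses `mutate()` to show that `sum(size)` is evaluated within each group rather than over the whole table. The column name `size_share` is invented for this example.

```r
library(dplyr)

# Hypothetical downloads: two for "swirl", one for "dplyr"
toy <- data.frame(
  package = c("swirl", "swirl", "dplyr"),
  size    = c(100, 300, 50),
  stringsAsFactors = FALSE
)

by_pkg <- group_by(toy, package)

group_vars(by_pkg)  # "package"
n_groups(by_pkg)    # 2

# sum(size) is computed separately for each package
shares <- mutate(by_pkg, size_share = size / sum(size))
shares$size_share   # 0.25 0.75 1.00
```

Without the grouping, the same `mutate()` would divide every size by the overall total of 450.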
- Step 2: use `summarize()` on by_package to compute `mean(size)` for each package.