[Getting and Cleaning data] swirl

Latest recommended article published on 2018-09-21 16:39:17

艳艳儿

Categories: data science, statistics, R, coursera. Tags: R

Article link: https://blog.csdn.net/COMEYAN/article/details/50918376

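Summary (generated automatically by CSDN): This article introduces data manipulation with the R package `dplyr`, covering loading, selecting, filtering, sorting, and rearranging data, and shows how to reshape data into tidy format. With `dplyr` it is easy to work with data frames, data tables, and databases. The article also discusses five different kinds of messy-data problems and how to solve them, and uses the `lubridate` package to handle date and time data.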

Manipulating Data with dplyr Package
Grouping and Chaining with dplyr package
Tidying Data with tidyr package
Dates and Times with lubridate

More details can be found in the html file here.

Manipulating Data with `dplyr` Package

dplyr is a fast and powerful R package written by Hadley Wickham and Romain Francois. The dplyr philosophy is to have small functions that each do one thing well.

One unique aspect of dplyr is that the same set of tools allows you to work with tabular data from a variety of sources, including

data frames
data tables
databases
multidimensional arrays
Step 1: download data and load it to R.

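# create the data directory if it does not exist yet
if (!file.exists("data")) dir.create("data")
fileUrl <- "http://cran-logs.rstudio.com/2014/2014-07-08.csv.gz"
# the download is gzip-compressed, so write it in binary mode
download.file(fileUrl, "./data/path2csv.csv", mode = "wb")
# read.csv() detects the gzip compression and decompresses while reading
mydf <- read.csv("./data/path2csv.csv", stringsAsFactors = FALSE)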

Step 2: check the dimensions and look at the first few rows.

dim(mydf)
head(mydf)

Step 3: load the dplyr package.

library(dplyr)
# check version: you need to have version 0.4.0 or later
packageVersion("dplyr")

From now on, we will focus on how to manipulate data with the dplyr package.

Step 4: The first step of working with data in dplyr is to load the data into what the package authors call a “data frame tbl” or “tbl_df”.

# tbl_df() is deprecated in dplyr 1.0.0 and later; as_tibble(mydf) is the current equivalent
cran <- tbl_df(mydf)

Step 5: The main advantage of using a tbl_df over a regular data frame is the printing. The output of a tbl_df is much more informative and compact than what we would get if we printed the original data frame (mydf) to the console. (dplyr shows us the first 10 rows of data and only as many columns as fit neatly in our console. At the bottom, we see the names and classes of any variables that didn’t fit on our screen.)

cran
head(mydf)  # use head() so the plain data frame does not flood the console
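
A minimal sketch of the class difference behind this printing behaviour, using a small made-up data frame (`toy` and its values are hypothetical, not the CRAN log). It assumes a recent dplyr, where `as_tibble()` is the replacement for `tbl_df()`.

```r
library(dplyr)

# Hypothetical stand-in for mydf: 50 rows, 3 columns
toy <- data.frame(
  ip_id   = 1:50,
  package = rep(c("swirl", "dplyr"), 25),
  size    = seq(1000, by = 500, length.out = 50),
  stringsAsFactors = FALSE
)

toy_tbl <- as_tibble(toy)

class(toy)      # a plain data frame
class(toy_tbl)  # still a data frame, but it also carries the tbl_df class

toy_tbl         # prints only the first 10 rows, with the column types
```

Printing `toy` itself would write all 50 rows to the console, which is why the step above wraps `mydf` in `head()`.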

Next, we will focus on the five manipulation tasks:

- `select()`
- `filter()`
- `arrange()`
- `mutate()`
- `summarize()`

Step 6: select three variables of cran. (In dplyr we don’t need to write cran$ip_id; bare column names are enough, so the $ can be dropped.)

select(cran, ip_id, package, country)

Step 7: select a sequence of columns. (The : operator selects every column from r_arch through country.)

select(cran, r_arch:country)

Step 8: throw away one column. (The minus sign in front of time tells select() that we don’t want the time column.)

select(cran, -time)

Step 9: throw away multiple columns.

select(cran, -(date:size))
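
The four `select()` forms above can be tried without downloading the log. The sketch below uses a made-up two-row table (`toy`, with invented values) whose columns follow the same order as the CRAN log, and looks only at the column names that come back.

```r
library(dplyr)

# Hypothetical miniature of cran; values are invented for illustration
toy <- data.frame(
  date = "2014-07-08", time = c("00:54:41", "00:59:53"),
  size = c(80589, 321767), r_version = c("3.1.0", "3.1.0"),
  r_arch = "x86_64", r_os = c("mingw32", "linux-gnu"),
  package = c("htmltools", "swirl"), version = c("0.2.4", "2.2.9"),
  country = c("US", "IN"), ip_id = 1:2,
  stringsAsFactors = FALSE
)

picked   <- select(toy, ip_id, package, country)  # columns in the order requested
seq_cols <- select(toy, r_arch:country)           # a contiguous range of columns
no_time  <- select(toy, -time)                    # everything except time
trimmed  <- select(toy, -(date:size))             # drops date, time and size

names(picked)
names(seq_cols)
names(no_time)
names(trimmed)
```

Note that `select()` returns columns in the order they are named, which is why `ip_id` comes first in `picked` even though it is the last column of `toy`.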

Step 10: use filter function to select all rows for which the package variable is equal to “swirl”.

filter(cran, package == "swirl")

Step 11: filter on multiple conditions. (Note that the conditions are separated by commas; a row is kept only if all of them are true.)

filter(cran, r_version == "3.1.1", country == "US")

Step 12: filter rows corresponding to users in “IN” running an R version that is less than or equal to “3.0.2”. (r_version is a character column, so <= compares the values as strings, not as version numbers.)

filter(cran, country == "IN", r_version <= "3.0.2")

Step 13: filter rows corresponding to users in “US” or “IN”.

filter(cran, country == "US" | country == "IN")

Step 14: filter rows for which size is strictly greater than 100500 and r_os equals “linux-gnu”

filter(cran, size > 100500, r_os == "linux-gnu")

Step 15: filter the rows for which the r_version is not missing.

filter(cran, !is.na(r_version))
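
A small sketch of how these conditions behave, on a made-up table (`toy` and its values are hypothetical). The last call is an addition that is not part of the steps above: it compares versions with base R's `numeric_version()` instead of as strings, which matters once a version such as "3.0.10" appears.

```r
library(dplyr)

# Hypothetical data; one country is missing on purpose
toy <- data.frame(
  package   = c("swirl", "dplyr", "swirl", "tidyr"),
  country   = c("US", "IN", "IN", NA),
  r_version = c("3.1.1", "3.0.2", "3.0.10", "3.0.1"),
  stringsAsFactors = FALSE
)

# Comma-separated conditions must all hold (logical AND);
# the comparison on r_version is a string comparison
in_old <- filter(toy, country == "IN", r_version <= "3.0.2")

# %in% is a compact alternative to chaining == with |
us_or_in <- filter(toy, country %in% c("US", "IN"))

# Comparing as version numbers: 3.0.10 is newer than 3.0.2
in_old_v <- filter(toy, country == "IN",
                   numeric_version(r_version) <= numeric_version("3.0.2"))

in_old
us_or_in
in_old_v
```

Rows where a condition evaluates to `NA` are dropped, so the row with the missing country never appears in any result. Also note that `numeric_version()` raises an error on missing values by default, so on real data the `NA` versions would have to be filtered out first.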

Step 16: Sometimes we want to reorder the rows of a dataset according to the values of a particular variable. Reorder cran2 so that ip_id is in ascending order.

cran2 <- select(cran, size:ip_id)
arrange(cran2, ip_id)

Step 17: reorder cran2 so that ip_id is in descending order.

arrange(cran2, desc(ip_id))

Step 18: reorder cran2 using multiple variables.

arrange(cran2, package, ip_id)

Step 19: reorder cran2 using this order: country (ascending), r_version (descending), ip_id (ascending).

arrange(cran2, country, desc(r_version), ip_id)
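
To see how the sort keys interact, here is a minimal sketch on four made-up rows (`toy` is hypothetical). Later keys only break ties left by earlier ones.

```r
library(dplyr)

# Hypothetical data, deliberately out of order
toy <- data.frame(
  country   = c("US", "IN", "US", "IN"),
  r_version = c("3.0.2", "3.1.1", "3.1.1", "3.1.1"),
  ip_id     = c(4L, 2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# country ascending, then r_version descending, then ip_id ascending
sorted <- arrange(toy, country, desc(r_version), ip_id)
sorted
```

The two "IN" rows share the same r_version, so ip_id decides their order; the two "US" rows differ in r_version, so ip_id is never consulted for them.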

Step 20: add a column called size_mb that contains the download size in megabytes.

cran3 <- select(cran, ip_id, package, size)
cran3
mutate(cran3, size_mb = size / 2^20)

Step 21: One very nice feature of mutate() is that you can use the value computed for your second column (size_mb) to create a third column (size_gb).

mutate(cran3, size_mb = size/2^20, size_gb = size_mb/2^10)

Step 22: add a new variable “correct_size = size+1000”

mutate(cran3, correct_size = size + 1000)
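
One point that is easy to miss: `mutate()` returns a new table and leaves its input untouched, so the result has to be assigned if it is needed later. A minimal sketch on made-up sizes (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical data: sizes chosen as exact multiples of 2^20 bytes
toy <- data.frame(
  package = c("swirl", "dplyr"),
  size    = c(1048576, 2097152),
  stringsAsFactors = FALSE
)

# size_gb is computed from size_mb, which is created in the same call
with_units <- mutate(toy, size_mb = size / 2^20, size_gb = size_mb / 2^10)

names(toy)         # still only package and size
names(with_units)  # package, size, size_mb, size_gb
```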

Step 23: summarize() collapses the dataset to a single row. Calculate the average download size.

summarize(cran, ave_bytes = mean(size))
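
`summarize()` can compute several statistics in one call. The sketch below uses a made-up column with one missing value (`toy` and the column names `downloads` and `max_bytes` are assumptions for illustration) to show why `na.rm = TRUE` may be needed if size ever contains `NA`.

```r
library(dplyr)

# Hypothetical data with one missing size
toy <- data.frame(size = c(100, 300, NA, 200))

# A single NA makes the plain mean NA
naive <- summarize(toy, ave_bytes = mean(size))

# Several summaries at once, ignoring missing values
stats <- summarize(
  toy,
  downloads = n(),                        # number of rows
  ave_bytes = mean(size, na.rm = TRUE),
  max_bytes = max(size, na.rm = TRUE)
)

naive
stats
```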

Grouping and Chaining with `dplyr` package

The main idea behind grouping data is that you want to break up your dataset into groups of rows based on the values of one or more variables. The group_by() function is responsible for doing this.

Step 1: group cran by the package variable and store the result in a new variable called by_package. (At the top of the printed output, you’ll see ‘Groups: package’. Everything else looks the same, but now any operation we apply to the grouped data will take place on a per-package basis.)

by_package <- group_by(cran, package)
by_package

Step 2: apply summarize() to by_package to compute mean(size) for each package.
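
A minimal sketch of a grouped summary on made-up data (the `toy` table and the `mean_size` column name are assumptions for illustration). It also shows the same computation written as a chain with the `%>%` operator that dplyr makes available.

```r
library(dplyr)

# Hypothetical download sizes for three packages
toy <- data.frame(
  package = c("swirl", "dplyr", "swirl", "dplyr", "tidyr"),
  size    = c(100, 400, 300, 600, 50),
  stringsAsFactors = FALSE
)

# Grouping first, then summarizing: one output row per package
by_pkg    <- group_by(toy, package)
pack_mean <- summarize(by_pkg, mean_size = mean(size))
pack_mean

# The same steps chained together, with the largest mean first
pack_mean_piped <- toy %>%
  group_by(package) %>%
  summarize(mean_size = mean(size)) %>%
  arrange(desc(mean_size))
pack_mean_piped
```

With chaining, each step receives the result of the previous one as its first argument, so no intermediate variable such as `by_pkg` is needed.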
