[Getting and Cleaning data] swirl

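This article introduces data manipulation with the R package `dplyr`, covering tasks such as loading, selecting, filtering, sorting, and rearranging data, and shows how to bring data into tidy format. With `dplyr`, data frames, data tables, and databases can all be handled conveniently. The article also discusses five different kinds of messy-data problems and how to solve them, and uses the `lubridate` package to work with dates and times.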

More details can be found in the html file here.

Manipulating Data with dplyr Package

dplyr is a fast and powerful R package written by Hadley Wickham and Romain Francois. The dplyr philosophy is to have small functions that each do one thing well.

One unique aspect of dplyr is that the same set of tools allows you to work with tabular data from a variety of sources, including

  • data frames
  • data tables
  • databases
  • multidimensional arrays

  • Step 1: download the data and load it into R.

if(!file.exists("data")) dir.create("data")
fileUrl <- "http://cran-logs.rstudio.com/2014/2014-07-08.csv.gz"
download.file(fileUrl, "./data/path2csv.csv")
mydf <- read.csv("./data/path2csv.csv", stringsAsFactors = FALSE)
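Step 1 saves a gzip-compressed download under a name ending in .csv and then reads it with read.csv(). This works because read.csv() opens a file path in a way that detects gzip compression from the file contents rather than from its name. A minimal sketch of that behavior, using a temporary file and made-up rows rather than the real log:

```r
# Write a tiny gzip-compressed CSV to a temporary file whose name ends in .csv
# (the two data rows are invented for illustration)
tmp <- tempfile(fileext = ".csv")
con <- gzfile(tmp, "w")
writeLines(c("package,size", "swirl,1024", "dplyr,2048"), con)
close(con)

# read.csv() detects the compression even though the name says .csv
toy <- read.csv(tmp, stringsAsFactors = FALSE)
toy

# Clean up the temporary file
unlink(tmp)
```

Keeping the .gz suffix in the destination file name would arguably make the file's real format clearer, but it is not required for reading.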
  • Step 2: inspect the dimensions and the first few rows.
dim(mydf)
head(mydf)
  • Step 3: load the dplyr package.
library(dplyr)
# check version: you need to have version 0.4.0 or later
packageVersion("dplyr")

From now on, we will focus on how to manipulate data with the dplyr package.

  • Step 4: The first step of working with data in dplyr is to load the data into what the package authors call a “data frame tbl” or “tbl_df”.
# note: tbl_df() is deprecated as of dplyr 1.0.0; as_tibble(mydf) is the current equivalent
cran <- tbl_df(mydf)
  • Step 5: The main advantage of using a tbl_df over a regular data frame is the printing. The output of a tbl_df is much more informative and compact than what we would get if we printed the original data frame (mydf) to the console. (dplyr shows us the first 10 rows of data and only as many columns as fit neatly in our console. At the bottom, we see the names and classes of any variables that didn’t fit on our screen.)
cran
head(mydf)  # take `head` function to avoid too large data set printing 
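To see the difference between the two kinds of object without downloading the log, here is a small sketch on a made-up data frame. The name toy_df and its values are illustrative only, and as_tibble() is used in place of tbl_df() on the assumption that a recent dplyr version is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for mydf
toy_df <- data.frame(
  package = c("swirl", "dplyr", "ggplot2"),
  size    = c(1024, 2048, 4096),
  stringsAsFactors = FALSE
)

# Convert to a data frame tbl
toy_tbl <- as_tibble(toy_df)

class(toy_df)   # a plain data frame
class(toy_tbl)  # additionally carries the "tbl_df" and "tbl" classes
toy_tbl         # printing also shows the dimensions and each column's type
```

The converted object is still a data frame, so functions that expect a data frame continue to work on it.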

Then we will focus on the five manipulation tasks:

- `select()`
- `filter()`
- `arrange()`
- `mutate()`
- `summarize()`
  • Step 6: select three variables of cran. (In dplyr there is no need to write cran$ip_id; the bare column name is enough.)
select(cran, ip_id, package, country)
  • Step 7: select a sequence of columns.
select(cran, r_arch:country)
  • Step 8: throw away one column. (The negative sign in front of time tells select() that we don’t want the time column.)
select(cran, -time)
  • Step 9: throw away multiple columns.
select(cran, -(date:size))
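The four select() patterns above can be tried on a small made-up data frame. The column order below is an assumption chosen to be consistent with the ranges used in the steps (r_arch:country, date:size), and all values are invented.

```r
library(dplyr)

# Made-up rows; the column layout is assumed, the values are illustrative
toy <- data.frame(
  date      = c("2014-07-08", "2014-07-08"),
  time      = c("00:00:01", "00:00:02"),
  size      = c(1000, 2000),
  r_version = c("3.1.0", "3.0.2"),
  r_arch    = c("x86_64", "i386"),
  r_os      = c("mingw32", "linux-gnu"),
  package   = c("swirl", "dplyr"),
  version   = c("2.2.9", "0.2"),
  country   = c("US", "IN"),
  ip_id     = c(1, 2),
  stringsAsFactors = FALSE
)

picked    <- select(toy, ip_id, package, country)  # columns come back in the order requested
col_range <- select(toy, r_arch:country)           # every column from r_arch through country
no_time   <- select(toy, -time)                    # everything except time
dropped   <- select(toy, -(date:size))             # everything except date through size

names(picked)
names(col_range)
names(dropped)
```

Note that select() returns a new object each time; toy itself is left unchanged.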
  • Step 10: use filter function to select all rows for which the package variable is equal to “swirl”.
filter(cran, package == "swirl")
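  • Step 11: filter on multiple conditions. (Note that the conditions are separated by commas, which means a row must satisfy all of them.)
filter(cran, r_version == "3.1.1", country == "US")
  • Step 12: filter rows corresponding to users in “IN” running an R version that is less than or equal to “3.0.2”. (r_version is stored as text, so this comparison is alphabetical rather than numeric.)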
filter(cran, country == "IN", r_version <= "3.0.2")
  • Step 13: filter rows corresponding to users in “US” or “IN”.
filter(cran, country == "US" | country == "IN")
  • Step 14: filter rows for which size is strictly greater than 100500 and r_os equals “linux-gnu”
filter(cran, size > 100500, r_os == "linux-gnu")
  • Step 15: filter the rows for which the r_version is not missing.
filter(cran, !is.na(r_version))
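A short sketch of how these filter() conditions behave, again on made-up rows. The version string "3.10.0" is hypothetical and is included only to show where comparing version numbers as text can give a surprising answer.

```r
library(dplyr)

# Made-up rows, including one with missing values
toy <- data.frame(
  package   = c("swirl", "swirl", "dplyr", "ggplot2"),
  country   = c("US", "IN", "US", NA),
  r_version = c("3.1.1", "3.0.2", "3.10.0", NA),
  size      = c(100000, 200000, 300000, 400000),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND
us_swirl <- filter(toy, package == "swirl", country == "US")

# | means OR; %in% is a shorthand when matching several values
us_or_in  <- filter(toy, country == "US" | country == "IN")
us_or_in2 <- filter(toy, country %in% c("US", "IN"))

# Rows where the condition evaluates to NA are dropped, so the NA row is gone
nrow(us_or_in)

# Text comparison versus version-aware comparison
as_text    <- "3.10.0" <= "3.2.0"
as_version <- numeric_version("3.10.0") <= numeric_version("3.2.0")
c(as_text, as_version)
```

For the versions that appear in Step 12 the text comparison happens to be adequate; numeric_version() from base R is one option if two-digit components ever show up.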
  • Step 16: Sometimes we want to reorder the rows of a dataset according to the values of a particular variable. Reorder cran2 so that ip_id is in ascending order.
cran2 <- select(cran, size:ip_id)
arrange(cran2, ip_id)
  • Step 17: reorder cran2 so that ip_id is in descending order.
arrange(cran2, desc(ip_id))
  • Step 18: reorder cran2 using multiple variables.
arrange(cran2, package, ip_id)
  • Step 19: reorder cran2 using this order: country(ascending), r_version(descending), ip_id(ascending).
arrange(cran2, country, desc(r_version), ip_id)
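The effect of sorting on several keys is easiest to follow on a handful of rows. The sketch below uses invented values; later keys only break ties left by earlier ones.

```r
library(dplyr)

# Made-up rows, deliberately out of order
toy <- data.frame(
  country   = c("US", "IN", "US", "IN"),
  r_version = c("3.0.2", "3.1.1", "3.1.1", "3.0.2"),
  ip_id     = c(4, 3, 2, 1),
  stringsAsFactors = FALSE
)

asc   <- arrange(toy, ip_id)        # ascending is the default
dsc   <- arrange(toy, desc(ip_id))  # desc() reverses one key

# country ascending, then r_version descending within each country,
# then ip_id ascending to break any remaining ties
multi <- arrange(toy, country, desc(r_version), ip_id)
multi
```

In multi, both "IN" rows come first, and within each country the row with the later r_version string precedes the earlier one.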
  • Step 20: add a column called size_mb that contains the download size in megabytes.
cran3 <- select(cran, ip_id, package, size)
cran3
mutate(cran3, size_mb = size / 2^20)
  • Step 21: One very nice feature of mutate() is that you can use a column created earlier in the same call (size_mb) to create another new column (size_gb).
mutate(cran3, size_mb = size/2^20, size_gb = size_mb/2^10)
  • Step 22: add a new variable “correct_size = size+1000”
mutate(cran3, correct_size = size + 1000)
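As a worked check of the unit conversion, the sketch below uses made-up sizes of exactly 1 MiB and 2 MiB, so the expected megabyte values are easy to verify by hand (2^20 bytes per megabyte, 2^10 megabytes per gigabyte).

```r
library(dplyr)

# Made-up sizes: 2^20 bytes and 2^21 bytes
toy <- data.frame(
  package = c("swirl", "dplyr"),
  size    = c(1048576, 2097152),
  stringsAsFactors = FALSE
)

# size_gb is computed from size_mb, which is created earlier in the same call
res <- mutate(toy, size_mb = size / 2^20, size_gb = size_mb / 2^10)
res

# mutate() returns a new object; toy itself has no size_mb column
names(toy)
```

To keep the new columns, the result has to be assigned, as with res above.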
  • Step 23: summarize() collapses the dataset to a single row. Calculate the average download size.
summarize(cran, ave_bytes = mean(size))
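summarize() can compute several statistics at once, and a single missing value turns a plain mean() into NA. Whether the real size column contains missing values is not stated here, so the NA below is purely an assumption for illustration.

```r
library(dplyr)

# Made-up sizes with one missing value
toy <- data.frame(size = c(100, 200, NA, 300))

# One row out, regardless of how many rows go in
with_na <- summarize(toy, ave_bytes = mean(size))

# na.rm = TRUE skips missing values; n() counts all rows, including the NA row
no_na <- summarize(toy, ave_bytes = mean(size, na.rm = TRUE), n_rows = n())

with_na
no_na
```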

Grouping and Chaining with dplyr package

The main idea behind grouping data is that you want to break up your dataset into groups of rows based on the values of one or more variables. The group_by() function is responsible for doing this.

  • Step 1: group cran by the package variable and store the result in a new variable called by_package. (At the top of the output, you’ll see ‘Groups: package’. Everything else looks the same, but now any operation we apply to the grouped data will take place on a per-package basis.)
by_package <- group_by(cran, package)
by_package
  • Step 2: apply summarize() to by_package to compute the mean of size for each package.
summarize(by_package, mean(size))
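A minimal sketch of grouped summarizing on made-up rows shows what "on a per-package basis" means in practice: the result has one row per group rather than a single row. The names by_pkg, mean_size, and count are illustrative.

```r
library(dplyr)

# Made-up download records for two packages
toy <- data.frame(
  package = c("swirl", "swirl", "dplyr", "dplyr", "dplyr"),
  size    = c(100, 300, 10, 20, 30),
  stringsAsFactors = FALSE
)

by_pkg <- group_by(toy, package)

# mean(size) and n() are evaluated separately within each package
pkg_means <- summarize(by_pkg, mean_size = mean(size), count = n())
pkg_means
```

Here dplyr averages 10, 20, and 30 to 20 over three rows, and swirl averages 100 and 300 to 200 over two rows.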