[Getting and Cleaning data] Week 3

最新推荐文章于 2021-05-15 22:31:37 发布

艳艳儿

最新推荐文章于 2021-05-15 22:31:37 发布

阅读量1.1k

点赞数

分类专栏： data science statistics R coursera 文章标签： R

本文链接：https://blog.csdn.net/COMEYAN/article/details/50893416

版权

R 同时被 3 个专栏收录

27 篇文章 2 订阅

订阅专栏

statistics

20 篇文章 1 订阅

订阅专栏

data science

16 篇文章 0 订阅

订阅专栏

Week 3

For more details, see the html file here.

Week 3

Subsetting and Sorting

Once you have loaded you data into R, what you might want to do is manipulate that data, and so you can set it to be a tidy data set: variables in the columns and observations in the rows, and only the observations that you want to be able to analysis.

Step 1: Subsetting - quich review

set.seed(13435)
X <- data.frame("var1" = sample(1:5), "var2" = sample(6:10), "var3" = sample(11:15))
X <- X[sample(1:5), ]
X$var2[c(1,3)] <- NA
X
# select the first column 
X[ ,1]
X[ ,"var1"]
X$var1
# select both column and rows
X[1:2, 2]
X[1:2, "var2"]
X[1:2, ]$var2

Step 2: Logicals ands and ors

X[(X$var1 <= 3 & X$var3 > 11), ]
X[(X$var1 <= 3 | X$var3 > 15), ]

Step 3: Dealing with missing value “NA”

X[which(X$var2 > 8), ]
X[X$var2 > 8, ]

From the second line, we can see that if the column has NAs, we cannot use logical statement to select observations, and we need to use which command.

Step 4: Sorting

sort(X$var1)
sort(X$var1, decreasing = TRUE)
# Put NA last
sort(X$var2, na.last = TRUE) 
# Put NA first
sort(X$var2, na.last = FALSE) 
# NA removes
sort(X$var2)
# order data frame by variable "var1"
X[order(X$var1), ] 
# order data frame by multiple variable
X[order(X$var1, X$var3), ]

Step 5: Ordering with plyr package

library(plyr)
arrange(X, var1) # 
arrange(X, desc(var1))

Step 6: Adding rows and columns

# one way to add column
X$var4 <- rnorm(5) 
X
# the other way to add column
Y <- cbind(X, rnorm(5))
Y

Summarizing data

Step 1: Getting the data from the website

if(!file.exists("./data")) dir.create("./data")
fileUrl <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/restaurants.csv")
restData <- read.csv("./data/restaurants.csv")

Step 2: Look at a bit of the data

# top 3 rows
head(restData, 3)
# last 3 rows
tail(restData, 3)

Step 3: Make summary

summary(restData)

Step 4: More in depth information

str(restData)

Step 5: Quantiles of quantitative variable

quantile(restData$councilDistrict, na.rm = TRUE)
quantile(restData$councilDistrict, probs = c(0.5, 0.75, 0.9))

Step 6: Make table

# useNA = "ifany" gives you the NO.of NAs, the default removing NAs
table(restData$zipCode, useNA = "ifany")
# two dimensional table
table(restData$councilDistrict, restData$zipCode)

Step 7: Check for missing values

sum(is.na(restData$councilDistrict))
any(is.na(restData$councilDistrict))
all(restData$zipCode > 0)

Step 8: Row and column sums

colSums(is.na(restData))
all(colSums(is.na(restData)) == 0)

Step 9: Values with specific charactor

table(restData$zipCode %in% c("21212"))
table(restData$zipCode %in% c("21212", "21213"))
head(restData[restData$zipCode %in% c("21212", "21213"), ], 3)

Step 10: Cross tables

data("UCBAdmissions")
DF <- as.data.frame(UCBAdmissions)
summary(DF)
# entry stands for sum of Freq
xt <- xtabs(Freq ~ Gender + Admit, data = DF)
xt
# this is different table: entry stands for NO. of cross observation
with(DF, table(Gender, Admit))

Step 11: Flat tables

warpbreaks$replicate <- rep(1:9, len = 54)
xt <- xtabs(breaks ~ ., data = warpbreaks)
xt
ftable(xt)

Step 12: Size of a data size

fakeData <- rnorm(1e5)
object.size(fakeData)
print(object.size(fakeData), units = "MB")

Creating new variables

Why create new variables?

Ofen the raw data won’t have a value you are looking for.
You will need to transform the data to get the values you would like.
Usually you will add those values to the data frame you are working with.
Common variables to create:
- Missingness indicators.
- “Cutting up” quantitative variabels.
- Applying transforms
Step 1: Getting data from website

if(!file.exists("./data")) dir.create("./data")
fileUrl <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/restaurants.csv")
restData <- read.csv("./data/restaurants.csv")

Step 2: Creating sequences

# creating sequence by 
s1 <- seq(1, 10, by = 2)
s1
# creasting sequence length
s2 <- seq(1, 10, length = 3)
s2
# creating sequence base on a vactor
x <- c(1, 3, 8, 25, 100)
seq(along = x)

Step 3: Sebsetting variable

restData$nearMe <- restData$neighborhood %in% c("Roland Park", "Homeland")
table(restData$nearMe)

Step 4: Creating binary variable

restData$zipWrong <- ifelse(restData$zipCode < 0, TRUE, FALSE)
table(restData$zipWrong, restData$zipCode<0)

Step 5: Creating categorical variables

restData$zipGroups <- cut(restData$zipCode, breaks = quantile(restData$zipCode))
table(restData$zipGroups)
table(restData$zipGroups, restData$zipCode)

Step 6: Easier cutting

library(Hmisc)
restData$zipGroups <- cut2(restData$zipCode, g = 4)
table(restData$zipGroups)

Step 7: Creating factor variables

restData$zcf <- factor(restData$zipCode)
#restData$zcf
class(restData$zcf)

Step 8: Levels of factor variables

yesno <- sample(c("yes", "no"), size = 10, replace = TRUE)
yesnofac <- factor(yesno, levels = c("yes", "no"))
yesnofac
relevel(yesnofac, ref = "no")
as.numeric(yesnofac) # name the ref level be 1, the other be 2
as.numeric(relevel(yesnofac, ref = "no"))

Step 9: Cutting produces factor variables

class(restData$zipGroups)

Step 10: Using the mutate function

library(Hmisc)  # cut2
library(plyr)  # mutate
restData2 <- mutate(restData, zipGroups = cut2(zipCode, g = 4))
table(restData2$zipGroups)

Step 11: Common transforms
- abs(x) absolute value
- sqrt(x) square root
- ceiling(x) ceiling(3.14) = 4
- floor(x) floor(3.14) = 3
- round(x, digits = n) round(3.1415926, digits = 2) = 3.14
- signif(x, digits = n) signif(3.1415926, digits = 2) = 3.1
- cos(x) or sin(x) etc.
- log(x) natural logarithm
- log2(x) or log10(x) other common logs.
- exp(x) exponentiating x

Reshaping data

The goal is tidy data:
- Each variable forms a column
- Each observation forms a row
- Each table/file stores data about one kind of observation.

Step 1: Start with reshaping

library(reshape2)
head(mtcars)

Step 2: Melting data frames

mtcars$carname <- rownames(mtcars)
# make multiple columns to one variable
carMelt <- melt(mtcars, id = c("carname", "gear", "cyl"), measure.vars = c("mpg", "hp") )
# carMelt: a tall and skinny data set
head(carMelt, 3)
tail(carMelt, 3)

Step 3: Casting data frame

# work on the melt data set
clyData <- dcast(carMelt, cyl ~ variable)
cylData <- dcast(carMelt, cyl ~ variable, mean)

Step 4: Averaging values

head(InsectSprays)
tapply(InsectSprays$count, InsectSprays$spray, sum)

Step 5: Another way: split

spIns <- split(InsectSprays$count, InsectSprays$spray)
sprCount <- lapply(spIns, sum)
sprCount

Step 6: Another way - combine

unlist(sprCount)
sprCount <- sapply(spIns, sum)
sprCount

Step 9: Another way - plyr package

library(plyr)
## ddply: Split data frame, apply function, and return results in a data frame.
#ddply(InsectSprays, .(spray), summarize, sum = sum(count))
# Error in .fun(piece, ...) : argument "by" is missing, with no default
ddply(InsectSprays, .(spray), plyr::summarize, sum = sum(count))

Step 10: Creating a new variable

spraySums <- ddply(InsectSprays, .(spray), plyr::summarise, sum = ave(count, FUN = sum))

Managing data frame with `dplyr` package – Introduction

The data frame is a key data structure in statistics and in R.

There is one observarion per row.
Each column represents a variable or measure or characteristic.
Primary implementation that you will use is the default R implementation.
Other implementations, particularly relational databases systems.

dplyr

Developed by hadley Wickham of Rstudio.
An optimized and distilled version of plyr package.
Does not provide any “new” functionality per se, but greatly simplifies existing functionality in R.
Provides a “grammar” (in particular verbs) for data manipulation.
Is very fast, as many key operation are coded in C++.

dplyr Verbs:

select: return a subsets of columns of a data frame.
filter: extracts a subsets of rows of a data frame based on a logical condition.
arrange: reorder rows of a data frame.
rename: rename variables in a data frame.
mutate: add new variable/columns or transforming existing variables.
summarize/ summarise: generate summary statistics of different variables in the data frame, possible within strata.

There is also a handy print method that prevents you from printing a lot of data to the console.

Managing data frames with `dplyr` package – Basic tools

Step 1: download data and load data into R

library(dplyr)
options(width = 105)
# download data and load data into R
fileUrl <- "https://raw.github.com/DataScienceSpecialization/courses/master/03_GettingData/dplyr/chicago.rds"
if(!file.exists("./data")) dir.create("./data")
download.file(fileUrl, destfile = "./data/chicago.rds", mode = "wb")
chicago <- readRDS("./data/chicago.Rds")

Step 2: Summary statistics

dim(chicago)
str(chicago)
names(chicago)

Step 3: Select columns using select command.

head(select(chicago, city:dptp))
head(select(chicago, -(city:dptp)))
i <- match("city", names(chicago))
j <- match("dptp", names(chicago))
head(chicago[, -(i:j)])

Step 4: Select rows using filter command.

chic.f <- filter(chicago, pm25tmean2 > 30)
head(chic.f, 3)
chic.f <- filter(chicago, pm25tmean2 >30 & tmpd > 80)
head(chic.f, 3)
chicago <- arrange(chicago, date)
head(chicago, 2)
tail(chicago, 2)

Step 5: Reorder data frame using arrange command.

chicago <- arrange(chicago, desc(date))
head(chicago, 2)
tail(chicago, 2)

Step 6: Rename variables using rename command.

chicago <- rename(chicago, pm25 = pm25tmean2, dewponit = dptp)
head(chicago, 3)

Step 7: Adding new variable using mutate command.

chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(select(chicago, pm25, pm25detrend), 3)
chicago <- mutate(chicago, tempcat = factor(1*(tmpd >80), labels = c("cold", "hot")))

Step 8: Summarize grouped data using summarize command.

hotcold <- group_by(chicago, tempcat)
head(hotcold)
summarize(hotcold, pm25 = mean(pm25), o3 = max(o3tmean2), no2 = median(no2tmean2))
summarize(hotcold, pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2), no2 = median(no2tmean2))
chicago <- mutate(chicago, year = as.POSIXlt(date)$year+1900)
years <- group_by(chicago, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2), no2 = median(no2tmean2))

Step 9: Making chain command using %>% command.

chicago %>% mutate(month = as.POSIXlt(date)$mon + 1) %>% group_by(month) %>% summarize(pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2), mo2 = median(no2tmean2))
# =
summarize(group_by(mutate(chicago, month = as.POSIXlt(date)$mon+1), month), pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2), mo2 = median(no2tmean2))

Merging data

Step 1: Peer review data

if(!file.exists("./data")) dir.create("./data")
# download data set
fileUrl1 <- "https://dl.dropbox.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropbox.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile = "./data/reviews.csv")
download.file(fileUrl2, destfile = "./data/solution.csv")
# load data set
reviews <- read.csv("./data/reviews.csv")
solutions <- read.csv("./data/solution.csv")
# view data set
head(reviews, 2)
head(solutions, 2)

Step 2: Marging data
- Merges data frames
- Important paraeters: x, y, by, by.x, by.y, all

names(reviews)
names(solutions)
mergeData <- merge(reviews, solutions, by.x = "solution_id", by.y = "id", all = TRUE)
head(mergeData)

Step 3: Default - merge all common column names

intersect(names(solutions), names(reviews))
mergeData2 <- merge(reviews, solutions, all = TRUE)
head(mergeData2, 2)

Step 4: Using join in the plyr package.

df1 <- data.frame(id = sample(1:10), x = rnorm(10))
df2 <- data.frame(id = sample(1:10), y = rnorm(10))
arrange(join(df1, df2), id)

Step 5: If you have multiple data frames

df3 <- data.frame(id = sample(1:10), z = rnorm(10))
dfList = list(df1, df2, df3)
join_all(dfList)

艳艳儿

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录

[Getting and Cleaning data] Week 3

Week 3

Subsetting and Sorting

Summarizing data

Creating new variables

Reshaping data

Managing data frame with dplyr package – Introduction

Managing data frames with dplyr package – Basic tools

Merging data

Managing data frame with `dplyr` package – Introduction

Managing data frames with `dplyr` package – Basic tools