[Getting and Cleaning data] Week 4

最新推荐文章于 2021-05-15 22:31:37 发布

艳艳儿

最新推荐文章于 2021-05-15 22:31:37 发布

阅读量959

点赞数

分类专栏： data science statistics R coursera 文章标签： r

本文链接：https://blog.csdn.net/COMEYAN/article/details/50910729

版权

R 同时被 3 个专栏收录

27 篇文章 2 订阅

订阅专栏

statistics

20 篇文章 1 订阅

订阅专栏

data science

16 篇文章 0 订阅

订阅专栏

Week 4

More details could be found in the html file here

Week 4

Editing text variables

Important points about text in data set

Names of variables should be
- All lower cases when possible
- Descriptive (Diagnosis versus Dx)
- Not duplicated
- Not have underscores or dots or white spaces
Variables with caracter values
- Should usually be made into factor variables(depend on application)
- Should be descriptive(use TRUE/FALSE instead of 0/1 and Male/Femal versus 0/2 or M/F)
Step 1: Fixing charactre vectors topupper and tolower functions.

if(!file.exists("./data")) dir.create("./data")
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/cameras.csv")
cameraData <- read.csv("./data/cameras.csv")
names(cameraData)
tolower(names(cameraData))

Step 2: Fixing character vectors strsplit function.
- Good for automatically splitting variable names.
- Important paramters:x and split

splitNames <- strsplit(names(cameraData), "\\.")
splitNames[[5]]
splitNames[[6]]

Step 3: Quick aside lists

myList <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:15, 5))
head(myList)

Step 4: Fixing character vectors sapply
- Applies a function to each element in a vector or list.
- Implortant parameted: x Fun

splitNames[[6]][1]
firstElement <- function(x) x[1]
sapply(splitNames, firstElement)

Step 5: Peer review data

if(!file.exists("./data")) dir.create("./data")
# download data set
fileUrl1 <- "https://dl.dropbox.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropbox.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile = "./data/reviews.csv")
download.file(fileUrl2, destfile = "./data/solution.csv")
# load data set
reviews <- read.csv("./data/reviews.csv")
solutions <- read.csv("./data/solution.csv")
# view data set
head(reviews, 2)
head(solutions, 2)

Step 6: Fixing character vectors sub()(replace the first match)

names(reviews)
sub("_", "", names(reviews))

Step 7: Fixing character vectors gsub() (replace globally)

testName <- "this_is_a_test"
sub("_", "", testName)
gsub("_", "", testName)

Step 8: Find values grep() and grepl() functions

grep("Alameda", cameraData$intersection) # return index
table(grepl("Alameda", cameraData$intersection)) # return true or false
cameraData2 <- cameraData[!grepl("Alameda", cameraData$intersection), ]

Step 9: More on grep()

grep("Alameda", cameraData$intersection, value = TRUE) # retrun names containing "Aladema"
grep("JeffStreet", cameraData$intersection)
length(grep("JeffStreet", cameraData$intersection))

Step 10: More useful string functions

library(stringr)
nchar("Jeffrey Leek")
substr("jeffrey Leek", 1, 7)
paste("Jeffrey", "Leek")
paste0("Jeffrey", "Leek")
str_trim("Jeff    ")

Regular expressions

Regular expressions:

A ‘regular expression’ is a pattern that describes a set of strings. Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions used by perl = TRUE. There is a also fixed = TRUE which can be considered to use a literal regular expression.
Here we cansider the extended regular expressions used in grep, grepl, regexpr, gregexpr, sub, gsub and strsplit.
Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.
Positions
- 1: ^ matches the begining.
- 2: $ matches the end.
- 3: \b matches the empty string at either edge of a word.
- 4: \B matches the empty string provided it is not at an edge of a word.
Quantifiers
- 1: * matches at least 0 times.
- 2: + matches at least 1 times.
- 3: ? matches at most 1 times.
- 4: {m} matches exactly m times.
- 5: {m.} matches at least m times.
- 6: {n, m} matches between n to m times.
Others:
- 1: [ ] matches any character appearing in []. ex: [a-z]
- 2: [^ ] matches any character not appearing in [ ].
- 3: . matches any character.
- 4: | matches alternative metacharacters.
- 5: \ suppress the special meaning of metacharacters in regular expression.
- 6: () groups expression.
Character classes:
- 1: [:digit:] or \d equivalent to [0-9].
- 2: [:lower:] equivalent to [a-z].
- 3: [:upper:] equivalent to [A-Z].
- 4: [:alpha:] equivalent to [a-zA-Z] or [[:lower:][:upper:]].
- 5: [:alnum:] equivalent to [A-z0-9] or [[:digit:][:alpha:]].
- 6: \w equivalent to [[:apnum]_] or [A-z0-9_].
- 7: \W equivalent [^A-z0-9].
- 8: [:xdigit:] matches 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.
- 9: [:blank:] matches space or tab.
- 10: [:space:] marches tab, newline, vertical tab, form feed, carriage return, space.
- 11: \s space ” “.
- 12: \S not space.
- 13: [:punct] matches ! " # $ % & ’ ( ) * + , - . / : ; < = > ? @ [ ] ^ _ ` { | } ~.
- 14: [:graph:] equivalent to [[:alnum:][:punct:]].
- 15: [:print:] equivalent to [[:alnum:][:punct:]\\s].
- 16: [:cntrl:] control characters, like \n or \r, [\x00-\x1F\x7F].

R function summary:

1: Identify match to a pattern: grep(..., value = FALSE), grepl(), stringr::str_detect().
2: Extract match to a pattern: grep(..., value = TRUE), stringr::str_extract(), stringr::str_extract_all().
3: Locate pattern within a string, i.e. give the start position of matched patterns. regexpr(), gregexpr(), stringr::str_locate(), string::str_locate_all().
4: Replace a pattern: sub(), gsub(), stringr::str_replace(), stringr::str_replace_all().
5: Split a string using a pattern: strsplit(), stringr::str_split().

Working with dates

Step 1: Starting simple. date() returns a character that gives you the date and time.

d1 <- date()
d1
class(d1)

Step 2: Data class.

d2 <- Sys.Date()
d2
class(d2)

Step 3: Formatting dates.
- %d = days as number(0-31).
- %a = abbreviated weekday.
- %A = unabbreviated weekday.
- %m = month(00-12).
- %b = abbreviated month.
- %B = unabbreviated month.
- %y = 2 digit year.
- %Y = 4 digit year.

format(d2, "%a %b %d")

Step 4: Creating dates.

# if returns NA, please use
lct <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
x <- c("1jan1960", "2jan1960", "31mar1960", "30Jul1960")
z <- as.Date(x, "%d%b%Y")
z
z[1] - z[2]
as.numeric(z[1] - z[2])

Step 5: Converting to Julian.

weekdays(d2)
months(d2)
julian(d2)

Step 6: lubridate package.

library(lubridate)
ymd("20140108")
mdy("08/04/2013")
dmy("03-04-2013")

Step 7: Dealing with time.

ymd_hms("2011-08-03 10:15:03")
ymd_hms("2011-08-03 10:15:03", tz = "Pacific/Auckland")

Step 8: Some functions have slightly different syntax.

x <- dmy(c("1jan2013", "2jan2013", "31mar2013", "30Jul2013"))
wday(x[1])
wday(x[1], label = TRUE)
ymd("1989 May 17")
mdy("March 12 1975")
dmy(25081985)
ymd("1920/1/2")
ymd_hms(now())
hms("03:22:14")

Step 9: Dealing with vector of dates.

dt2 <- c("2014-05-14", "2014-09-22", "2014-07-11")
ymd(dt2)

艳艳儿

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
[Getting and Cleaning data] Week 4

Week 4Editing text variablesRegular expressionsWorking with datesMore details could be found in the html file here Week 4Editing text variablesImportant points about text in data setNames of variab
复制链接

扫一扫