4.1 Editing text variables
4.2 Regular expressions
4.3 Working with dates, data resources
4.1 Editing text variables
Fixing character vectors - tolower(), toupper()
[1] "address" "direction" "street" "crossStreet" "intersection" "Location.1"
[1] "address" "direction" "street" "crossstreet" "intersection" "location.1"
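The calls that produced the two outputs above are not shown. A minimal sketch, assuming the names live in a plain character vector (`var_names` is a hypothetical stand-in for the names of the original data frame):

```r
# Hypothetical vector of variable names, copied from the first output above
var_names <- c("address", "direction", "street", "crossStreet",
               "intersection", "Location.1")

lower_names <- tolower(var_names)  # every letter converted to lower case
upper_names <- toupper(var_names)  # every letter converted to upper case

print(lower_names)
```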
Fixing character vectors - strsplit()
- Good for automatically splitting variable names
- Important parameters: x, split
[1] "intersection"
[1] "Location" "1"
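A sketch of how the two outputs above might be obtained, assuming the same hypothetical `var_names` vector. The period is escaped because `split` is treated as a regular expression by default:

```r
# Hypothetical vector of variable names (same values as in the earlier output)
var_names <- c("address", "direction", "street", "crossStreet",
               "intersection", "Location.1")

# Split each name at a literal period; the result is a list,
# one character vector per input element
split_names <- strsplit(var_names, "\\.")

print(split_names[[5]])  # name without a period stays in one piece
print(split_names[[6]])  # name with a period is split in two
```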
Quick aside - lists
$letters
[1] "A" "b" "c"
$numbers
[1] 1 2 3
[[3]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf
Quick aside - lists
$letters
[1] "A" "b" "c"
[1] "A" "b" "c"
[1] "A" "b" "c"
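The list printed in this aside can be rebuilt as below. This is a sketch; the name `mylist` is an assumption, and the three access forms are one plausible reading of the three outputs above:

```r
# A list with two named elements and one unnamed 5 x 5 matrix
mylist <- list(letters = c("A", "b", "c"),
               numbers = 1:3,
               matrix(1:25, ncol = 5))

sub_list <- mylist[1]        # single brackets: a list of length one
by_name  <- mylist$letters   # dollar sign: the element itself, by name
by_index <- mylist[[1]]      # double brackets: the element itself, by position
```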
Fixing character vectors - sapply()
- Applies a function to each element in a vector or list
- Important parameters: X, FUN
[1] "Location"
[1] "address" "direction" "street" "crossStreet" "intersection" "Location"
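A sketch of the step that seems to lie behind the outputs above: take the first piece of each split name. The helper `first_element` and the vector `var_names` are hypothetical names used for illustration:

```r
# Hypothetical vector of variable names
var_names <- c("address", "direction", "street", "crossStreet",
               "intersection", "Location.1")
split_names <- strsplit(var_names, "\\.")

# Keep only the first piece of each split name
first_element <- function(x) x[1]
clean_names <- sapply(split_names, first_element)

print(clean_names)
```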
Peer review data
  id solution_id reviewer_id      start       stop time_left accept
1  1           3          27 1304095698 1304095758      1754      1
2  2           4          22 1304095188 1304095206      2306      1
  id problem_id subject_id      start       stop time_left answer
1  1        156         29 1304095119 1304095169      2343      B
2  2        269         25 1304095119 1304095183      2329      C
Fixing character vectors - sub()
- Important parameters: pattern, replacement, x
[1] "id" "solution_id" "reviewer_id" "start" "stop" "time_left"
[7] "accept"
[1] "id" "solutionid" "reviewerid" "start" "stop" "timeleft" "accept"
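A sketch of the replacement shown above, assuming the column names are held in a character vector (`review_names` is a hypothetical name):

```r
# Column names copied from the output above
review_names <- c("id", "solution_id", "reviewer_id", "start",
                  "stop", "time_left", "accept")

# sub() replaces only the first match of the pattern in each element
no_underscore <- sub("_", "", review_names)

print(no_underscore)
```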
Fixing character vectors - gsub()
[1] "thisis_a_test"
[1] "thisisatest"
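The two outputs above are consistent with applying `sub()` and then `gsub()` to the same string. A sketch, assuming the input was "this_is_a_test":

```r
# Assumed input string with several underscores
test_name <- "this_is_a_test"

first_only <- sub("_", "", test_name)   # removes the first underscore only
all_gone   <- gsub("_", "", test_name)  # removes every underscore

print(first_only)
print(all_gone)
```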
Finding values - grep(), grepl()
[1] 4 5 36
FALSE TRUE
77 3
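A sketch of the difference between the two functions. The vector `intersections` is made up for illustration (only two of its entries come from the slides), so the positions and counts differ from the output above:

```r
# Made-up intersection names for illustration
intersections <- c("Caton Ave & Benson Ave",
                   "The Alameda & 33rd St",
                   "E 33rd & The Alameda",
                   "Pulaski Hwy & Moravia Park Dr")

positions  <- grep("Alameda", intersections)   # integer positions of matches
is_alameda <- grepl("Alameda", intersections)  # one TRUE/FALSE per element

print(table(is_alameda))                   # count matches and non-matches
not_alameda <- intersections[!is_alameda]  # logical vector used for subsetting
```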
More on grep()
[1] "The Alameda & 33rd St" "E 33rd & The Alameda" "Harford \n & The Alameda"
integer(0)
[1] 0
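The outputs above look like the result of `grep()` with `value = TRUE`, followed by a search for a pattern that is absent. A sketch under that assumption, again with a made-up vector and a made-up missing pattern ("JeffStreet"):

```r
# Made-up intersection names for illustration
intersections <- c("Caton Ave & Benson Ave",
                   "The Alameda & 33rd St",
                   "E 33rd & The Alameda")

# value = TRUE returns the matching strings instead of their positions
hits <- grep("Alameda", intersections, value = TRUE)

# A pattern that never occurs yields an empty integer vector
no_hits <- grep("JeffStreet", intersections)

print(hits)
print(length(no_hits))  # a length of zero is a convenient "no match" check
```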
http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf
More useful string functions
[1] 12
[1] "Jeffrey"
[1] "Jeffrey Leek"
More useful string functions
[1] "JeffreyLeek"
[1] "Jeff"
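The calls behind these outputs are not shown. One plausible reading is `nchar()`, `substr()`, `paste()`, `paste0()` and a whitespace-trimming call. The sketch below makes that assumption and uses base R's `trimws()` (R 3.2.0 or later) for the trimming step:

```r
full_name <- "Jeffrey Leek"

n_chars      <- nchar(full_name)            # number of characters, space included
first_name   <- substr(full_name, 1, 7)     # characters 1 through 7
joined       <- paste("Jeffrey", "Leek")    # default separator is one space
joined_tight <- paste0("Jeffrey", "Leek")   # no separator
trimmed      <- trimws("Jeff      ")        # drop leading and trailing whitespace
```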
Important points about text in data sets
- Names of variables should be
  - All lower case when possible
  - Descriptive (Diagnosis versus Dx)
  - Not duplicated
  - Free of underscores, dots, and white space
- Variables with character values
  - Should usually be made into factor variables (depends on application)
  - Should be descriptive (use TRUE/FALSE instead of 0/1 and Male/Female versus 0/1 or M/F)
4.2 Regular expressions
Regular expressions
- Regular expressions can be thought of as a combination of literals and metacharacters
- To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar
- Regular expressions have a rich set of metacharacters
Literals
The literal “Obama” would match the following lines
Politics r dum. Not 2 long ago Clinton was sayin Obama
was crap n now she sez vote 4 him n unite? WTF?
Screw em both + Mcain. Go Ron Paul!
Clinton conceeds to Obama but will her followers listen??
Are we sure Chelsea didn’t vote for Obama?
thinking ... Michelle Obama is terrific!
jetlag..no sleep...early mornig to starbux..Ms. Obama
was moving
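A literal match can be tried in R with `grepl()`. This is a sketch: the vector `tweets` mixes two lines from above with two made-up lines that should not match:

```r
tweets <- c("Clinton conceeds to Obama but will her followers listen??",
            "thinking ... Michelle Obama is terrific!",
            "thinking about lunch",        # made-up line, no match expected
            "obama in lower case")         # made-up line, literals are case sensitive

has_obama <- grepl("Obama", tweets)
print(has_obama)
```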
Regular Expressions
- The simplest pattern consists only of literals; a match occurs if the sequence of literals occurs anywhere in the text being tested
- What if we only want the word “Obama”, or sentences that end in the word “Clinton”, “clinton”, or “clinto”?
Regular Expressions
We need a way to express
- whitespace word boundaries
- sets of literals
- the beginning and end of a line
- alternatives (“war” or “peace”)
Metacharacters to the rescue!
Metacharacters
Some metacharacters represent positions rather than characters; ^ represents the start of a line
^i think
will match the lines
i think we all rule for participating
i think i have been outed
i think this will be quite fun actually
i think i need to go to work
i think i first saw zombo in 1999.
Metacharacters
$ represents the end of a line
morning$
will match the lines
well they had something this morning
then had to catch a tram home in the morning
dog obedience school in the morning
and yes happy birthday i forgot to say it earlier this morning
I walked in the rain this morning
good morning
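Both anchors can be checked with `grepl()`. A sketch; the last element of `txt` is made up to show a line that contains both phrases but in the wrong positions:

```r
txt <- c("i think we all rule for participating",
         "well they had something this morning",
         "good morning",
         "morning person, i think")   # made-up line

starts_with_i_think <- grepl("^i think", txt)   # phrase must begin the line
ends_with_morning   <- grepl("morning$", txt)   # phrase must end the line
```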
Character Classes with []
We can list a set of characters we will accept at a given point in the match
[Bb][Uu][Ss][Hh]
will match the lines
The democrats are playing, "Name the worst thing about Bush!"
I smelled the desert creosote bush, brownies, BBQ chicken
BBQ and bushwalking at Molonglo Gorge
Bush TOLD you that North Korea is part of the Axis of Evil
I’m listening to Bush - Hurricane (Album Version)
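A sketch of the same character-class pattern in R. For this particular pattern, `ignore.case = TRUE` with a plain literal gives the same result; the vector `txt` is shortened and partly made up:

```r
txt <- c("Bush TOLD you that North Korea is part of the Axis of Evil",
         "BBQ and bushwalking at Molonglo Gorge",
         "BUSH",              # made-up line
         "catch the bus")     # made-up line, no match expected

by_class <- grepl("[Bb][Uu][Ss][Hh]", txt)
by_flag  <- grepl("bush", txt, ignore.case = TRUE)
```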
Character Classes with []
^[Ii] am
will match
i am so angry at my boyfriend i can’t even bear to
look at him
i am boycotting the apple store
I am twittering from iPhone
I am a very vengeful person when you ruin my sweetheart.
I am so over this. I need food. Mmmm bacon...
Character Classes with []
Similarly, you can specify a range of letters [a-z] or [a-zA-Z]; notice that the order doesn’t matter
^[0-9][a-zA-Z]
will match the lines
7th inning stretch
2nd half soon to begin. OSU did just win something
3am - cant sleep - too hot still.. :(
5ft 7 sent from heaven
1st sign of starvagtion
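A sketch of the range pattern; the last two elements of `txt` are made up to show near misses:

```r
txt <- c("7th inning stretch",
         "3am - cant sleep - too hot still.. :(",
         "5 ft 7",       # made-up: digit followed by a space, not a letter
         "half 2nd")     # made-up: digit is not at the start of the line

digit_then_letter <- grepl("^[0-9][a-zA-Z]", txt)
```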
Character Classes with []
When used at the beginning of a character class, the “^” is also a metacharacter and indicates matching characters NOT in the indicated class
[^?.]$
will match the lines
i like basketballs
6 and 9
dont worry... we all die anyway!
Not in Baghdad
helicopter under water? hmmm
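That is, [^?.] matches any character other than “?” or “.”, so the pattern matches lines that do not end in a question mark or a period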
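A sketch of the negated class; the last two elements of `txt` are made-up lines that end in the excluded characters:

```r
txt <- c("i like basketballs",
         "helicopter under water? hmmm",
         "are you sure?",   # made-up: ends in a question mark
         "all done.")       # made-up: ends in a period

# Inside brackets "?" and "." are ordinary characters, so no escaping is needed
ends_ok <- grepl("[^?.]$", txt)
```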
More Metacharacters
“.” is used to refer to any character. So
9.11
will match the lines
its stupid the post 9-11 rules
if any 1 of us did 9/11 we would have been caught in days.
NetBios: scanning ip 203.169.114.66
Front Door 9:11:46 AM
Sings: 0118999881999119725...3 !
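A sketch contrasting the unescaped dot with an escaped one. The elements of `txt` are shortened or made up for illustration:

```r
txt <- c("its stupid the post 9-11 rules",
         "if any 1 of us did 9/11",
         "call 911",        # made-up: nothing between the 9 and the 11
         "from 9 to 11",    # made-up: more than one character in between
         "version 9.11")    # made-up: a literal period

any_char    <- grepl("9.11", txt)     # "." stands for exactly one character
literal_dot <- grepl("9\\.11", txt)   # escaped, so only a real period matches
```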
More Metacharacters: |
This does not mean “pipe” in the context of regular expressions; instead it translates to “or”; we can use it to combine two expressions, the subexpressions being called alternatives
flood|fire
will match the lines
is firewire like usb on none macs?
the global flood makes sense within the context of the bible
yeah ive had the fire on tonight
... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc.

More Metacharacters: |
We can include any number of alternatives...
flood|earthquake|hurricane|coldfire
will match the lines
Not a whole lot of hurricanes in the Arctic.
We do have earthquakes nearly every day somewhere in our State
hurricanes swirl in the other direction
coldfire is STRAIGHT!
’cause we keep getting earthquakes
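A sketch of alternation with two and with four alternatives; one element of `txt` is made up as a non-match:

```r
txt <- c("is firewire like usb on none macs?",
         "the global flood makes sense within the context of the bible",
         "calm seas today",          # made-up line, no match expected
         "coldfire is STRAIGHT!")

two_way  <- grepl("flood|fire", txt)
four_way <- grepl("flood|earthquake|hurricane|coldfire", txt)
```

Note that "fire" matches inside "firewire" and "coldfire", since alternatives are not restricted to whole words.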
More Metacharacters: |
The alternatives can be real expressions and not just literals
^[Gg]ood|[Bb]ad
will match the lines
good to hear some good knews from someone here
Good afternoon fellow american infidels!
good on you-what do you drive?
Katie... guess they had bad experiences...
my middle name is trouble, Miss Bad News
More Metacharacters: ( and )
Subexpressions are often contained in parentheses to constrain the alternatives
^([Gg]ood|[Bb]ad)
will match the lines
bad habbit
bad coordination today
good, becuase there is nothing worse than a man in kinky underwear
Badcop, its because people want to use drugs
Good Monday Holiday
Good riddance to Limey
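The effect of the parentheses can be seen by running both patterns on the same lines. A sketch using lines from the slides above:

```r
txt <- c("good to hear some good knews from someone here",
         "Katie... guess they had bad experiences...",
         "bad habbit")

# Without parentheses, "^" applies to the first alternative only
ungrouped <- grepl("^[Gg]ood|[Bb]ad", txt)

# With parentheses, "^" applies to both alternatives
grouped <- grepl("^([Gg]ood|[Bb]ad)", txt)
```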
More Metacharacters: ?
The question mark indicates that the indicated expression is optional
[Gg]eorge( [Ww]\.)? [Bb]ush
will match the lines
i bet i can spell better than you and george bush combined
BBC reported that President George W. Bush claimed God told him to invade I
a bird in the hand is worth two george bushes
That is, the parenthesized part in the middle, ( [Ww]\.), is optional
One thing to note...
In the following
[Gg]eorge( [Ww]\.)? [Bb]ush
we wanted to match a “.” as a literal period; to do that, we had to “escape” the metacharacter, preceding it with a backslash. In general, we have to do this for any metacharacter we want to include in our match
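A sketch of the optional group and the escaped period in R. Inside an R string the backslash itself has to be doubled; the third element of `txt` is made up to show what the escape prevents:

```r
txt <- c("i bet i can spell better than you and george bush combined",
         "BBC reported that President George W. Bush claimed",
         "George Wx Bush")   # made-up line: a letter where the period should be

escaped   <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", txt)  # literal period
unescaped <- grepl("[Gg]eorge( [Ww].)? [Bb]ush", txt)    # "." matches any character
```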
More metacharacters: * and +
The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”
\(.*\)
will match the lines
anyone wanna chat? (24, m, germany)
hello, 20.m here... ( east area + drives + webcam )
(he means older men)
()
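The example lines all contain a literal pair of parentheses. Unescaped parentheses only group, so matching literal ones requires escaping them. A sketch of the difference; the third element of `txt` is made up:

```r
txt <- c("anyone wanna chat? (24, m, germany)",
         "()",
         "no parentheses here")   # made-up line

literal_parens <- grepl("\\(.*\\)", txt)  # a "(" followed later by a ")"
grouping_only  <- grepl("(.*)", txt)      # a group around ".*", matches any line
```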
More metacharacters: * and +
The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”
[0-9]+ (.*)[0-9]+
will match the lines
working as MP here 720 MP battallion, 42nd birgade
so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin
it went down on several occasions for like, 3 or 4 *days*
Mmmm its time 4 me 2 go 2 bed
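A sketch of the two repetition operators, first on tiny made-up strings and then with the pattern above (the last two elements of `txt` are made up):

```r
small <- c("ac", "abc", "abbc")
star <- grepl("ab*c", small)   # zero or more "b"
plus <- grepl("ab+c", small)   # at least one "b"

txt <- c("working as MP here 720 MP battallion, 42nd birgade",
         "Mmmm its time 4 me 2 go 2 bed",
         "only 1 number here",    # made-up: no second number
         "no digits at all")      # made-up

# a number, a space, anything (possibly nothing), then another number
two_numbers <- grepl("[0-9]+ (.*)[0-9]+", txt)
```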
More metacharacters: { and }
{ and } are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression
[Bb]ush( +[^ ]+ +){1,5} debate
will match the lines
Bush has historically won all major debates he’s done.
in my view, Bush doesn’t need these debates..
bush doesn’t need the debates? maybe you are right
That’s what Bush supporters are doing about the debate.
Felix, I don’t disagree that Bush was poorly prepared for the debate.
indeed, but still, Bush should have taken the debate more seriously.
Keep repeating that Bush smirked and scowled during the debate
More metacharacters: { and }
- {m,n} means at least m but not more than n matches
- {m} means exactly m matches
- {m,} means at least m matches
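The three interval forms can be tried on short digit strings. The second part of this sketch uses a variant of the Bush/debate pattern, not the pattern from the slide: the trailing spaces are moved out of the group so that it also works when words are separated by single spaces. Both the variant and the test strings are assumptions made for illustration:

```r
digits <- c("7", "42", "2014", "12345")
two_to_four <- grepl("^[0-9]{2,4}$", digits)  # between 2 and 4 digits
exactly_two <- grepl("^[0-9]{2}$", digits)    # exactly 2 digits
two_or_more <- grepl("^[0-9]{2,}$", digits)   # 2 digits or more

txt <- c("Felix, I don't disagree that Bush was poorly prepared for the debate.",
         "Bush a b c d e f debate",   # made-up: six words in between
         "Bush debate")               # made-up: no word in between

# Variant pattern: one to five words between "Bush" and "debate"
gap_1_to_5 <- grepl("[Bb]ush( +[^ ]+){1,5} +debate", txt)
```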
More metacharacters: ( and ) revisited
- In most implementations of regular expressions, the parentheses not only limit the scope of alternatives divided by a “|”, but also can be used to “remember” text matched by the subexpression enclosed
- We refer to the matched text with \1, \2, etc.
More metacharacters: ( and ) revisited
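So the expression below, in which \1 repeats the text matched by the first parenthesized group once more (note that the pattern begins with a space),
 +([a-zA-Z]+) +\1 +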
will match the lines
time for bed, night night twitter!
blah blah blah blah
my tattoo is so so itchy today
i was standing all all alone against the world outside...
hi anybody anybody at home
estudiando css css css css.... que desastritooooo
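A sketch of the backreference in R, where `\1` is written `"\\1"` inside a string. The pattern is assumed to start with a space, as in " +([a-zA-Z]+) +\1 +"; the last two elements of `txt` are made up:

```r
txt <- c("time for bed, night night twitter!",
         "my tattoo is so so itchy today",
         "nothing repeated here today",   # made-up: no repeated word
         "so so itchy")                   # made-up: repeat at the very start

# spaces, a word, spaces, the same word again, spaces
repeated_word <- grepl(" +([a-zA-Z]+) +\\1 +", txt)
```

The last line does not match because the pattern requires at least one space before the first copy of the word.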
More metacharacters: ( and ) revisited
The * is “greedy” so it always matches the longest possible string that satisfies the regular expression. So
^s(.*)s
matches
sitting at starbucks
setting up mysql and rails
studying stuff for the exams
spaghetti with marshmallows
stop fighting with crackers
sore shoulders, stupid ergonomics
More metacharacters: ( and ) revisited
The greediness of * can be turned off with the ?, as in
^s(.*?)s$
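With the trailing $, the match still has to reach the end of the line, so the sketch below drops the $ to make the difference visible. It extracts the matched text with `regexpr()` and `regmatches()`, and sets `perl = TRUE` as an assumption to be explicit about which regex engine is used:

```r
x <- "sitting at starbucks"

# Greedy: ".*" grabs as much as possible, so the match runs to the last "s"
greedy <- regmatches(x, regexpr("^s(.*)s", x, perl = TRUE))

# Lazy: ".*?" grabs as little as possible, so the match stops at the next "s"
lazy <- regmatches(x, regexpr("^s(.*?)s", x, perl = TRUE))

print(greedy)
print(lazy)
```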
Summary
- Regular expressions are used in many different languages; not unique to R.
- Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words
- Text processing via regular expressions is a very powerful way to extract data from “unfriendly” sources (not all data comes as a CSV file)
- Used with the functions grep, grepl, sub, gsub and others that involve searching for text strings
(Thanks to Mark Hansen for some material in this lecture.)
4.3 Working with dates, data resources
Starting simple
[1] "Sun Jan 12 17:48:33 2014"
[1] "character"
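The output above is consistent with calling `date()`, which returns a character string. A sketch that contrasts it with `Sys.Date()`, which returns an object of class "Date" (the variable names are assumptions):

```r
d1 <- date()       # current date and time as plain text
d2 <- Sys.Date()   # current date as a Date object

print(class(d1))
print(class(d2))
```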
Formatting dates
- %d = day as number (01-31)
- %a = abbreviated weekday
- %A = unabbreviated weekday
- %m = month (01-12)
- %b = abbreviated month
- %B = unabbreviated month
- %y = 2 digit year
- %Y = four digit year
[1] "Sun Jan 12"
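A sketch of `format()` with these codes, using a fixed date so the result is reproducible. The date is an assumption chosen to agree with the output above; weekday and month names depend on the locale:

```r
d <- as.Date("2014-01-12")   # assumed date, fixed for reproducibility

short_form <- format(d, "%a %b %d")   # abbreviated weekday, month, day
iso_form   <- format(d, "%Y-%m-%d")   # four digit year, month, day
day_part   <- format(d, "%d")         # day of the month as two digits
year_part  <- format(d, "%y")         # two digit year

print(short_form)
```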
Creating dates
[1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
Time difference of -1 days
[1] -1
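A sketch of how character strings might be turned into dates to produce the outputs above. The input strings are an assumption, and parsing month abbreviations with %b assumes an English locale:

```r
# Assumed input: day, abbreviated month name, four digit year
x <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
z <- as.Date(x, "%d%b%Y")   # %b relies on English month names here

gap      <- z[1] - z[2]     # a difftime object
gap_days <- as.numeric(gap) # the bare number of days

print(z)
print(gap)
```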
Converting to Julian
[1] "Sunday"
[1] "January"
[1] 16082
attr(,"origin")
[1] "1970-01-01"
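The three outputs above match `weekdays()`, `months()` and `julian()` applied to a Date. A sketch with a fixed date, assumed so that the day count agrees with the output; the weekday and month names depend on the locale:

```r
d <- as.Date("2014-01-12")   # assumed date, fixed for reproducibility

day_name   <- weekdays(d)    # name of the weekday
month_name <- months(d)      # name of the month
day_count  <- julian(d)      # days since the origin, 1970-01-01

print(day_count)
```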
Lubridate
[1] "2014-01-08 UTC"
[1] "2013-08-04 UTC"
[1] "2013-04-03 UTC"
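A sketch of lubridate's parsing helpers, assuming the package is installed and that the inputs were written as below. The printed form may differ from the output above depending on the lubridate version:

```r
library(lubridate)   # assumes the lubridate package is installed

d1 <- ymd("20140108")     # year, month, day
d2 <- mdy("08/04/2013")   # month, day, year
d3 <- dmy("03-04-2013")   # day, month, year

print(c(format(d1), format(d2), format(d3)))
```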
http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
This package is reported to work quite well for handling date data.
Dealing with times
[1] "2011-08-03 10:15:03 UTC"
[1] "2011-08-03 10:15:03 NZST"
Some functions have slightly different syntax
[1] 3
[1] Tues
Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
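A sketch of time parsing with an explicit time zone and of `wday()`, assuming lubridate is installed. The time zone name "Pacific/Auckland" and the date are assumptions chosen to agree with the outputs above; label spellings vary between lubridate versions:

```r
library(lubridate)   # assumes the lubridate package is installed

t_utc <- ymd_hms("2011-08-03 10:15:03")                           # UTC by default
t_nz  <- ymd_hms("2011-08-03 10:15:03", tz = "Pacific/Auckland")  # New Zealand time

x <- ymd("2013-01-01")                 # assumed date, a Tuesday
day_number <- wday(x)                  # 1 = Sunday with the default week start
day_label  <- wday(x, label = TRUE)    # weekday as an ordered factor

print(day_number)
print(day_label)
```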
Notes and further resources
- More information in this nice lubridate tutorial http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
- The lubridate vignette has the same content http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
- Ultimately you want your dates and times as class "Date" or the classes "POSIXct", "POSIXlt". For more information type ?POSIXlt
Open Government Sites
- United Nations http://data.un.org/
- U.S. http://www.data.gov/
- United Kingdom http://data.gov.uk/
- France http://www.data.gouv.fr/
- Ghana http://data.gov.gh/
- Australia http://data.gov.au/
- Germany https://www.govdata.de/
- Hong Kong http://www.gov.hk/en/theme/psi/datasets/
- Japan http://www.data.go.jp/
- Many more http://www.data.gov/opendatasites
Collections by data scientists
- Hilary Mason http://bitly.com/bundles/hmason/1
- Peter Skomoroch https://delicious.com/pskomoroch/dataset
- Jeff Hammerbacher http://www.quora.com/Jeff-Hammerbacher/Introduction-to-Data-Science-Data-Sets
- Gregory Piatetsky-Shapiro http://www.kdnuggets.com/gps.html
- http://blog.mortardata.com/post/67652898761/6-dataset-lists-curated-by-data-scientists
More specialized collections
- Stanford Large Network Data
- UCI Machine Learning
- KDnuggets Datasets
- CMU Statlib
- Gene Expression Omnibus
- ArXiv Data
- Public Data Sets on Amazon Web Services
Some APIs with R interfaces
- Twitter and the twitteR package
- figshare and rfigshare
- PLoS and rplos
- rOpenSci
- Facebook and Rfacebook
- Google Maps and RgoogleMaps