Text-Mining-DataCamp-String Manipulation with stringr in R
1. String Basics
1.1 Welcome! (video)
1.2 Quotes
Instruction:
Following the guidelines for using quotes, define the three strings, line1, line2 and line3:
- Line 1:
The table was a large one, but the three were all crowded together at one corner of it:
- Line 2:
"No room! No room!" they cried out when they saw Alice coming.
- Line 3:
"There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.
# Define line1
line1 <- "The table was a large one, but the three were all crowded together at one corner of it:"
# Define line2
line2 <- '"No room! No room!" they cried out when they saw Alice coming.'
# Define line3
line3 <- '"There\'s plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.'
1.3 What you see isn’t always what you have
Instruction:
We’ve put your lines from Alice’s Adventures in Wonderland in a vector called lines.
- Take a look at lines to see R’s representation of the strings.
- Pass lines to writeLines() to see the content of the strings you’ve created.
- By default writeLines() separates the strings with a newline, which you can change using the sep argument. Write lines to the screen again, but this time set the sep argument to a space, " ".
- Finally, try using writeLines() on the string "hello\n\U1F30D". You’ll learn about what’s going on here in the next exercise.
# Putting lines in a vector
lines <- c(line1, line2, line3)
# Print lines
lines
# Use writeLines() on lines
writeLines(lines)
# Write lines with a space separator
writeLines(lines, sep = " ")
# Use writeLines() on the string "hello\n\U1F30D"
writeLines("hello\n\U1F30D")
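The gap between how R prints a string and what the string contains can be checked directly. The following is a minimal base R sketch; the string s is made up for illustration and is not part of the exercise.

```r
# Hypothetical example string: both quote types plus a newline escape
s <- "She said \"it's fine\"\nand left."

# print() shows R's representation: outer quotes kept, escapes shown as typed
printed <- capture.output(print(s))

# writeLines() shows the content: escapes are interpreted
written <- capture.output(writeLines(s))

length(printed)  # 1 line of output
length(written)  # 2 lines, because \n became a real line break
```

Capturing the output rather than reading it off the console makes the difference countable: the printed form stays on one line, while the written form breaks where the newline is.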
1.4 Escape sequences
Instruction:
- Edit the string inside writeLines() so that it correctly displays (all on one line):
To have a \ you need \\
- Edit the string inside writeLines() so that it correctly displays (with the line breaks in these positions):
This is a really
really really
long string
- Try writeLines() with the string containing Unicode characters: "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e". You just said "Hello World" in Hindi!
# Should display: To have a \ you need \\
writeLines("To have a \\ you need \\\\")
# Should display:
# This is a really
# really really
# long string
writeLines("This is a really \nreally really \nlong string")
# Use writeLines() with
# "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"
writeLines("\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e")
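An escape sequence takes several keystrokes to type but is stored as a single character. The sketch below, in base R, counts characters to make that visible; the variable name pieces is only illustrative.

```r
# Each escape sequence below is stored as exactly one character
nchar("\\")      # a single backslash
nchar("\n")      # a single newline
nchar("\u0928")  # a single Devanagari letter

# Splitting on the newline recovers the three display lines
pieces <- strsplit("This is a really\nreally really\nlong string",
                   "\n", fixed = TRUE)[[1]]
length(pieces)
```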
1.5 Turning numbers into strings (video)
1.6 Using format() with numbers
Instruction:
- Format c(0.0011, 0.011, 1) with digits = 1. This is like the example described above.
- Now, format c(1.0011, 2.011, 1) with digits = 1. Try to predict what you might get before you try it.
- Format percent_change by choosing the digits argument so that the values are presented with one place after the decimal point.
- Format income by choosing the digits argument so that the values are presented as whole numbers (i.e. no places after the decimal point).
- Format p_values using a fixed representation.
# Some vectors of numbers
percent_change <- c(4, -1.91, 3.00, -5.002)
income <- c(72.19, 1030.18, 10291.93, 1189192.18)
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)
# Format c(0.0011, 0.011, 1) with digits = 1
format(c(0.0011, 0.011, 1), digits = 1)
# Format c(1.0011, 2.011, 1) with digits = 1
format(c(1.0011, 2.011, 1), digits = 1)
# Format percent_change to one place after the decimal point
format(percent_change, digits = 2)
# Format income to whole numbers
format(income, digits = 2)
# Format p_values in fixed format
format(p_values, scientific = FALSE)
1.7 Controlling other aspects of the string
Instruction:
We’ve assigned your formatted income from the previous exercise to formatted_income.
- Print formatted_income. Notice the spaces at the start of the strings.
- Call writeLines() on the formatted income. Notice how the numbers line up on the decimal point.
- Define trimmed_income by using format() on income with digits = 2 and trim = TRUE.
- Call writeLines() on trimmed_income. Notice how this removes the spaces at the start of the strings and the values line up on the left.
- Define pretty_income by using format() on income with digits = 2 and big.mark = ",".
- Call writeLines() on pretty_income.
formatted_income <- format(income, digits = 2)
# Print formatted_income
formatted_income
# Call writeLines() on the formatted income
writeLines(formatted_income)
# Define trimmed_income
trimmed_income <- format(income, digits = 2, trim = TRUE)
# Call writeLines() on the trimmed_income
writeLines(trimmed_income)
# Define pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",")
# Call writeLines() on the pretty_income
writeLines(pretty_income)
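The effect of trim and big.mark is easiest to see side by side. This is a small sketch that reuses the income vector from the exercise; the names padded, trimmed and pretty are illustrative only.

```r
# Same income vector as the exercise
income <- c(72.19, 1030.18, 10291.93, 1189192.18)

padded  <- format(income, digits = 2)                  # common width, left-padded
trimmed <- format(income, digits = 2, trim = TRUE)     # padding removed
pretty  <- format(income, digits = 2, big.mark = ",")  # thousands separators

nchar(padded)       # every element has the same width
trimmed             # "72" "1030" "10292" "1189192"
writeLines(pretty)  # right-aligned column with commas
```

Padding to a common width is what makes the values line up under writeLines(); trim = TRUE gives that up in exchange for strings with no leading spaces.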
1.8 formatC()
Instruction:
The vectors income, percent_change, and p_values are available in your workspace.
- First, compare the behavior of formatC() to format() by calling formatC() on x with format = "f" and digits = 1. This is the same vector you used with format(); do you see the difference?
- Call formatC() on y with format = "f" and digits = 1. Notice how digits has consistent behavior regardless of the vector you format.
- Format percent_change to one place after the decimal point.
- Format percent_change to one place after the decimal point and add flag = "+". This forces the display of the sign.
- Format p_values using format = "g" and digits = 2. This can be useful, since if there are any p-values in scientific notation, they must be < 0.0001.
# From the format() exercise
x <- c(0.0011, 0.011, 1)
y <- c(1.0011, 2.011, 1)
# formatC() on x with format = "f", digits = 1
formatC(x, format = "f", digits = 1)
# formatC() on y with format = "f", digits = 1
formatC(y, format = "f", digits = 1)
# Format percent_change to one place after the decimal point
formatC(percent_change, format = "f", digits = 1)
# percent_change with flag = "+"
formatC(percent_change, format = "f", digits = 1, flag = "+")
# Format p_values using format = "g" and digits = 2
formatC(p_values, format = "g", digits = 2)
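The contrast between the two functions comes down to what digits means in each. The sketch below uses the same x and y as the exercise; fx, fy and signed are illustrative names.

```r
x <- c(0.0011, 0.011, 1)
y <- c(1.0011, 2.011, 1)

# format(): digits is a minimum number of significant digits,
# so the number of decimal places depends on the vector
format(x, digits = 1)  # "0.001" "0.011" "1.000"
format(y, digits = 1)  # "1" "2" "1"

# formatC() with format = "f": digits is always places after the decimal point
fx <- formatC(x, format = "f", digits = 1)  # "0.0" "0.0" "1.0"
fy <- formatC(y, format = "f", digits = 1)  # "1.0" "2.0" "1.0"

# flag = "+" forces a sign on positive values as well
signed <- formatC(c(4, -1.91), format = "f", digits = 1, flag = "+")
signed  # "+4.0" "-1.9"
```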
1.9 Putting strings together (video)
1.10 Annotation of numbers
Instruction:
We’ve put the formatted vectors pretty_income and pretty_percent in your workspace along with years.
- Paste a $ to the front of each value in pretty_income, use sep = "", so there is no space between the $ and value.
- Paste a % to the end of each value in pretty_percent, use sep = "", so there is no space between the value and the %.
- years contains the year each pretty_percent corresponds to. Use paste() to produce a vector with elements like 2010: +4.0% and assign it to year_percent.
- Use paste() with year_percent to create a single string that collapses all the years: 2010: +4.0%, 2011: -1.9%, 2012: +3.0%, 2013: -5.0%
# Add $ to pretty_income
paste("$", pretty_income, sep = "")
# Add % to pretty_percent
paste(pretty_percent, "%", sep = "")
# Create vector with elements like 2010: +4.0%
year_percent <- paste(years, ": ", pretty_percent, "%", sep = "")
# Collapse all years into single string
paste(year_percent, collapse = ", ")
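In paste(), sep joins the pieces within each element, while collapse joins the elements into one string. The sketch below is self-contained; since the workspace vectors are not shown in these notes, years and pretty_percent are reconstructed here from the target string in the instruction and should be treated as assumptions.

```r
# Assumed stand-ins for the pre-loaded workspace vectors
years <- c(2010, 2011, 2012, 2013)
pretty_percent <- c("+4.0", "-1.9", "+3.0", "-5.0")

# sep works element-wise: the result still has four elements
year_percent <- paste(years, ": ", pretty_percent, "%", sep = "")
length(year_percent)

# collapse reduces the vector to a single string
all_years <- paste(year_percent, collapse = ", ")
all_years
```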
1.11 A very simple table
Instruction:
The income vector is loaded in your workspace.
- Create pretty_income by using format() with digits = 2 and big.mark = ",".
- Create dollar_income by pasting $ to pretty_income (don’t forget to set the sep argument).
- Create formatted_names by using format() on income_names with justify = "right".
- Create rows by pasting together formatted_names and dollar_income. Use three spaces as a separator to give some room between your columns. Be sure to surround your separator in ".
- Call writeLines() on rows to see your table.
# Define the names vector
income_names <- c("Year 0", "Year 1", "Year 2", "Project Lifetime")
# Create pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",")
# Create dollar_income
dollar_income <- paste("$", pretty_income, sep = "")
# Create formatted_names
formatted_names <- format(income_names, justify = "right")
# Create rows
rows <- paste(formatted_names, dollar_income, sep = "   ")
# Write rows
writeLines(rows)
1.12 Let’s order pizza!
Instruction:
- Print my_toppings to see your random toppings.
- Add "and " to the start of the third element by using paste() with my_toppings and a vector you define.
- Create a vector these_toppings by using paste() to collapse my_toppings_and with a comma and space between each element.
- Create my_order by pasting "I want to order a pizza with " to these_toppings and ending with a period, ".".
- Order your pizza by calling writeLines() on my_order.
- Try re-running all your code (including the sampling of toppings). You should get a brand new pizza order!
# Randomly sample 3 toppings
my_toppings <- sample(toppings, size = 3)
# Print my_toppings
my_toppings
# Paste "and " to last element: my_toppings_and
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")
# Collapse with comma space: these_toppings
these_toppings <- paste(my_toppings_and, collapse = ", ")
# Add rest of sentence: my_order
my_order <- paste("I want to order a pizza with ", these_toppings, ".", sep = "")
# Order pizza with writeLines()
writeLines(my_order)
2. Introduction to stringr
2.1 Introducing stringr (video)
2.2 Putting strings together with stringr
Instruction:
library(stringr)
my_toppings <- c("cheese", NA, NA)
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")
# Print my_toppings_and
my_toppings_and
# Use str_c() instead of paste(): my_toppings_str
my_toppings_str <- str_c(c("", "", "and "), my_toppings, sep = "")
# Print my_toppings_str
my_toppings_str
# paste() my_toppings_and with collapse = ", "
paste(my_toppings_and, collapse = ", ")
# str_c() my_toppings_str with collapse = ", "
str_c(my_toppings_str, collapse = ", ")
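The point of this exercise is how the two functions treat missing values: paste() turns NA into the text "NA", while str_c() keeps it missing. A short sketch using the same toppings vector (the result names are illustrative):

```r
library(stringr)

my_toppings <- c("cheese", NA, NA)

# paste() converts NA to the string "NA"
with_paste <- paste(c("", "", "and "), my_toppings, sep = "")
with_paste  # "cheese" "NA" "and NA"

# str_c() propagates the missing value instead
with_str_c <- str_c(c("", "", "and "), my_toppings, sep = "")
with_str_c  # "cheese" NA NA

# Collapsing shows the consequence: one NA makes the whole result NA
collapsed <- str_c(with_str_c, collapse = ", ")
collapsed   # NA
```

Propagating NA is usually the safer behavior, because a missing input cannot silently end up as literal text in the output.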
2.3 String length
Instruction:
library(stringr)
library(babynames)
library(dplyr)
# Extracting vectors for boys' and girls' names
babynames_2014 <- filter(babynames, year == 2014)
boy_names <- filter(babynames_2014, sex == "M")$name
girl_names <- filter(babynames_2014, sex == "F")$name
# Take a look at a few boy_names
head(boy_names)
# Find the length of all boy_names
boy_length <- str_length(boy_names)
# Take a look at a few lengths
head(boy_length)
# Find the length of all girl_names
girl_length <- str_length(girl_names)
# Find the difference in mean length
mean(girl_length) - mean(boy_length)
# Confirm str_length() works with factors
head(str_length(factor(boy_names)))
2.4 Extracting substrings
Instruction:
# Extract first letter from boy_names
boy_first_letter <- str_sub(boy_names, 1, 1)
# Tabulate occurrences of boy_first_letter
table(boy_first_letter)
# Extract the last letter in boy_names, then tabulate
boy_last_letter <- str_sub(boy_names, -1,-1)
table(boy_last_letter)
# Extract the first letter in girl_names, then tabulate
girl_first_letter <- str_sub(girl_names, 1, 1)
table(girl_first_letter)
# Extract the last letter in girl_names, then tabulate
girl_last_letter <- str_sub(girl_names, -1, -1)
table(girl_last_letter)
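Negative positions in str_sub() count from the end of the string, which is why -1, -1 picks out the last letter. A small sketch on a made-up sample of names (nms is hypothetical and not the babynames data):

```r
library(stringr)

# Hypothetical sample, standing in for boy_names
nms <- c("Noah", "Liam", "Geoffrey")

first_letter <- str_sub(nms, 1, 1)    # "N" "L" "G"
last_letter  <- str_sub(nms, -1, -1)  # "h" "m" "y"
last_two     <- str_sub(nms, -2, -1)  # "ah" "am" "ey"
```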
2.5 Hunting for matches (video)
2.6 Detecting matches
Instruction:
# Look for pattern "zz" in boy_names
contains_zz <- str_detect(boy_names, pattern = fixed("zz"))
# Examine str() of contains_zz
str(contains_zz)
# How many names contain "zz"?
sum(contains_zz)
# Which names contain "zz"?
boy_names[contains_zz]
# Which rows in boy_df have names that contain "zz"?
boy_df <- filter(babynames_2014, sex == "M")
boy_df[contains_zz,]
2.7 Subsetting strings based on match
Instruction:
# Find boy_names that contain "zz"
str_subset(boy_names, pattern = fixed("zz"))
# Find girl_names that contain "zz"
str_subset(girl_names, pattern = fixed("zz"))
# Find girl_names that contain "U"
starts_U <- str_subset(girl_names, pattern = fixed("U"))
starts_U
# Find girl_names that contain "U" and "z"
str_subset(starts_U, pattern = "z")
2.8 Counting matches
Instruction:
# Count occurrences of "a" in girl_names
number_as <- str_count(girl_names, pattern = fixed("a"))
# Count occurrences of "A" in girl_names
number_As <- str_count(girl_names, pattern = fixed("A"))
# Histograms of number_as and number_As
hist(number_as)
hist(number_As)
# Find total "a" + "A"
total_as <- number_As + number_as
# girl_names with more than 4 a's
girl_names[total_as > 4]
2.9 Splitting strings (video)
2.10 Parsing strings into variables
Instruction 1:
# Some date data
date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")
# Split dates using " - "
split_dates <- str_split(date_ranges, pattern = fixed(" - "))
split_dates
Instruction 2:
# Some date data
date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")
# Split dates with n and simplify specified
split_dates_n <- str_split(date_ranges, pattern = fixed(" - "), simplify = TRUE, n = 2)
split_dates_n
Instruction 3:
# From previous step
date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")
split_dates_n <- str_split(date_ranges, fixed(" - "), n = 2, simplify = TRUE)
# Subset split_dates_n into start_dates and end_dates
start_dates <- split_dates_n[, 1]
end_dates <- split_dates_n[, 2]
# Split start_dates into day, month and year pieces
str_split(start_dates, pattern = fixed("."), simplify = TRUE)
Instruction 4:
both_names <- c("Box, George", "Cox, David")
# Split both_names into first_names and last_names
both_names_split <- str_split(both_names, pattern = fixed(", "), simplify = TRUE)
# Get first names
first_names <- both_names_split[, 2]
# Get last names
last_names <- both_names_split[, 1]
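The simplify argument decides the shape of what str_split() returns: a list by default, a character matrix when simplify = TRUE. The sketch below reuses date_ranges from the exercise; as_list and as_matrix are illustrative names.

```r
library(stringr)

date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")

# Default: a list with one character vector per input string
as_list <- str_split(date_ranges, fixed(" - "))
class(as_list)

# simplify = TRUE: a matrix with one row per input and one column per piece
as_matrix <- str_split(date_ranges, fixed(" - "), n = 2, simplify = TRUE)
dim(as_matrix)

# Columns can then be pulled out with ordinary matrix indexing
as_matrix[, 1]  # the start dates
```

The matrix form is convenient when every string splits into the same number of pieces, as it does here.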
2.11 Some simple text statistics
Instruction:
# Split lines into words
words <- str_split(lines, pattern = fixed(" "))
# Number of words per line
lapply(words, length)
# Number of characters in each word
word_lengths <- lapply(words, str_length)
# Average word length per line
lapply(word_lengths, mean)
2.12 Replacing matches in strings (video)
2.13 Replacing to tidy strings
Instruction 1:
# Some IDs
ids <- c("ID#: 192", "ID#: 118", "ID#: 001")
# Replace "ID#: " with ""
id_nums <- str_replace(ids, "ID#: ", "")
# Turn id_nums into numbers
id_ints <- as.numeric(id_nums)
Instruction 2:
# Some (fake) phone numbers
phone_numbers <- c("510-555-0123", "541-555-0167")
# Use str_replace() to replace "-" with " "
str_replace(phone_numbers, "-", " ")
# Use str_replace_all() to replace "-" with " "
str_replace_all(phone_numbers, "-", " ")
# Turn phone numbers into the format xxx.xxx.xxxx
str_replace_all(phone_numbers, "-", ".")
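The only difference between the two functions is how many matches they touch: str_replace() changes the first match in each string, str_replace_all() changes every match. A brief sketch with the same phone numbers (first_only and every are illustrative names):

```r
library(stringr)

phone_numbers <- c("510-555-0123", "541-555-0167")

# Only the first "-" in each string is replaced
first_only <- str_replace(phone_numbers, "-", " ")
first_only  # "510 555-0123" "541 555-0167"

# Every "-" is replaced
every <- str_replace_all(phone_numbers, "-", " ")
every       # "510 555 0123" "541 555 0167"
```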
2.14 Review
Instruction:
# Find the number of nucleotides in each sequence
str_length(genes)
# Find the number of A's that occur in each sequence
str_count(genes, pattern = fixed("A"))
# Return the sequences that contain "TTTTTT"
str_subset(genes, pattern = fixed("TTTTTT"))
# Replace all the "A"s in the sequences with a "_"
str_replace_all(genes, pattern = fixed("A"), replacement = "_")
2.15 Final challenges
Instruction 1:
# Define some full names
names <- c("Diana Prince", "Clark Kent")
# Split into first and last names
names_split <- str_split(names, pattern = fixed(" "), simplify = TRUE)
# Extract the first letter in the first name
abb_first <- str_sub(names_split[, 1], 1, 1)
# Combine the first letter ". " and last name
str_c(abb_first,". ", names_split[,2])
Instruction 2:
# Use all names in babynames_2014
all_names <- babynames_2014$name
# Get the last two letters of all_names
last_two_letters <- str_sub(all_names, -2, -1)
# Does the name end in "ee"?
ends_in_ee <- str_detect(last_two_letters, pattern = fixed("ee"))
# Extract rows and "sex" column
sex <- babynames_2014$sex[ends_in_ee]
# Display result as a table
table(sex)
3. Pattern Matching with Regular Expressions
3.1 Regular expressions (video)
3.2 Matching the start or end of the string
Instruction 1:
# rebus supplies START, END and %R%; stringr supplies str_view()
library(stringr)
library(rebus)
# Some strings to practice with
x <- c("cat", "coat", "scotland", "tic toc")
# Print END
END
# Run me
str_view(x, pattern = START %R% "c")
Instruction 2:
# Match the strings that start with "co"
str_view(x, pattern = START %R% "co")
Instruction 3:
# Match the strings that end with "at"
str_view(x, pattern = "at" %R% END)
Instruction 4:
# Match the string that is exactly "cat"
str_view(x, pattern = START %R% "cat" %R% END)
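The rebus constants START and END stand for the regular expression anchors ^ and $, so the same checks can be written with plain regex strings. The sketch below uses str_detect() instead of str_view() so the results are ordinary logical vectors; this is an equivalent formulation, not the course's own solution.

```r
library(stringr)

x <- c("cat", "coat", "scotland", "tic toc")

# ^ anchors the match at the start of the string
starts_co <- str_detect(x, "^co")    # FALSE TRUE FALSE FALSE

# $ anchors the match at the end of the string
ends_at <- str_detect(x, "at$")      # TRUE TRUE FALSE FALSE

# Both anchors together require the whole string to match
is_cat <- str_detect(x, "^cat$")     # TRUE FALSE FALSE FALSE
```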
3.3 Matching any character
Instruction 1:
# Match two characters, where the second is a "t"
str_view(x, pattern = ANY_CHAR %R% "t")
Instruction 2:
# Match a "t" followed by any character
str_view(x, pattern = "t" %R% ANY_CHAR)
Instruction 3:
# Match two characters
str_view(x, pattern = ANY_CHAR %R% ANY_CHAR)
Instruction 4:
# Match a string with exactly three characters
str_view(x, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% END)
3.4 Combining with stringr functions
Instruction 1:
pattern <- "q" %R% ANY_CHAR
# Find names that have the pattern
names_with_q <- str_subset(boy_names, pattern)
# How many names were there?
length(names_with_q)
Instruction 2:
# Find part of name that matches pattern
part_with_q <- str_extract(boy_names, pattern)
# Get a table of counts
table(part_with_q)
Instruction 3:
# Did any names have the pattern more than once?
count_of_q <- str_count(boy_names, pattern)
# Get a table of counts
table(count_of_q)
Instruction 4:
# Which babies got these names? (get logical vector back)
with_q <- str_detect(boy_names, pattern)
# What fraction of babies got these names? (get mean)
mean(with_q)
3.5 More regular expressions (video)
3.6 Alternation
Instruction 1:
# Match Jeffrey or Geoffrey
whole_names <- or("Jeffrey", "Geoffrey")
str_view(boy_names, pattern = whole_names,
match = TRUE)
Instruction 2:
# Match Jeffrey or Geoffrey, another way
common_ending <- or("Je", "Geo") %R% "ffrey"
str_view(boy_names, pattern = common_ending,
match = TRUE)
Instruction 3:
# Match with alternate endings
by_parts <- or("Je", "Geo") %R% "ff" %R% or("ry", "ery", "rey", "erey")
str_view(boy_names,
pattern = by_parts,
match = TRUE)
Instruction 4:
# Match names that start with Cath or Kath
ckath <- or("C", "K") %R% "ath"
str_view(girl_names, pattern = ckath, match = TRUE)
3.7 Character classes
Instruction 1:
# Create character class containing vowels
vowels <- char_class("aeiouAEIOU")
# Print vowels
vowels
# See vowels in x with str_view()
str_view(x, vowels)
Instruction 2:
# See vowels in x with str_view_all()
str_view_all(x, vowels)
Instruction 3:
# Number of vowels in boy_names
num_vowels <- str_count(boy_names, vowels)
# Number of characters in boy_names
name_length <- str_length(boy_names)
Instruction 4:
# Calc mean number of vowels
mean(num_vowels)
# Calc mean fraction of vowels per name
mean(num_vowels/name_length)
3.8 Repetition
Instruction 1:
# Vowels from last exercise
vowels <- char_class("aeiouAEIOU")
# See names with only vowels
str_view(boy_names,
pattern = exactly(one_or_more(vowels)),
match = TRUE)
Instruction 2:
# Use `negated_char_class()` for everything but vowels
not_vowels <- negated_char_class("aeiouAEIOU")
# See names with no vowels
str_view(boy_names,
pattern = exactly(one_or_more(not_vowels)),
match = TRUE)
3.9 Shortcuts (video)
3.10 Hunting for phone numbers
Instruction 1:
# Create a three digit pattern and test
three_digits <- DGT %R% DGT %R% DGT
# Test it
str_view_all(contact, pattern = three_digits)
Instruction 2:
# Create a separator pattern and test
separator <- char_class("-.() ")
# Test it
str_view_all(contact, pattern = separator)
Instruction 3:
# Use these components
three_digits <- DGT %R% DGT %R% DGT
four_digits <- three_digits %R% DGT
separator <- char_class("-.() ")
# Create phone pattern
phone_pattern <- optional(OPEN_PAREN) %R%
three_digits %R%
zero_or_more(separator) %R%
three_digits %R%
zero_or_more(separator) %R%
four_digits
# Test pattern
str_view_all(contact, phone_pattern)
Instruction 4:
# Use this pattern
three_digits <- DGT %R% DGT %R% DGT
four_digits <- three_digits %R% DGT
separator <- char_class("-.() ")
phone_pattern <- optional(OPEN_PAREN) %R%
three_digits %R%
zero_or_more(separator) %R%
three_digits %R%
zero_or_more(separator) %R%
four_digits
# Extract phone numbers
str_extract(contact, phone_pattern)
# Extract ALL phone numbers
str_extract_all(contact, phone_pattern)
3.11 Extracting age and gender from accident narratives
Instruction 1:
# Pattern to match one or two digits
age <- DGT %R% optional(DGT)
# Test it
str_view(narratives, pattern = age)
Instruction 2:
# Use this pattern
age <- DGT %R% optional(DGT)
# Pattern to match units
unit <- optional(SPC) %R% or("YO", "YR", "MO")
# Test pattern with age then units
str_view(narratives,
pattern = age %R% unit)
Instruction 3:
# Use these patterns
age <- DGT %R% optional(DGT)
unit <- optional(SPC) %R% or("YO", "YR", "MO")
# Pattern to match gender
gender <- optional(SPC) %R% char_class("MF")
# Test pattern with age then units then gender
str_view(narratives,
pattern = age %R% unit %R% gender)
Instruction 4:
# Use these patterns
age <- DGT %R% optional(DGT)
unit <- optional(SPC) %R% or("YO", "YR", "MO")
gender <- optional(SPC) %R% or("M", "F")
# Extract age_gender, take a look
age_gender <- str_extract(narratives, pattern = age %R% unit %R% gender)
age_gender
3.12 Parsing age and gender into pieces
Instruction 1:
# age_gender, age, gender, unit are pre-defined
ls.str()
# Extract age and make numeric
as.numeric(str_extract(age_gender, age))
Instruction 2:
# Replace age and units with ""
genders <- str_remove(age_gender, pattern = age %R% unit)
# Replace extra spaces
str_remove_all(genders, pattern = one_or_more(SPC))
Instruction 3:
# Numeric ages, from previous step
ages_numeric <- as.numeric(str_extract(age_gender, age))
# Extract units
time_units <- str_extract(age_gender, unit)
# Extract first word character
time_units_clean <- str_extract(time_units, WRD)
# Turn ages in months to years
ifelse(time_units_clean == "Y", ages_numeric, ages_numeric / 12)
4. More Advanced Matching and Manipulation
4.1 Capturing (video)
4.2 Capturing parts of a pattern
Instruction 1:
# Capture part between @ and . and after .
email <- capture(one_or_more(WRD)) %R%
"@" %R% capture(one_or_more(WRD)) %R%
DOT %R% capture(one_or_more(WRD))
# Check match hasn't changed
str_view(hero_contacts, pattern = email)
Instruction 2:
# Pattern from previous step
email <- capture(one_or_more(WRD)) %R%
"@" %R% capture(one_or_more(WRD)) %R%
DOT %R% capture(one_or_more(WRD))
# Pull out match and captures
email_parts <- str_match(hero_contacts, pattern = email)
email_parts
# Save host
host <- email_parts[, 3]
host
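str_match() returns a matrix whose first column is the whole match and whose remaining columns are the captured groups, which is why the host sits in column 3. The sketch below writes the pattern as a plain regex rather than with rebus; the contact string and address are hypothetical, since hero_contacts is not shown in these notes.

```r
library(stringr)

# Hypothetical input, standing in for hero_contacts
contact_line <- "Reach me at someone@example.com today"

# Three capture groups: name, host, top-level domain
email_regex <- "(\\w+)@(\\w+)\\.(\\w+)"

parts <- str_match(contact_line, email_regex)
dim(parts)   # 1 row, 4 columns: whole match plus three captures
parts[, 1]   # "someone@example.com"
parts[, 3]   # "example", the second capture
```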
4.3 Pulling out parts of a phone number
Instruction 1:
# View text containing phone numbers
contact
# Add capture() to get digit parts
phone_pattern <- capture(three_digits) %R% zero_or_more(separator) %R%
capture(three_digits) %R% zero_or_more(separator) %R%
capture(four_digits)
# Pull out the parts with str_match()
phone_numbers <- str_match(contact, phone_pattern)
# Put them back together
str_c(
"(",
phone_numbers[, 2],
") ",
phone_numbers[, 3],
"-",
phone_numbers[, 4])
4.4 Extracting age and gender again
Instruction 1:
# narratives has been pre-defined
narratives
# Add capture() to get age, unit and sex
pattern <- capture(optional(DGT) %R% DGT) %R%
optional(SPC) %R% capture(or("YO", "YR", "MO")) %R%
optional(SPC) %R% capture(or("M", "F"))
# Pull out from narratives
str_match(narratives, pattern = pattern)
Instruction 2:
# Edit to capture just Y and M in units
pattern2 <- capture(optional(DGT) %R% DGT) %R%
optional(SPC) %R% capture(or("Y", "M")) %R% optional(or("O","R")) %R%
optional(SPC) %R% capture(or("M", "F"))
# Check pattern
str_view(narratives, pattern = pattern2)
# Pull out pieces
str_match(narratives, pattern = pattern2)
4.5 Backreferences (video)
4.6 Using backreferences in patterns
Instruction 1:
# See names with three repeated letters
repeated_three_times <- capture(LOWER) %R% REF1 %R% REF1
# Test it
str_view(boy_names, pattern = repeated_three_times, match = TRUE)
Instruction 2:
# See names with a pair of repeated letters, e.g. abab
pair_of_repeated <- capture(LOWER %R% LOWER) %R% REF1
# Test it
str_view(boy_names, pattern = pair_of_repeated, match = TRUE)
Instruction 3:
# See names with a pair that reverses, e.g. abba
pair_that_reverses <- capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1
# Test it
str_view(boy_names, pattern = pair_that_reverses, match = TRUE)
Instruction 4:
# Four letter palindrome names
four_letter_palindrome <- exactly(
capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1
)
# Test it
str_view(boy_names, pattern = four_letter_palindrome, match = TRUE)
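A backreference such as REF1 matches whatever text the corresponding capture group matched earlier in the same pattern. The sketch below expresses three of the patterns as plain regex, where \\1 and \\2 play the role of REF1 and REF2; the lower-case sample strings are hypothetical.

```r
library(stringr)

# Hypothetical lower-case samples
nms <- c("anna", "otto", "abab", "noah")

# A letter followed by itself
doubled <- str_detect(nms, "([a-z])\\1")          # TRUE TRUE FALSE FALSE

# A two-letter pair followed by the same pair, e.g. abab
repeated_pair <- str_detect(nms, "([a-z][a-z])\\1")  # FALSE FALSE TRUE FALSE

# Four-letter palindrome: two letters, then the same two in reverse order
palindrome <- str_detect(nms, "^([a-z])([a-z])\\2\\1$")  # TRUE TRUE FALSE FALSE
```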
4.7 Replacing with regular expressions
Instruction:
# View text containing phone numbers
contact
# Replace digits with "X"
str_replace(contact, pattern = DGT, replacement = "X")
# Replace all digits with "X"
str_replace_all(contact, pattern = DGT, replacement = "X")
# Replace all digits with different symbol
str_replace_all(contact, pattern = DGT,
replacement = c("X", ".", "*", "_"))
4.8 Replacing with backreferences
Instruction:
# Build pattern to match words ending in "ING"
pattern <- one_or_more(WRD) %R% "ING"
str_view(narratives, pattern)
# Test replacement
str_replace(narratives, capture(pattern),
str_c("CARELESSLY", REF1, sep = " "))
# One adverb per narrative
adverbs_10 <- sample(adverbs, 10)
4.9 Unicode and pattern matching (video)
4.10 Matching a specific code point or code groups
Instruction:
# Names with builtin accents
(tay_son_builtin <- c(
"Nguy\u1ec5n Nh\u1ea1c",
"Nguy\u1ec5n Hu\u1ec7",
"Nguy\u1ec5n Quang To\u1ea3n"
))
# Convert to separate accents (stri_trans_nfd() comes from stringi)
library(stringi)
tay_son_separate <- stri_trans_nfd(tay_son_builtin)
# Verify that the string prints the same
tay_son_separate
# Match all accents
str_view_all(tay_son_separate, pattern = UP_DIACRITIC)
4.11 Matching a single grapheme
Instruction 1:
# tay_son_separate has been pre-defined
tay_son_separate
# View all the characters in tay_son_separate
str_view_all(tay_son_separate, pattern = ANY_CHAR)
Instruction 2:
# View all the graphemes in tay_son_separate
str_view_all(tay_son_separate, pattern = GRAPHEME)
Instruction 3:
# Combine the diacritics with their letters
tay_son_builtin <- stri_trans_nfc(tay_son_separate)
# View all the graphemes in tay_son_builtin
str_view_all(tay_son_builtin, pattern = GRAPHEME)
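The difference between code points and graphemes can also be shown by counting rather than viewing. The sketch below uses stringi directly on a single accented letter, which is a simpler stand-in for the names in the exercise.

```r
library(stringi)

built_in <- "\u00e9"                  # e with acute accent as one code point
separate <- stri_trans_nfd(built_in)  # base letter plus a combining accent

stri_length(built_in)  # 1 code point
stri_length(separate)  # 2 code points

# Counting character boundaries counts graphemes: still one visible character
graphemes <- stri_count_boundaries(separate, type = "character")
graphemes

# NFC recombines the pieces into the precomposed form
recombined <- stri_trans_nfc(separate)
recombined == built_in
```

Both forms print identically, which is why a pattern that matches one character at a time can behave differently on text that looks the same.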
5. Case Studies
5.1 A case study, reading a play (video)
5.2 Getting the play into R
Instruction 1:
# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)
Instruction 2:
# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)
# Detect start and end lines
start <- which(str_detect(earnest, fixed("START OF THE PROJECT")))
end <- which(str_detect(earnest, fixed("END OF THE PROJECT")))
# Get rid of gutenberg intro text
earnest_sub <- earnest[(start + 1):(end - 1)]
Instruction 3:
# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)
# Detect start and end lines
start <- str_which(earnest, fixed("START OF THE PROJECT"))
end <- str_which(earnest, fixed("END OF THE PROJECT"))
# Get rid of gutenberg intro text
earnest_sub <- earnest[(start + 1):(end - 1)]
# Detect first act
lines_start <- which(str_detect(earnest_sub, fixed("FIRST ACT")))
# Set up index
intro_line_index <- 1:(lines_start - 1)
# Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]
Instruction 4:
# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)
# Detect start and end lines
start <- str_which(earnest, fixed("START OF THE PROJECT"))
end <- str_which(earnest, fixed("END OF THE PROJECT"))
# Get rid of gutenberg intro text
earnest_sub <- earnest[(start + 1):(end - 1)]
# Detect first act
lines_start <- str_which(earnest_sub, fixed("FIRST ACT"))
# Set up index
intro_line_index <- 1:(lines_start - 1)
# Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]
# Take a look at the first 20 lines
writeLines(play_text[1:20])
5.3 Identifying the lines, take 1
Instruction 1:
# Pattern for start word then .
pattern_1 <- START %R% one_or_more(WRD) %R% DOT
# Test pattern_1
str_view(play_lines, pattern = pattern_1,
match = TRUE) #to see matched lines
str_view(play_lines, pattern = pattern_1,
match = FALSE)
Instruction 2:
# Pattern for start, capital, word then .
pattern_2 <- START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT
# Test pattern_2
str_view(play_lines, pattern_2, match = TRUE)
str_view(play_lines, pattern_2, match = FALSE)
Instruction 3:
# Pattern from last step
pattern_2 <- START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT
# Get subset of lines that match
lines <- str_subset(play_lines, pattern = pattern_2)
# Extract match from lines
who <- str_extract(lines, pattern = pattern_2)
# Let's see what we have
unique(who)
5.4 Identifying the lines, take 2
Instruction 1:
# Create vector of characters
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen", "Chasuble",
"Merriman", "Lady Bracknell", "Miss Prism")
# Match start, then character name, then .
pattern_3 <- START %R% or1(characters) %R% DOT
# View matches of pattern_3
str_view(play_lines, pattern = pattern_3, match = TRUE)
# View non-matches of pattern_3
str_view(play_lines, pattern = pattern_3, match = FALSE)
Instruction 2:
# Variables from previous step
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen", "Chasuble",
"Merriman", "Lady Bracknell", "Miss Prism")
pattern_3 <- START %R% or1(characters) %R% DOT
# Pull out matches
lines <- str_subset(play_lines, pattern = pattern_3)
# Extract match from lines
who <- str_extract(lines, pattern = pattern_3)
# Let's see what we have
unique(who)
# Count lines per character
table(who)
5.5 A case study on case (video)
5.6 Changing case to ease matching
Instruction 1:
# catcidents has been pre-defined
head(catcidents)
# Construct pattern of DOG in boundaries
whole_dog_pattern <- whole_word("DOG")
# View matches to word "DOG"
str_view(catcidents, pattern = whole_dog_pattern, match = TRUE)
Instruction 2:
# From previous step
whole_dog_pattern <- whole_word("DOG")
# Transform catcidents to upper case
catcidents_upper <- str_to_upper(catcidents)
# View matches to word "DOG" again
str_view(catcidents_upper, pattern = whole_dog_pattern, match = TRUE)
Instruction 3:
# From previous steps
whole_dog_pattern <- whole_word("DOG")
catcidents_upper <- str_to_upper(catcidents)
# Which strings match?
has_dog <- str_detect(catcidents_upper, pattern = whole_dog_pattern)
# Pull out matching strings in original
catcidents[has_dog]
5.7 Ignoring case when matching
Instruction 1:
# View matches to "TRIP"
str_view(catcidents, pattern = "TRIP", match = TRUE)
# Construct case insensitive pattern
trip_pattern <- regex("TRIP", ignore_case = TRUE)
# View case insensitive matches to "TRIP"
str_view(catcidents, pattern = trip_pattern, match = TRUE)
Instruction 2:
# From previous step
trip_pattern <- regex("TRIP", ignore_case = TRUE)
# Get subset of matches
trip <- str_subset(catcidents, pattern = trip_pattern)
# Extract matches
str_extract(trip, pattern = trip_pattern)
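Wrapping a pattern in regex() with ignore_case = TRUE lets one pattern match every capitalization while the extracted text keeps its original case. A short sketch on made-up narratives (notes is hypothetical and not the catcidents data):

```r
library(stringr)

# Hypothetical narratives, standing in for catcidents
notes <- c("TRIPPED OVER CAT", "tripped on toy", "Trip hazard", "fell")

# Case-sensitive: only the upper-case occurrence matches
strict <- str_detect(notes, "TRIP")  # TRUE FALSE FALSE FALSE

# Case-insensitive: all three spellings match
trip_pattern <- regex("TRIP", ignore_case = TRUE)
loose <- str_detect(notes, trip_pattern)  # TRUE TRUE TRUE FALSE

# Extraction returns the text as it appears in the original string
found <- str_extract(notes, trip_pattern)  # "TRIP" "trip" "Trip" NA
```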
5.8 Fixing case problems
Instruction:
library(stringi)
# Get first five catcidents
cat5 <- catcidents[1:5]
# Take a look at original
writeLines(cat5)
# Transform to title case
writeLines(str_to_title(cat5))
# Transform to title case with stringi
writeLines(stri_trans_totitle(cat5)) #same
# Transform to sentence case with stringi
writeLines(stri_trans_totitle(cat5, type = "sentence"))
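Title case capitalizes the first letter of every word, while sentence case capitalizes only the first letter of each sentence. A minimal sketch on a made-up sentence, since the contents of catcidents are not shown here:

```r
library(stringr)
library(stringi)

s <- "the cat sat on the mat"  # hypothetical example text

# Title case: every word starts with a capital
title_case <- str_to_title(s)
title_case     # "The Cat Sat On The Mat"

# Sentence case via stringi: only the first word is capitalized
sentence_case <- stri_trans_totitle(s, type = "sentence")
sentence_case  # "The cat sat on the mat"
```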