Text-Mining-DataCamp-String Manipulation with stringr in R

1. String Basics

1.1 Welcome! (video)


1.2 Quotes

Following the guidelines for using quotes, define the three strings, line1, line2 and line3:

  • Line 1: The table was a large one, but the three were all crowded together at one corner of it:
  • Line 2: "No room! No room!" they cried out when they saw Alice coming.
  • Line 3: "There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.
# Define line1
line1 <- "The table was a large one, but the three were all crowded together at one corner of it:"

# Define line2
line2 <- '"No room! No room!" they cried out when they saw Alice coming.'

# Define line3
line3 <- '"There\'s plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.'
1.3 What you see isn’t always what you have

We’ve put your lines from Alice’s Adventures in Wonderland in a vector called lines.

  • Take a look at lines to see R’s representation of the strings.
  • Pass lines to writeLines() to see the content of strings you’ve created.
  • By default writeLines() separates the strings with a newline, which you can change using the sep argument. Write lines to the screen again, but this time set the sep argument to a space, " ".
  • Finally, try using writeLines() on the string "hello\n\U1F30D". You’ll learn about what’s going on here in the next exercise.
# Putting lines in a vector
lines <- c(line1, line2, line3)

# Print lines
lines

# Use writeLines() on lines
writeLines(lines)

# Write lines with a space separator
writeLines(lines, sep = " ")

# Use writeLines() on the string "hello\n\U1F30D"
writeLines("hello\n\U1F30D")
1.4 Escape sequences


  • Edit the string inside writeLines() so that it correctly displays (all on one line):
To have a \ you need \\
  • Edit the string inside writeLines() so that it correctly displays (with the line breaks in these positions)
This is a really 
really really 
long string
  • Try writeLines() with the string containing Unicode characters:
"\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e". You just said "Hello World" in Hindi!
# Should display: To have a \ you need \\
writeLines("To have a \\ you need \\\\")

# Should display: 
# This is a really 
# really really 
# long string
writeLines("This is a really \nreally \nreally \nlong string")

# Use writeLines() with 
# "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"
writeLines("\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e")
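One way to confirm what an escape sequence really stores is to count characters. The following is a minimal sketch with made-up strings (`backslash` and `newline` are illustrative names, not part of the exercise):

```r
# Hypothetical example strings, not taken from the exercise
backslash <- "a\\b"   # typed with two backslashes, stored as one
newline   <- "a\nb"   # \n is stored as a single newline character

print(backslash)      # printed representation keeps the escape
writeLines(backslash) # actual content of the string

# Each string holds exactly three characters
nchar(backslash)
nchar(newline)
```

Comparing `print()` with `writeLines()` on the same string is usually the quickest check when an escape does not behave as expected.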
1.5 Turning numbers into strings (video)
1.6 Using format() with numbers


  • Format c(0.0011, 0.011, 1) with digits = 1. This is like the example described above.
  • Now, format c(1.0011, 2.011, 1) with digits = 1. Try to predict what you might get before you try it.
  • Format percent_change by choosing the digits argument so that the values are presented with one place after the decimal point.
  • Format income by choosing the digits argument so that the values are presented as whole numbers (i.e. no places after the decimal point).
  • Format p_values using a fixed representation.
# Some vectors of numbers
percent_change  <- c(4, -1.91, 3.00, -5.002)
income <-  c(72.19, 1030.18, 10291.93, 1189192.18)
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)

# Format c(0.0011, 0.011, 1) with digits = 1
format(c(0.0011, 0.011, 1), digits = 1)

# Format c(1.0011, 2.011, 1) with digits = 1
format(c(1.0011, 2.011, 1), digits = 1)

# Format percent_change to one place after the decimal point
format(percent_change, digits = 2)

# Format income to whole numbers
format(income, digits = 2)

# Format p_values in fixed format
format(p_values, scientific = FALSE)
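The effect of scientific = FALSE is easiest to see on a single small number. A minimal sketch, where `tiny` is a made-up value of the same magnitude as the smallest p-value above:

```r
# Hypothetical value, same magnitude as the smallest p-value above
tiny <- 0.00000000002

default_repr <- format(tiny)                      # R chooses scientific notation here
fixed_repr   <- format(tiny, scientific = FALSE)  # force fixed notation

default_repr
fixed_repr
```

Fixed notation makes every value in a vector as wide as its longest element, which is why it is usually reserved for display rather than storage.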
1.7 Controlling other aspects of the string

We’ve assigned your formatted income from the previous exercise to formatted_income.

  • Print formatted_income. Notice the spaces at the start of the strings.
  • Call writeLines() on the formatted income. Notice how the numbers line up on the decimal point.
  • Define trimmed_income by using format() on income with digits = 2 and trim = TRUE.
  • Call writeLines() on trimmed_income. Notice how this removes the spaces at the start of the strings and the values line up on the left.
  • Define pretty_income by using format() on income with digits = 2 and big.mark = ",".
  • Call writeLines() on pretty_income.
formatted_income <- format(income, digits = 2)

# Print formatted_income
formatted_income

# Call writeLines() on the formatted income
writeLines(formatted_income)

# Define trimmed_income
trimmed_income <- format(income, digits = 2, trim = TRUE)

# Call writeLines() on the trimmed_income
writeLines(trimmed_income)

# Define pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",")

# Call writeLines() on the pretty_income
writeLines(pretty_income)
1.8 formatC()


The vectors income, percent_change, and p_values are available in your workspace.

  • First, compare the behavior of formatC() to format() by calling formatC() on x with format = "f" and digits = 1. This is the same vector you used with format(). Do you see the difference?
  • Call formatC() on y with format = "f" and digits = 1. Notice how digits has consistent behavior regardless of the vector you format.
  • Format percent_change to one place after the decimal point.
  • Format percent_change to one place after the decimal point and add flag = "+". This forces the display of the sign.
  • Format p_values using format = "g" and digits = 2. This can be useful, since if there are any p-values in scientific notation, they must be < 0.0001.
# From the format() exercise
x <- c(0.0011, 0.011, 1)
y <- c(1.0011, 2.011, 1)

# formatC() on x with format = "f", digits = 1
formatC(x, format = "f", digits = 1)

# formatC() on y with format = "f", digits = 1
formatC(y, format = "f", digits = 1)

# Format percent_change to one place after the decimal point
formatC(percent_change, format = "f", digits = 1)

# percent_change with flag = "+"
formatC(percent_change, format = "f", digits = 1, flag = "+")

# Format p_values using format = "g" and digits = 2
formatC(p_values, format = "g", digits = 2)
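The difference between the two functions comes down to what digits means. A minimal sketch, reusing the vector `x` from the exercise (the names `by_format`, `by_formatC` and `signed` are illustrative):

```r
x <- c(0.0011, 0.011, 1)

# format(): digits is a minimum number of significant digits for the whole vector
by_format <- format(x, digits = 1)

# formatC() with format = "f": digits is the number of decimal places
by_formatC <- formatC(x, format = "f", digits = 1)   # "0.0" "0.0" "1.0"

# flag = "+" forces a sign on positive values as well
signed <- formatC(c(4, -1.91), format = "f", digits = 1, flag = "+")  # "+4.0" "-1.9"

by_format
by_formatC
signed
```

Because formatC() applies digits to each element independently, its output does not change when other values are added to the vector.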
1.9 Putting strings together (video)
1.10 Annotation of numbers

We’ve put the formatted vectors pretty_income and pretty_percent in your workspace along with years.

  • Paste a $ to the front of each value in pretty_income, using sep = "" so there is no space between the $ and the value.
  • Paste a % to the end of each value in pretty_percent, using sep = "" so there is no space between the value and the %.
  • years contains the year each pretty_percent corresponds to. Use paste() to produce a vector with elements like 2010: +4.0% and assign it to year_percent.
  • Use paste() with year_percent to create a single string that collapses all the years: 2010: +4.0%, 2011: -1.9%, 2012: +3.0%, 2013: -5.0%
# Add $ to pretty_income
paste("$", pretty_income, sep = "")

# Add % to pretty_percent
paste(pretty_percent, "%", sep = "")

# Create vector with elements like 2010: +4.0%
year_percent <- paste(years, ": ", pretty_percent, "%", sep = "")

# Collapse all years into single string
paste(year_percent, collapse = ", ")
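The roles of sep and collapse are easy to confuse: sep joins the arguments element by element, while collapse joins the resulting elements into one string. A minimal sketch with made-up inputs (`demo_years` and `demo_percent` are hypothetical stand-ins for years and pretty_percent):

```r
# Hypothetical stand-ins for the workspace vectors
demo_years   <- c(2010, 2011)
demo_percent <- c("+4.0", "-1.9")

# sep joins the pieces within each element: the result has length 2
by_element <- paste(demo_years, ": ", demo_percent, "%", sep = "")

# collapse joins the elements themselves: the result has length 1
one_string <- paste(by_element, collapse = ", ")

by_element
one_string
```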
1.11 A very simple table

The income vector is loaded in your workspace.

  • Create pretty_income by using format() with digits = 2 and big.mark = ",".
  • Create dollar_income by pasting $ to pretty_income (don’t forget to set the sep argument).
  • Create formatted_names by using format() on income_names with justify = "right".
  • Create rows by pasting together formatted_names and dollar_income. Use three spaces as a separator to give some room between your columns. Be sure to surround your separator in ".
  • Call writeLines() on rows to see your table.


# Define the names vector
income_names <- c("Year 0", "Year 1", "Year 2", "Project Lifetime")

# Create pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",")

# Create dollar_income
dollar_income <- paste("$", pretty_income, sep = "")

# Create formatted_names
formatted_names <- format(income_names, justify = "right")

# Create rows
rows <- paste(formatted_names, dollar_income, sep = "   ")

# Write rows
writeLines(rows)
1.12 Let’s order pizza!


  • Print my_toppings to see your random toppings.
  • Add "and " to the start of the third element by using paste() with my_toppings and a vector you define.
  • Create a vector these_toppings by using paste() to collapse my_toppings_and with a comma and space between each element.
  • Create my_order by pasting "I want to order a pizza with " to these_toppings and ending with a period, ".".
  • Order your pizza by calling writeLines() on my_order.
  • Try re-running all your code (including the sampling of toppings). You should get a brand new pizza order!
# Randomly sample 3 toppings
my_toppings <- sample(toppings, size = 3)

# Print my_toppings
my_toppings

# Paste "and " to last element: my_toppings_and
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")

# Collapse with comma space: these_toppings
these_toppings <- paste(my_toppings_and, collapse = ", ")

# Add rest of sentence: my_order
my_order <- paste("I want to order a pizza with ", these_toppings, ".", sep = "")

# Order pizza with writeLines()
writeLines(my_order)

2. Introduction to stringr

2.1 Introducing stringr (video)
2.2 Putting strings together with stringr



library(stringr)

my_toppings <- c("cheese", NA, NA)
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")

# Print my_toppings_and
my_toppings_and

# Use str_c() instead of paste(): my_toppings_str
my_toppings_str <- str_c(c("", "", "and "), my_toppings, sep = "")

# Print my_toppings_str
my_toppings_str

# paste() my_toppings_and with collapse = ", "
paste(my_toppings_and, collapse = ", ")

# str_c() my_toppings_str with collapse = ", "
str_c(my_toppings_str, collapse = ", ")
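The practical difference between paste() and str_c() is how missing values are treated. A minimal sketch, assuming stringr is installed (`demo_toppings` is a made-up vector):

```r
library(stringr)

# Hypothetical input with one missing value
demo_toppings <- c("cheese", NA)

# paste() converts NA to the literal text "NA"
with_paste <- paste("and ", demo_toppings, sep = "")   # "and cheese" "and NA"

# str_c() propagates the missing value instead
with_str_c <- str_c("and ", demo_toppings)             # "and cheese" NA

# With collapse, a single NA makes the whole str_c() result NA
collapsed <- str_c(demo_toppings, collapse = ", ")

with_paste
with_str_c
collapsed
```

Propagating NA makes missing data visible early, rather than letting the text "NA" slip silently into the output.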
2.3 String length



library(babynames)
library(dplyr)

# Extracting vectors for boys' and girls' names
babynames_2014 <- filter(babynames, year == 2014)
boy_names <- filter(babynames_2014, sex == "M")$name
girl_names <- filter(babynames_2014, sex == "F")$name

# Take a look at a few boy_names
head(boy_names)

# Find the length of all boy_names
boy_length <- str_length(boy_names)

# Take a look at a few lengths
head(boy_length)

# Find the length of all girl_names
girl_length <- str_length(girl_names)

# Find the difference in mean length
mean(girl_length) - mean(boy_length)

# Confirm str_length() works with factors
head(str_length(factor(boy_names)))
2.4 Extracting substrings


# Extract first letter from boy_names
boy_first_letter <- str_sub(boy_names, 1, 1)

# Tabulate occurrences of boy_first_letter
table(boy_first_letter)

# Extract the last letter in boy_names, then tabulate
boy_last_letter <- str_sub(boy_names, -1, -1)
table(boy_last_letter)

# Extract the first letter in girl_names, then tabulate
girl_first_letter <- str_sub(girl_names, 1, 1)
table(girl_first_letter)

# Extract the last letter in girl_names, then tabulate
girl_last_letter <- str_sub(girl_names, -1, -1)
table(girl_last_letter)
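Negative positions in str_sub() count from the end of the string, which is why -1, -1 picks out the last letter. A minimal sketch on a single made-up name:

```r
library(stringr)

# Hypothetical single name
demo_name <- "Sophia"

first_letter <- str_sub(demo_name, 1, 1)    # "S"
last_letter  <- str_sub(demo_name, -1, -1)  # "a"
last_two     <- str_sub(demo_name, -2, -1)  # "ia"

c(first_letter, last_letter, last_two)
```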
2.5 Hunting for matches (video)
2.6 Detecting matches


# Look for pattern "zz" in boy_names
contains_zz <- str_detect(boy_names, pattern = fixed("zz"))

# Examine str() of contains_zz
str(contains_zz)

# How many names contain "zz"?
sum(contains_zz)

# Which names contain "zz"?
boy_names[contains_zz]

# Which rows in boy_df have names that contain "zz"?
boy_df <- filter(babynames_2014, sex == "M")
boy_df[contains_zz, ]
2.7 Subsetting strings based on match


# Find boy_names that contain "zz"
str_subset(boy_names, pattern = fixed("zz"))

# Find girl_names that contain "zz"
str_subset(girl_names, pattern = fixed("zz"))

# Find girl_names that contain "U"
starts_U <- str_subset(girl_names, pattern = fixed("U"))

# Find girl_names that contain "U" and "z"
str_subset(starts_U, pattern = "z")
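Wrapping a pattern in fixed() matters whenever it contains a character that regular expressions treat specially. A minimal sketch with made-up strings (`demo_strings` is hypothetical):

```r
library(stringr)

# Hypothetical strings: only the first contains a literal period
demo_strings <- c("a.b", "axb")

# As a regular expression, "." matches any character
as_regex <- str_detect(demo_strings, ".")          # TRUE TRUE

# With fixed(), "." matches only a literal period
as_fixed <- str_detect(demo_strings, fixed("."))   # TRUE FALSE

as_regex
as_fixed
```

For plain letters such as "zz" both forms return the same matches; fixed() simply states the intent and skips regex interpretation.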
2.8 Counting matches


# Count occurrences of "a" in girl_names
number_as <- str_count(girl_names, pattern = fixed("a"))

# Count occurrences of "A" in girl_names
number_As <- str_count(girl_names, pattern = fixed("A"))

# Histograms of number_as and number_As
hist(number_as)
hist(number_As)

# Find total "a" + "A"
total_as <- number_As + number_as

# girl_names with more than 4 a's
girl_names[total_as > 4]
2.9 Splitting strings (video)
2.10 Parsing strings into variables

Instruction 1:

# Some date data
date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")

# Split dates using " - "
split_dates <- str_split(date_ranges, pattern = fixed(" - "))

Instruction 2:

# Some date data
date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")

# Split dates with n and simplify specified
split_dates_n <- str_split(date_ranges, pattern = fixed(" - "), simplify = TRUE, n = 2)

Instruction 3:

# From previous step
date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")
split_dates_n <- str_split(date_ranges, fixed(" - "), n = 2, simplify = TRUE)

# Subset split_dates_n into start_dates and end_dates
start_dates <- split_dates_n[, 1]
end_dates <- split_dates_n[, 2]

# Split start_dates into day, month and year pieces
str_split(start_dates, pattern = fixed("."), simplify = TRUE)

Instruction 4:

both_names <- c("Box, George", "Cox, David")

# Split both_names into first_names and last_names
both_names_split <- str_split(both_names, pattern = fixed(", "), simplify = TRUE)

# Get first names
first_names <- both_names_split[, 2]

# Get last names
last_names <- both_names_split[, 1]
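The n and simplify arguments change the shape of what str_split() returns. A minimal sketch with made-up input (`demo_dates` and `demo_range` are hypothetical):

```r
library(stringr)

demo_dates <- c("23.01.2017", "30.01.2017")

# Default: a list with one character vector per input string
as_list <- str_split(demo_dates, fixed("."))

# simplify = TRUE: a character matrix, one row per input string
as_matrix <- str_split(demo_dates, fixed("."), simplify = TRUE)
dim(as_matrix)        # 2 rows, 3 columns
as_matrix[, 2]        # the month column

# n caps the number of pieces; the remainder stays in the last piece
demo_range <- "a - b - c"
capped <- str_split(demo_range, fixed(" - "), n = 2)[[1]]   # "a" "b - c"
capped
```

The matrix form is convenient when every string splits into the same number of pieces, since columns can then be pulled out by index.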
2.11 Some simple text statistics


# Split lines into words
words <- str_split(lines, pattern = fixed(" "))

# Number of words per line
lapply(words, length)
# Number of characters in each word
word_lengths <- lapply(words, str_length)
# Average word length per line
lapply(word_lengths, mean)
2.12 Replacing matches in strings (video)
2.13 Replacing to tidy strings

Instruction 1:

# Some IDs
ids <- c("ID#: 192", "ID#: 118", "ID#: 001")

# Replace "ID#: " with ""
id_nums <- str_replace(ids, "ID#: ", "")

# Turn id_nums into numbers
id_ints <- as.numeric(id_nums)

Instruction 2:

# Some (fake) phone numbers
phone_numbers <- c("510-555-0123", "541-555-0167")

# Use str_replace() to replace "-" with " "
str_replace(phone_numbers, "-", " ")

# Use str_replace_all() to replace "-" with " "
str_replace_all(phone_numbers, "-", " ")

# Turn phone numbers into the format xxx.xxx.xxxx
str_replace_all(phone_numbers, "-", ".")
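The two replacement functions differ only in how many matches they touch per string. A minimal sketch on one of the phone numbers above (`demo_phone` is an illustrative name):

```r
library(stringr)

demo_phone <- "510-555-0123"

# str_replace() changes only the first match in each string
first_only <- str_replace(demo_phone, "-", " ")      # "510 555-0123"

# str_replace_all() changes every match
every_one <- str_replace_all(demo_phone, "-", " ")   # "510 555 0123"

# Converting a cleaned ID to a number drops leading zeros
as.numeric("001")

c(first_only, every_one)
```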
2.14 Review


# Find the number of nucleotides in each sequence
str_length(genes)

# Find the number of A's that occur in each sequence
str_count(genes, pattern = fixed("A"))

# Return the sequences that contain "TTTTTT"
str_subset(genes, pattern = fixed("TTTTTT"))

# Replace all the "A"s in the sequences with a "_"
str_replace_all(genes, pattern = fixed("A"), replacement = "_")
2.15 Final challenges

Instruction 1:

# Define some full names
names <- c("Diana Prince", "Clark Kent")

# Split into first and last names
names_split <- str_split(names, pattern = fixed(" "), simplify = TRUE)

# Extract the first letter in the first name
abb_first <- str_sub(names_split[, 1], 1, 1)

# Combine the first letter ". " and last name
str_c(abb_first,". ", names_split[,2])

Instruction 2:

# Use all names in babynames_2014
all_names <- babynames_2014$name

# Get the last two letters of all_names
last_two_letters <- str_sub(all_names, -2, -1)

# Does the name end in "ee"?
ends_in_ee <- str_detect(last_two_letters, pattern = fixed("ee"))

# Extract rows and "sex" column
sex <- babynames_2014$sex[ends_in_ee]

# Display result as a table
table(sex)

3. Pattern Matching with Regular Expressions

3.1 Regular expressions (video)
3.2 Matching the start or end of the string

Instruction 1:

library(rebus)

# Some strings to practice with
x <- c("cat", "coat", "scotland", "tic toc")

# Print END
END

# Run me
str_view(x, pattern = START %R% "c")

Instruction 2:

# Match the strings that start with "co" 
str_view(x, pattern = START %R% "co")

Instruction 3:

# Match the strings that end with "at"
str_view(x, pattern = "at" %R% END)

Instruction 4:

# Match the string that is exactly "cat"
str_view(x, pattern = START %R% "cat" %R% END)
3.3 Matching any character

Instruction 1:

# Match two characters, where the second is a "t"
str_view(x, pattern = ANY_CHAR %R% "t")

Instruction 2:

# Match a "t" followed by any character
str_view(x, pattern = "t" %R% ANY_CHAR)

Instruction 3:

# Match two characters
str_view(x, pattern = ANY_CHAR %R% ANY_CHAR)

Instruction 4:

# Match a string with exactly three characters
str_view(x, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% END)
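The rebus helpers build ordinary regular expression strings, so the same matches can be written by hand. A minimal sketch, assuming START, END and ANY_CHAR correspond to the standard regex tokens ^, $ and . (the raw patterns below are written out manually rather than produced by rebus):

```r
library(stringr)

x <- c("cat", "coat", "scotland", "tic toc")

# START %R% "c"  is assumed equivalent to  "^c"
starts_with_c <- str_detect(x, "^c")      # TRUE TRUE FALSE FALSE

# "at" %R% END  is assumed equivalent to  "at$"
ends_with_at <- str_detect(x, "at$")      # TRUE TRUE FALSE FALSE

# START %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% END  ~  "^...$"
exactly_three <- str_detect(x, "^...$")   # TRUE FALSE FALSE FALSE

data.frame(x, starts_with_c, ends_with_at, exactly_three)
```

Using str_detect() instead of str_view() returns a logical vector, which is easier to check programmatically than an HTML view.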
3.4 Combining with stringr functions

Instruction 1:

pattern <- "q" %R% ANY_CHAR

# Find names that have the pattern
names_with_q <- str_subset(boy_names, pattern)

# How many names were there?
length(names_with_q)

Instruction 2:

# Find part of name that matches pattern
part_with_q <- str_extract(boy_names, pattern)

# Get a table of counts
table(part_with_q)

Instruction 3:

# Did any names have the pattern more than once?
count_of_q <- str_count(boy_names, pattern)

# Get a table of counts
table(count_of_q)

Instruction 4:

# Which babies got these names? (get logical vector back)
with_q <- str_detect(boy_names, pattern)

# What fraction of babies got these names? (get mean)
mean(with_q)
3.5 More regular expressions (video)
3.6 Alternation

Instruction 1:

# Match Jeffrey or Geoffrey
whole_names <- or("Jeffrey", "Geoffrey")
str_view(boy_names, pattern = whole_names, 
  match = TRUE) 

Instruction 2:

# Match Jeffrey or Geoffrey, another way
common_ending <- or("Je", "Geo") %R% "ffrey"
str_view(boy_names, pattern = common_ending, 
  match = TRUE)

Instruction 3:

# Match with alternate endings
by_parts <- or("Je", "Geo") %R% "ff" %R% or("ry", "ery", "rey", "erey")
str_view(boy_names, 
  pattern = by_parts, 
  match = TRUE)

Instruction 4:

# Match names that start with Cath or Kath
ckath <- or("C", "K") %R% "ath"
str_view(girl_names, pattern = ckath, match = TRUE)
3.7 Character classes

Instruction 1:

# Create character class containing vowels
vowels <- char_class("aeiouAEIOU")

# Print vowels
vowels

# See vowels in x with str_view()
str_view(x, vowels)

Instruction 2:

# See vowels in x with str_view_all()
str_view_all(x, vowels)

Instruction 3:

# Number of vowels in boy_names
num_vowels <-  str_count(boy_names, vowels)

# Number of characters in boy_names
name_length <- str_length(boy_names)

Instruction 4:

# Calc mean number of vowels
mean(num_vowels)

# Calc mean fraction of vowels per name
mean(num_vowels / name_length)
3.8 Repetition

Instruction 1:

# Vowels from last exercise
vowels <- char_class("aeiouAEIOU")

# See names with only vowels
str_view(boy_names, 
  pattern = exactly(one_or_more(vowels)), 
  match = TRUE)

Instruction 2:

# Use `negated_char_class()` for everything but vowels
not_vowels <- negated_char_class("aeiouAEIOU")

# See names with no vowels
str_view(boy_names, 
  pattern = exactly(one_or_more(not_vowels)), 
  match = TRUE)
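Alternation, character classes and repetition can also be written directly as regex strings. A minimal sketch, assuming or(), char_class(), negated_char_class(), one_or_more() and exactly() expand to the usual (?:a|b), [...], [^...], + and ^...$ constructs (the names tested are made up):

```r
library(stringr)

# Hypothetical names for illustration
demo_names <- c("Jeffrey", "Geoffrey", "Io", "Ng")

# or("Je", "Geo") %R% "ffrey"
jeff <- str_detect(demo_names, "(?:Je|Geo)ffrey")            # TRUE TRUE FALSE FALSE

# exactly(one_or_more(char_class("aeiouAEIOU")))
only_vowels <- str_detect(demo_names, "^[aeiouAEIOU]+$")     # FALSE FALSE TRUE FALSE

# exactly(one_or_more(negated_char_class("aeiouAEIOU")))
no_vowels <- str_detect(demo_names, "^[^aeiouAEIOU]+$")      # FALSE FALSE FALSE TRUE

data.frame(demo_names, jeff, only_vowels, no_vowels)
```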
3.9 Shortcuts (video)
3.10 Hunting for phone numbers

Instruction 1:

# Create a three digit pattern and test
three_digits <- DGT %R% DGT %R% DGT

# Test it
str_view_all(contact, pattern = three_digits)

Instruction 2:

# Create a separator pattern and test
separator <-  char_class("-.() ")

# Test it
str_view_all(contact, pattern = separator)

Instruction 3:

# Use these components
three_digits <- DGT %R% DGT %R% DGT
four_digits <- three_digits %R% DGT
separator <- char_class("-.() ")

# Create phone pattern
phone_pattern <- optional(OPEN_PAREN) %R%
  three_digits %R%
  zero_or_more(separator) %R%
  three_digits %R% 
  zero_or_more(separator) %R%
  four_digits

# Test pattern
str_view_all(contact, phone_pattern)

Instruction 4:

# Use this pattern
three_digits <- DGT %R% DGT %R% DGT
four_digits <- three_digits %R% DGT
separator <- char_class("-.() ")
phone_pattern <- optional(OPEN_PAREN) %R% 
  three_digits %R% 
  zero_or_more(separator) %R% 
  three_digits %R% 
  zero_or_more(separator) %R%
  four_digits

# Extract phone numbers
str_extract(contact, phone_pattern)

# Extract ALL phone numbers
str_extract_all(contact, phone_pattern)
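The same phone pattern can be sketched as a hand-written regex, which also shows the difference between str_extract() and str_extract_all(). This assumes the rebus pieces expand to `\(?`, `\d{3}`, `[-.() ]*` and `\d{4}`; the strings in `demo_contact` are made up:

```r
library(stringr)

# Hypothetical contact text
demo_contact <- c("Call me at (555) 555-0191",
                  "555.555.0122 or 555 555 0133")

# optional "(", three digits, separators, three digits, separators, four digits
raw_phone <- "\\(?\\d{3}[-.() ]*\\d{3}[-.() ]*\\d{4}"

# First match per string: a character vector
first_match <- str_extract(demo_contact, raw_phone)

# Every match per string: a list of character vectors
all_matches <- str_extract_all(demo_contact, raw_phone)

first_match
all_matches
```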
3.11 Extracting age and gender from accident narratives

Instruction 1:

# Pattern to match one or two digits
age <- DGT %R% optional(DGT)

# Test it
str_view(narratives, pattern = age)

Instruction 2:

# Use this pattern
age <- DGT %R% optional(DGT)

# Pattern to match units 
unit <- optional(SPC) %R% or("YO", "YR", "MO")

# Test pattern with age then units
str_view(narratives, 
         pattern = age %R% unit)

Instruction 3:

# Use these patterns
age <- DGT %R% optional(DGT)
unit <- optional(SPC) %R% or("YO", "YR", "MO")

# Pattern to match gender
gender <- optional(SPC) %R% char_class("MF")

# Test pattern with age then units then gender
str_view(narratives, 
         pattern = age %R% unit %R% gender)

Instruction 4:

# Use these patterns
age <- DGT %R% optional(DGT)
unit <- optional(SPC) %R% or("YO", "YR", "MO")
gender <- optional(SPC) %R% or("M", "F")

# Extract age_gender, take a look
age_gender <- str_extract(narratives, pattern = age %R% unit %R% gender)
age_gender
3.12 Parsing age and gender into pieces

Instruction 1:

# age_gender, age, gender, unit are pre-defined

# Extract age and make numeric
as.numeric(str_extract(age_gender, age))

Instruction 2:

# Replace age and units with ""
genders <- str_remove(age_gender, pattern = age %R% unit)

# Replace extra spaces
str_remove_all(genders, pattern = one_or_more(SPC))

Instruction 3:

# Numeric ages, from previous step
ages_numeric <- as.numeric(str_extract(age_gender, age))

# Extract units 
time_units <- str_extract(age_gender, unit)

# Extract first word character
time_units_clean <- str_extract(time_units, WRD)

# Turn ages in months to years
ifelse(time_units_clean == "Y", ages_numeric, ages_numeric / 12)

4. More Advanced Matching and Manipulation

4.1 Capturing (video)
4.2 Capturing parts of a pattern

Instruction 1:

# Capture part between @ and . and after .
email <- capture(one_or_more(WRD)) %R% 
  "@" %R% capture(one_or_more(WRD)) %R% 
  DOT %R% capture(one_or_more(WRD))

# Check match hasn't changed
str_view(hero_contacts, pattern = email)

Instruction 2:

# Pattern from previous step
email <- capture(one_or_more(WRD)) %R% 
  "@" %R% capture(one_or_more(WRD)) %R% 
  DOT %R% capture(one_or_more(WRD))
# Pull out match and captures
email_parts <- str_match(hero_contacts, pattern = email)

# Save host
host <- email_parts[, 3]
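str_match() returns a matrix: the first column holds the whole match and each later column holds one capture group, in order. A minimal sketch with a hand-written regex and made-up input (`demo_contacts` is hypothetical):

```r
library(stringr)

# Hypothetical contacts: one with an email address, one without
demo_contacts <- c("bruce@wayne.com", "no email here")

# Three capture groups: user, host, top-level domain
parts <- str_match(demo_contacts, "(\\w+)@(\\w+)\\.(\\w+)")

dim(parts)     # 2 rows, 4 columns
parts[1, ]     # whole match, then the three captures
parts[, 3]     # host column; NA where there was no match
```

This is why the host sits in column 3 rather than column 2: column 1 is always the complete match.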
4.3 Pulling out parts of a phone number

Instruction 1:

# View text containing phone numbers
contact

# Add capture() to get digit parts
phone_pattern <- capture(three_digits) %R% zero_or_more(separator) %R% 
           capture(three_digits) %R% zero_or_more(separator) %R%
           capture(four_digits)

# Pull out the parts with str_match()
phone_numbers <- str_match(contact, phone_pattern)

# Put them back together
str_c(
  "(",
  phone_numbers[, 2],
  ") ",
  phone_numbers[, 3],
  "-",
  phone_numbers[, 4])
4.4 Extracting age and gender again

Instruction 1:

# narratives has been pre-defined

# Add capture() to get age, unit and sex
pattern <- capture(optional(DGT) %R% DGT) %R%  
  optional(SPC) %R% capture(or("YO", "YR", "MO")) %R%
  optional(SPC) %R% capture(or("M", "F"))

# Pull out from narratives
str_match(narratives, pattern = pattern)

Instruction 2:

# Edit to capture just Y and M in units
pattern2 <- capture(optional(DGT) %R% DGT) %R%  
  optional(SPC) %R% capture(or("Y", "M")) %R% optional(or("O","R")) %R%
  optional(SPC) %R% capture(or("M", "F"))

# Check pattern
str_view(narratives, pattern = pattern2)

# Pull out pieces
str_match(narratives, pattern = pattern2)
4.5 Backreferences (video)
4.6 Using backreferences in patterns

Instruction 1:

# See names with three repeated letters
repeated_three_times <- capture(LOWER) %R% REF1 %R% REF1

# Test it
str_view(boy_names, pattern = repeated_three_times, match = TRUE)

Instruction 2:

# See names with a pair of repeated letters, e.g. abab
pair_of_repeated <- capture(LOWER %R% LOWER) %R% REF1

# Test it
str_view(boy_names, pattern = pair_of_repeated, match = TRUE)

Instruction 3:

# See names with a pair that reverses, e.g. abba
pair_that_reverses <- capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1

# Test it
str_view(boy_names, pattern = pair_that_reverses, match = TRUE)

Instruction 4:

# Four letter palindrome names
four_letter_palindrome <- exactly(
  capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1
)

# Test it
str_view(boy_names, pattern = four_letter_palindrome, match = TRUE)
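Backreferences are easier to follow in raw regex form, where `\\1` and `\\2` refer to the first and second capture group. A minimal sketch, assuming REF1 and REF2 correspond to those tokens and using [a-z] in place of LOWER (the test strings are made up):

```r
library(stringr)

# Hypothetical lowercase strings
demo_words <- c("anna", "abab", "aaab")

# capture(LOWER) %R% REF1 %R% REF1: one letter three times in a row
triple <- str_detect(demo_words, "([a-z])\\1\\1")                   # FALSE FALSE TRUE

# capture(LOWER %R% LOWER) %R% REF1: a pair repeated, e.g. abab
pair_repeated <- str_detect(demo_words, "([a-z][a-z])\\1")          # FALSE TRUE FALSE

# capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1: a pair reversed, e.g. abba
pair_reversed <- str_detect(demo_words, "([a-z])([a-z])\\2\\1")     # TRUE FALSE FALSE

data.frame(demo_words, triple, pair_repeated, pair_reversed)
```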
4.7 Replacing with regular expressions


# View text containing phone numbers
contact

# Replace digits with "X"
str_replace(contact, pattern = DGT, replacement = "X")

# Replace all digits with "X"
str_replace_all(contact, pattern = DGT, replacement = "X")

# Replace all digits with different symbol
str_replace_all(contact, pattern = DGT, 
  replacement = c("X", ".", "*", "_"))
4.8 Replacing with backreferences


# Build pattern to match words ending in "ING"
pattern <- one_or_more(WRD) %R% "ING"
str_view(narratives, pattern)

# Test replacement
str_replace(narratives, capture(pattern), 
  str_c("CARELESSLY", REF1, sep = " "))

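# One adverb per narrative
adverbs_10 <- sample(adverbs, 10)

# Replace each "ING" word with an adverb followed by the word
# (assumes narratives has 10 elements, one per sampled adverb)
str_replace(narratives, capture(pattern), 
  str_c(adverbs_10, REF1, sep = " "))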
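In a replacement string, a backreference inserts whatever the capture group matched. A minimal sketch with a hand-written regex and a made-up narrative (`demo_narrative` is hypothetical; `\\1` plays the role of REF1):

```r
library(stringr)

# Hypothetical narrative text
demo_narrative <- "FELL WHILE RUNNING"

# Capture the first word ending in ING and put an adverb in front of it
rewritten <- str_replace(demo_narrative, "(\\w+ING)", "CARELESSLY \\1")

rewritten   # "FELL WHILE CARELESSLY RUNNING"
```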
4.9 Unicode and pattern matching (video)
4.10 Matching a specific code point or code groups


library(stringi)

# Names with builtin accents
(tay_son_builtin <- c(
  "Nguy\u1ec5n Nh\u1ea1c", 
  "Nguy\u1ec5n Hu\u1ec7",
  "Nguy\u1ec5n Quang To\u1ea3n"
))
# Convert to separate accents
tay_son_separate <- stri_trans_nfd(tay_son_builtin)

# Verify that the string prints the same
tay_son_separate

# Match all accents
str_view_all(tay_son_separate, pattern = UP_DIACRITIC)
4.11 Matching a single grapheme

Instruction 1:

# tay_son_separate has been pre-defined

# View all the characters in tay_son_separate
str_view_all(tay_son_separate, pattern = ANY_CHAR)

Instruction 2:

# View all the graphemes in tay_son_separate
str_view_all(tay_son_separate, pattern =  GRAPHEME)

Instruction 3:

# Combine the diacritics with their letters
tay_son_builtin <- stri_trans_nfc(tay_son_separate)

# View all the graphemes in tay_son_builtin
str_view_all(tay_son_builtin, pattern = GRAPHEME)
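The difference between code points and graphemes can be checked numerically. A minimal sketch using stringi on a single accented letter from the names above (`composed` and `decomposed` are illustrative names):

```r
library(stringi)

# One precomposed character: e with circumflex and tilde
composed <- "\u1ec5"

# NFD splits it into a base letter plus two combining marks
decomposed <- stri_trans_nfd(composed)

stri_length(composed)      # 1 code point
stri_length(decomposed)    # 3 code points

# Both forms are a single grapheme (user-perceived character)
stri_count_boundaries(decomposed, type = "character")

# NFC recombines the pieces into the original character
recombined <- stri_trans_nfc(decomposed)
recombined == composed
```

This is why ANY_CHAR finds more matches than GRAPHEME in the decomposed strings: it matches each code point separately.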

5. Case Studies

5.1 A case study, reading a play (video)
5.2 Getting the play into R

Instruction 1:

# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)

Instruction 2:

# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)

# Detect start and end lines
start <- which(str_detect(earnest, fixed("START OF THE PROJECT")))
end <- which(str_detect(earnest, fixed("END OF THE PROJECT")))

# Get rid of gutenberg intro text
earnest_sub  <- earnest[(start + 1):(end - 1)]

Instruction 3:

# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)

# Detect start and end lines
start <- str_which(earnest, fixed("START OF THE PROJECT"))
end <- str_which(earnest, fixed("END OF THE PROJECT"))

# Get rid of gutenberg intro text
earnest_sub  <- earnest[(start + 1):(end - 1)]

# Detect first act

lines_start <- which(str_detect(earnest_sub, fixed("FIRST ACT")))

# Set up index
intro_line_index <- 1:(lines_start - 1)

# Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]

Instruction 4:

# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)

# Detect start and end lines
start <- str_which(earnest, fixed("START OF THE PROJECT"))
end <- str_which(earnest, fixed("END OF THE PROJECT"))

# Get rid of gutenberg intro text
earnest_sub  <- earnest[(start + 1):(end - 1)]

# Detect first act
lines_start <- str_which(earnest_sub, fixed("FIRST ACT"))

# Set up index
intro_line_index <- 1:(lines_start - 1)

# Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]

# Take a look at the first 20 lines
writeLines(play_text[1:20])
5.3 Identifying the lines, take 1

Instruction 1:

# Pattern for start word then .
pattern_1 <- START %R% one_or_more(WRD) %R% DOT

# Test pattern_1
str_view(play_lines, pattern = pattern_1, 
  match = TRUE) #to see matched lines
str_view(play_lines, pattern = pattern_1, 
  match = FALSE) 

Instruction 2:

# Pattern for start, capital, word then .
pattern_2 <- START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT

# Test pattern_2
str_view(play_lines, pattern_2, match = TRUE)
str_view(play_lines, pattern_2, match = FALSE)

Instruction 3:

# Pattern from last step
pattern_2 <- START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT

# Get subset of lines that match
lines <- str_subset(play_lines, pattern = pattern_2)

# Extract match from lines
who <- str_extract(lines, pattern = pattern_2)

# Let's see what we have
unique(who)
5.4 Identifying the lines, take 2

Instruction 1:

# Create vector of characters
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen", "Chasuble", 
  "Merriman", "Lady Bracknell", "Miss Prism")

# Match start, then character name, then .
pattern_3 <- START %R% or1(characters) %R% DOT

# View matches of pattern_3
str_view(play_lines, pattern = pattern_3, match = TRUE)
# View non-matches of pattern_3
str_view(play_lines, pattern = pattern_3, match = FALSE)

Instruction 2:

# Variables from previous step
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen", "Chasuble", 
  "Merriman", "Lady Bracknell", "Miss Prism")
pattern_3 <- START %R% or1(characters) %R% DOT

# Pull out matches
lines <- str_subset(play_lines, pattern = pattern_3)

# Extract match from lines
who <- str_extract(lines, pattern = pattern_3)

# Let's see what we have
unique(who)

# Count lines per character
table(who)
5.5 A case study on case (video)
5.6 Changing case to ease matching

Instruction 1:

# catcidents has been pre-defined

# Construct pattern of DOG in boundaries
whole_dog_pattern <- whole_word("DOG")

# View matches to word "DOG"
str_view(catcidents, pattern = whole_dog_pattern, match = TRUE)

Instruction 2:

# From previous step
whole_dog_pattern <- whole_word("DOG")

# Transform catcidents to upper case
catcidents_upper <- str_to_upper(catcidents)

# View matches to word "DOG" again
str_view(catcidents_upper, pattern = whole_dog_pattern, match = TRUE)

Instruction 3:

# From previous steps
whole_dog_pattern <- whole_word("DOG")
catcidents_upper <- str_to_upper(catcidents)

# Which strings match?
has_dog <- str_detect(catcidents_upper, pattern = whole_dog_pattern)

# Pull out matching strings in original 
catcidents[has_dog]
5.7 Ignoring case when matching

Instruction 1:

# View matches to "TRIP"
str_view(catcidents, pattern = "TRIP", match = TRUE)

# Construct case insensitive pattern
trip_pattern <- regex("TRIP", ignore_case = TRUE)

# View case insensitive matches to "TRIP"
str_view(catcidents, pattern = trip_pattern, match = TRUE)

Instruction 2:

# From previous step
trip_pattern <- regex("TRIP", ignore_case = TRUE)

# Get subset of matches
trip <- str_subset(catcidents, pattern = trip_pattern)

# Extract matches
str_extract(trip, pattern = trip_pattern)
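The two approaches to case, transforming the text or relaxing the pattern, can be compared on a few made-up strings. A minimal sketch, assuming whole_word("DOG") corresponds to the regex `\bDOG\b` (`demo_text` is hypothetical):

```r
library(stringr)

# Hypothetical narratives in mixed case
demo_text <- c("bitten by dog", "DOGGED pursuit", "Tripped over Dog")

# Approach 1: upper-case the text, then match the whole word "DOG"
has_dog <- str_detect(str_to_upper(demo_text), "\\bDOG\\b")   # TRUE FALSE TRUE

# Approach 2: keep the text as-is and ignore case in the pattern
has_trip <- str_detect(demo_text, regex("TRIP", ignore_case = TRUE))  # FALSE FALSE TRUE

# Subset the original strings so their case is preserved
demo_text[has_dog]
demo_text[has_trip]
```

Matching on a transformed copy while subsetting the original keeps the source text untouched, which matters when the original case carries meaning.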
5.8 Fixing case problems



# Get first five catcidents
cat5 <- catcidents[1:5]

# Take a look at original
writeLines(cat5)

# Transform to title case
writeLines(str_to_title(cat5))

# Transform to title case with stringi
writeLines(stri_trans_totitle(cat5)) # same result as str_to_title()

# Transform to sentence case with stringi
writeLines(stri_trans_totitle(cat5, type = "sentence"))
5.9 Wrapping up
5.10 An interview with Hadley Wickham




