Text-Mining-DataCamp-Text Mining: Bag of Words
1. Jumping into Text Mining with Bag of Words
1.1 What is text mining? (video)
1.2 Understanding text mining
1.3 Quick taste of text mining
Instruction:
We’ve created an object in your workspace called new_text
containing several sentences.
- Load the
qdap
package. - Print
new_text
to the console. - Create
term_count
consisting of the 10 most frequent terms innew_text
. - Plot a bar chart with the results of
term_count
.
# Load qdap
library(qdap)
# Print new_text to the console
new_text
# Find the 10 most frequent terms: term_count
term_count <- freq_terms(new_text, 10)
# Plot term_count
plot(term_count)
1.4 Getting started (video)
1.5 Load some text
The data has been loaded for you and is available in coffee_data_file
.
- Create a new object
tweets
usingread.csv()
on the filecoffee_data_file
, which contains tweets mentioning coffee. Remember to addstringsAsFactors = FALSE
! - Examine the
tweets
object usingstr()
to determine which column has the text you’ll want to analyze. - Make a new
coffee_tweets
object using only the text column you identified earlier. To do so, use the$
operator and column name.
Instruction:
# Import text data
tweets <- read.csv('https://assets.datacamp.com/production/course_935/datasets/coffee.csv', stringsAsFactors = F)
# View the structure of tweets
str(tweets)
# Isolate text from tweets: coffee_tweets
coffee_tweets <- tweets$text
1.6 Make the vector a VCorpus object (1)
Instruction:
- Load the
tm
package. - Create a Source object from the
coffee_tweets
vector. Call this new objectcoffee_source
.
# Load tm
library(tm)
# Make a vector source from coffee_tweets
coffee_source <- VectorSource(coffee_tweets)
1.7 Make the vector a VCorpus object (2)
Instruction:
- Call the
VCorpus()
function on thecoffee_source
object to createcoffee_corpus
. - Verify
coffee_corpus
is aVCorpus
object by printing it to the console. - Print the 15th element of
coffee_corpus
to the console to verify that it’s aPlainTextDocument
that contains the content and metadata of the 15th tweet. Use double bracket subsetting. - Print the content of the 15th tweet in
coffee_corpus
. Use double brackets to select the proper tweet, followed by single brackets to extract the content of that tweet. - Print the
content()
of the 10th tweet withincoffee_corpus
## coffee_source is already in your workspace
# Make a volatile corpus: coffee_corpus
coffee_corpus <- VCorpus(coffee_source)
# Print out coffee_corpus
coffee_corpus
# Print the 15th tweet in coffee_corpus
coffee_corpus[[15]]
# Print the contents of the 15th tweet in coffee_corpus
coffee_corpus[[15]]$content
# Now use content to review plain text of the 10th tweet
content(coffee_corpus[[10]])
1.8 Make a VCorpus from a data frame
Instruction:
In your workspace, there’s a simple data frame called example_text
with the correct column names and some metadata. There is also vec_corpus
which is a volatile corpus made with VectorSource()
- Create
df_source
usingDataframeSource()
with the example_text. - Create
df_corpus
by convertingdf_source
to a volatile corpus object withVCorpus()
. - Print out
df_corpus
. Notice how many documents it contains and the number of retained document level metadata points. - Use
meta()
ondf_corpus
to print the document associated metadata. - Examine the pre-loaded
vec_corpus
object. Compare the number of documents todf_corpus
. - Use
meta()
onvec_corpus
to compare any metadata found betweenvec_corpus
anddf_corpus
.
# Create a DataframeSource from the example text
df_source <- DataframeSource(example_text)
# Convert df_source to a volatile corpus
df_corpus <- VCorpus(df_source)
# Examine df_corpus
df_corpus
# Examine df_corpus metadata
meta(df_corpus)
# Compare the number of documents in the vector source
vec_corpus
# Compare metadata in the vector corpus
meta(vec_corpus)
1.9 Cleaning and preprocessing text (video)
1.10 Common cleaning functions from tm
Instruction:
Apply each of the following functions to text
, simply printing results to the console:
tolower()
removePunctuation()
removeNumbers()
stripWhitespace()
# Create the object: text
text <- "<b>She</b> woke up at 6 A.M. It\'s so early! She was only 10% awake and began drinking coffee in front of her computer."
# Make lowercase
tolower(text)
# Remove punctuation
removePunctuation(text)
# Remove numbers
removeNumbers(text)
# Remove whitespace
stripWhitespace(text)
1.11 Cleaning with qdap
Instruction:
Apply the following functions to the text
object from the previous exercise:
- bracketX()
- replace_number()
- replace_abbreviation()
- replace_contraction()
- replace_symbol()
## text is still loaded in your workspace
# Remove text within brackets
bracketX(text)
# Replace numbers with words
replace_number(text)
# Replace abbreviations
replace_abbreviation(text)
# Replace contractions
replace_contraction(text)
# Replace symbols with words
replace_symbol(text)
1.12 All about stop words
Instruction:
- Review standard stop words by calling
stopwords("en")
. - Remove “en” stopwords from
text
. - Add “coffee” and “bean” to the standard stop words, assigning to
new_stops
. - Remove the customized stopwords,
new_stops
, fromtext
.
## text is preloaded into your workspace
# List standard English stop words
stopwords("en")
# Print text without standard stop words
removeWords(text, stopwords("en"))
# Add "coffee" and "bean" to the list: new_stops
new_stops <- c("coffee", "bean", stopwords("en"))
# Remove stop words from text
removeWords(text, new_stops)
1.13 Intro to word stemming and stem completion
Instruction:
- Create a vector called
complicate
consisting of the words “complicated”, “complication”, and “complicatedly” in that order. - Store the stemmed version of
complicate
to an object calledstem_doc
. - Create comp_dict that contains one word, “complicate”.
- Create
complete_text
by applyingstemCompletion()
tostem_doc
. Re-complete the words usingcomp_dict
as the reference corpus. - Print
complete_text
to the console.
# Create complicate
complicate <- c("complicated", "complication", "complicatedly")
# Perform word stemming: stem_doc
stem_doc <- stemDocument(complicate)
# Create the completion dictionary: comp_dict
comp_dict <- c("complicate")
# Perform stem completion: complete_text
complete_text <- stemCompletion(stem_doc, comp_dict)
# Print complete_text
complete_text
1.14 Word stemming and stem completion on a sentence
Instruction:
The document text_data
and the completion dictionary comp_dict
are loaded in your workspace.
- Remove the punctuation marks in
text_data
usingremovePunctuation()
, assigning torm_punc
. - Call
strsplit()
onrm_punc
with thesplit
argument set equal to" "
. Nest this insideunlist()
, assigning ton_char_vec
. - Use
stemDocument()
again to perform word stemming onn_char_vec
, assigning tostem_doc
. - Create
complete_doc
by re-completing your stemmed document withstemCompletion()
and usingcomp_dict
as your reference corpus.
Are stem_doc
and complete_doc
what you expected?
# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)
# Create character vector: n_char_vec
n_char_vec <- unlist(strsplit(rm_punc, split = ' '))
# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)
# Print stem_doc
stem_doc
# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc, comp_dict)
# Print complete_doc
complete_doc
1.15 Apply preprocessing steps to a corpus
Instruction 1:
- Edit the custom function
clean_corpus()
in the sample code to apply (in order):tm
'sremovePunctuation()
.- Base R’s
tolower()
. - Append
"mug"
to the stop words list. tm
'sstripWhitespace()
.
# Alter the function code to match the instructions
clean_corpus <- function(corpus) {
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Transform to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Add more stopwords
corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "coffee", "mug"))
# Strip whitespace
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
Instruction 2:
- Create
clean_corp
by applyingclean_corpus()
to the included corpustweet_corp
. - Print the cleaned 227th tweet in
clean_corp
using indexing[[227]]
andcontent()
. - Compare it to the original tweet from
tweets$text
tweet using[227]
.
# Alter the function code to match the instructions
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "coffee", "mug"))
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
# Apply your customized function to the tweet_corp: clean_corp
clean_corp <- clean_corpus(tweet_corp)
# Print out a cleaned up tweet
clean_corp[[227]][1]
# Print out the same tweet in original form
tweet_corp[[227]][1]
1.16 The TDM & DTM (video)
1.17 Understanding TDM and DTM
1.18 Make a document-term matrix
Instruction:
- Create
coffee_dtm
by applyingDocumentTermMatrix()
toclean_corp
. - Create
coffee_m
, a matrix version ofcoffee_dtm
, usingas.matrix()
. - Print the dimensions of
coffee_m
to the console using thedim()
function. Note the number of rows and columns. - Print the subset of
coffee_m
containing documents (rows) 25 through 35 and terms (columns)"star"
and"starbucks"
.
# Create the document-term matrix from the corpus
coffee_dtm <- DocumentTermMatrix(clean_corp)
# Print out coffee_dtm data
coffee_dtm
# Convert coffee_dtm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_dtm)
# Print the dimensions of coffee_m
dim(coffee_m)
# Review a portion of the matrix to get some Starbucks
coffee_m[25:35, c("star", "starbucks")]
1.19 Make a term-document matrix
Instruction:
- Create
coffee_tdm
by applyingTermDocumentMatrix()
toclean_corp
. - Create
coffee_m
by convertingcoffee_tdm
to a matrix usingas.matrix()
. - Print the dimensions of
coffee_m
to the console. Note the number of rows and columns. - Print the subset of
coffee_m
containing terms (rows)"star"
and"starbucks"
and documents (columns) 25 through 35.
# Create a term-document matrix from the corpus
coffee_tdm <- TermDocumentMatrix(clean_corp)
# Print coffee_tdm data
coffee_tdm
# Convert coffee_tdm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)
# Print the dimensions of the matrix
dim(coffee_m)
# Review a portion of the matrix
coffee_m[c("star", "starbucks"), 25:35]
2. Word Clouds and More Interesting Visuals
2.1 Common text mining visuals (video)
2.2 Test your understanding of text mining
2.3 Frequent terms with tm
Instruction:
- Create
coffee_m
as a matrix using the term-document matrixcoffee_tdm
from the last chapter. - Create
term_frequency
using therowSums()
function oncoffee_m
. - Sort
term_frequency
in descending order and store the result interm_frequency
. - Use single square bracket subsetting, i.e. using only one
[
, to print the top 10 terms fromterm_frequency
. - Make a barplot of the top 10 terms.
## coffee_tdm is still loaded in your workspace
# Convert coffee_tdm to a matrix
coffee_m <- as.matrix(coffee_tdm)
# Calculate the row sums of coffee_m
term_frequency <- rowSums(coffee_m)
# Sort term_frequency in decreasing order
term_frequency <- sort(term_frequency, decreasing = T)
# View the top 10 most common words
term_frequency[1:10]
# Plot a barchart of the 10 most common words
barplot(term_frequency[1:10], col = "tan", las = 2)
2.4 Frequent terms with qdap
Instruction 1:
- Create
frequency
using thefreq_terms()
function ontweets$text
. Include arguments to accomplish the following:- Limit to the top 10 terms.
- At least 3 letters per term.
- Use
"Top200Words"
to define stop words.
- Produce a
plot()
of thefrequency
object. Compare it to the plot you produced in the previous exercise.
# Create frequency
frequency <- freq_terms(
tweets$text,
top = 10,
at.least = 3,
stopwords = "Top200Words"
)
# Make a frequency barchart
plot(frequency)
Instruction 2:
- Again, create
frequency
using thefreq_terms()
function ontweets$text
. Include the following arguments:- Limit to the top 10 terms.
- At least 3 letters per term.
- This time use
stopwords("english")
to define stop words.
- Produce a
plot()
offrequency
. Compare it to the plot offrequency
. Do certain words change based on the stop words criterion?
# Create frequency
frequency <- freq_terms(tweets$text,
top = 10,
at.least = 3,
stopwords = stopwords("english")
)
# Make a frequency barchart
plot(frequency)
2.5 Intro to word clouds (video)
2.6 A simple word cloud
Instruction:
- Load the
wordcloud
package. - Print out first 10 entries in
term_frequency
. - Extract the terms using
names()
onterm_frequency
. Call the vector of stringsterms_vec
. - Create a
wordcloud()
usingterms_vec
as the words, andterm_frequency
as the values. Add the parametersmax.words = 50
andcolors = "red"
.
# Load wordcloud package
library(wordcloud)
# Print the first 10 entries in term_frequency
term_frequency[1:10]
# Vector of terms
terms_vec <- names(term_frequency)
# Create a wordcloud for the values in word_freqs
wordcloud(terms_vec, term_frequency, max.words = 50, colors = "red")
2.7 Stop words and word clouds
Instruction:
- Apply
content()
to the 24th document inchardonnay_corp
. - Append
"chardonnay"
to the English stopwords, assigning tostops
. - Examine the last 6 words in
stops
. - Create
cleaned_chardonnay_corp
withtm_map()
by passing in thechardonnay_corp
, the functionremoveWords
and finally the stopwords,stops
. - Now examine the
content
of the24
tweet again to compare results.
# Review a "cleaned" tweet
content(chardonnay_corp[[24]])
# Add to stopwords
stops <- c(stopwords(kind = 'en'), 'chardonnay')
# Review last 6 stopwords
tail(stops)
# Apply to a corpus
cleaned_chardonnay_corp <- tm_map(chardonnay_corp, removeWords, stops)
# Review a "cleaned" tweet again
content(cleaned_chardonnay_corp[[24]])
2.8 Plot the better word cloud
Instruction:
We’ve loaded the wordcloud
package for you behind the scenes and will do so for all additional exercises requiring it.
- Sort the values in
chardonnay_words
withdecreasing = TRUE
. Save assorted_chardonnay_words
. - Look at the top 6 words in
sorted_chardonnay_words
and their values. - Create
terms_vec
usingnames()
onchardonnay_words
. - Pass
terms_vec
andchardonnay_words
into thewordcloud()
function. Review what other words pop out now that “chardonnay” is removed.
# Sort the chardonnay_words in descending order
sorted_chardonnay_words <- sort(chardonnay_words, decreasing = TRUE)
# Print the 6 most frequent chardonnay terms
head(sorted_chardonnay_words)
# Get a terms vector
terms_vec <- names(chardonnay_words)
# Create a wordcloud for the values in word_freqs
wordcloud(terms_vec, chardonnay_words, max.words = 50, colors = "red")
2.9 Improve word cloud colors
Instruction:
- Call the
colors()
function to list all basic colors. - Create a
wordcloud()
using the predefinedchardonnay_freqs
with the colors “grey80”, “darkgoldenrod1”, and “tomato”. Include the top 100 terms usingmax.words
.
# Print the list of colors
colors()
# Print the wordcloud with the specified colors
wordcloud(chardonnay_freqs$term,
chardonnay_freqs$num,
max.words = 100,
colors = c("grey80","darkgoldenrod1", "tomato"))
2.10 Use prebuilt color palettes
Instruction:
- Use
cividis()
to select5
colors in an object calledcolor_pal
. - Review the hexadecimal colors by printing
color_pal
to your console. - Create a
wordcloud()
from thechardonnay_freqs
term
andnum
columns. Include the top 100 terms usingmax.words
, and set thecolors
to your palette,color_pal
.
# Select 5 colors
color_pal <- cividis(5)
# Examine the palette output
color_pal
# Create a wordcloud with the selected palette
wordcloud(chardonnay_freqs$term,
chardonnay_freqs$num,
max.words = 100,
colors = color_pal)
2.11 Other word clouds and word networks (video)
2.12 Find common words
Instruction:
- Create
all_coffee
by usingpaste()
withcollapse = " "
oncoffee_tweets$text
. - Create
all_chardonnay
by usingpaste()
withcollapse = " "
onchardonnay_tweets$text
. - Create
all_tweets
usingc()
to combineall_coffee
andall_chardonnay
. Makeall_coffee
the first term. - Convert
all_tweets
usingVectorSource()
. - Create
all_corpus
by usingVCorpus()
onall_tweets
.
# Create all_coffee
all_coffee=paste(coffee_tweets$text,collapse=" ")
# Create all_chardonnay
all_chardonnay=paste(chardonnay_tweets$text,collapse=" ")
# Create all_tweets
all_tweets=c(all_coffee,all_chardonnay)
# Convert to a vector source
all_tweets=VectorSource(all_tweets)
# Create all_corpus
all_corpus=VCorpus(all_tweets)
2.13 Visualize common words
Instruction:
- Create
all_clean
by applying the predefinedclean_corpus()
function toall_corpus
. - Create
all_tdm
, aTermDocumentMatrix
from all_clean. - Create
all_m
by convertingall_tdm
to a matrix object. - Create a
commonality.cloud()
fromall_m
withmax.words = 100
andcolors = "steelblue1"
.
# Clean the corpus
all_clean <- clean_corpus(all_corpus)
# Create all_tdm
all_tdm <- TermDocumentMatrix(all_clean)
# Create all_m
all_m <- as.matrix(all_tdm)
# Print a commonality cloud
commonality.cloud(all_m, max.words = 100, colors = "steelblue1")
2.14 Visualize dissimilar words
Instruction:
all_corpus
is preloaded in your workspace.
- Create
all_clean
by applying the predefinedclean_corpus
function toall_corpus
. - Create
all_tdm
, aTermDocumentMatrix
, fromall_clean
. - Use
colnames()
to rename each distinct corpora withinall_tdm
. Name the first column “coffee” and the second column “chardonnay”. - Create
all_m
by convertingall_tdm
into matrix form. - Create a
comparison.cloud()
usingall_m
, withcolors = c("orange", "blue")
andmax.words = 50
.
# Clean the corpus
all_clean <- clean_corpus(all_corpus)
# Create all_tdm
all_tdm <- TermDocumentMatrix(all_clean)
# Give the columns distinct names
colnames(all_tdm) <- c("coffee", "chardonnay")
# Create all_m
all_m <- as.matrix(all_tdm)
# Create comparison cloud
comparison.cloud(all_m,
colors = c("orange", "blue"),
max.words = 50)
2.15 Polarized tag cloud
Instruction 1:
- Convert
all_tdm_m
to a data frame. Set the rownames to a column named"word"
. - Filter to keep rows where all columns are greater than zero, using the syntax
. > 0
. - Add a column named
difference
, equal to the count in thechardonnay
column minus the count in thecoffee
column. - Filter to keep the top
25
rows bydifference
. - Arrange the rows by
desc()
ending order ofdifference
.
top25_df <- all_tdm_m %>%
# Convert to data frame
as_data_frame(rownames = "word") %>%
# Keep rows where word appears everywhere
filter_all(all_vars(. > 0)) %>%
# Get difference in counts
mutate(difference = chardonnay - coffee) %>%
# Keep rows with biggest difference
top_n(25, wt = difference) %>%
# Arrange by descending difference
arrange(desc(difference))
Instruction 2:
- Set the left count to the
chardonnay
column. - Set the right count to the
coffee
column. - Set the labels to the
word
column.
top25_df <- all_tdm_m %>%
# Convert to data frame
as_data_frame(rownames = "word") %>%
# Keep rows where word appears everywhere
filter_all(all_vars(. > 0)) %>%
# Get difference in counts
mutate(difference = chardonnay - coffee) %>%
# Keep rows with biggest difference
top_n(25, wt = difference) %>%
# Arrange by descending difference
arrange(desc(difference))
pyramid.plot(
# Chardonnay counts
top25_df$chardonnay,
# Coffee counts
top25_df$coffee,
# Words
labels = top25_df$word,
top.labels = c("Chardonnay", "Words", "Coffee"),
main = "Words in Common",
unit = NULL,
gap = 8,
)
2.16 Visualize word networks
Instruction:
Update the word_associate()
plotting code to work with the coffee data.
- Change the vector to
coffee_tweets$text
. - Change the match string to
"barista"
. - Change
"chardonnay"
to"coffee"
in the stopwords too. - Change the title to
"Barista Coffee Tweet Associations"
in the sample code for the plot.
# Word association
word_associate(coffee_tweets$text, match.string = "barista",
stopwords = c(Top200Words, "coffee", "amp"),
network.plot = TRUE, cloud.colors = c("gray85", "darkred"))
# Add title
title(main = "Barista Coffee Tweet Associations")
2.17 Teaser simple word clustering
Instruction:
A hierarchical cluster object, hc
, has been created for you from the coffee tweets.
Create a dendrogram using plot()
on hc
.
# Plot a dendrogram
plot(hc)
3. Adding to Your tm Skills
3.1 Simple word clustering (video)
3.2 Test your understanding of text mining
3.3 Distance matrix and dendrogram
Instruction:
The data frame rain
has been preloaded in your workspace.
- Create
dist_rain
by using thedist()
function on the values in the second column of rain. - Print the
dist_rain
matrix to the console. - Create
hc
by performing a cluster analysis, usinghclust()
ondist_rain
. plot()
thehc
object withlabels = rain$city
to add the city names.
# Create dist_rain
dist_rain <- dist(rain[, 2])
# View the distance matrix
dist_rain
# Create hc
hc <- hclust(dist_rain)
# Plot hc
plot(hc, labels = rain$city)
3.4 Make a dendrogram friendly TDM
Instruction:
tweets_tdm
has been created using the chardonnay tweets.
- Print the dimensions of
tweets_tdm
to the console. - Create
tdm1
usingremoveSparseTerms()
withsparse = 0.95
ontweets_tdm
. - Create
tdm2
usingremoveSparseTerms()
withsparse = 0.975
ontweets_tdm
. - Print
tdm1
to the console to see how many terms are left. - Print
tdm2
to the console to see how many terms are left.
# Print the dimensions of tweets_tdm
dim(tweets_tdm)
# Create tdm1
tdm1 <- removeSparseTerms(tweets_tdm, sparse = 0.95)
# Create tdm2
tdm2 <- removeSparseTerms(tweets_tdm, sparse = 0.975)
# Print tdm1
tdm1
# Print tdm2
tdm2
3.5 Put it all together: a tea based dendrogram
Instruction:
- Create
tweets_tdm2
by applyingremoveSparseTerms()
ontweets_tdm
. Usesparse = 0.975
. - Create
tdm_m
by usingas.matrix()
ontweets_tdm2
to convert it to matrix form. - Create
tweets_dist
containing the distances oftdm_m
using thedist()
function. - Create a hierarchical cluster object called
hc
usinghclust()
ontweets_dist
.
Make a dendrogram withplot()
andhc
.
# Create tweets_tdm2
tweets_tdm2 <- removeSparseTerms(tweets_tdm, sparse = 0.975)
# Create tdm_m
tdm_m <- as.matrix(tweets_tdm2)
# Create tweets_dist
tweets_dist <- dist(tdm_m)
# Create hc
hc <- hclust(tweets_dist)
# Plot the dendrogram
plot(hc)
3.6 Dendrogram aesthetics
Instruction:
The dendextend
package has been loaded for you, and a hierarchical cluster object, hc
, was created from tweets_dist
.
- Create
hcd
as a dendrogram usingas.dendrogram()
onhc
. - Print the
labels
ofhcd
to the console. - Use
branches_attr_by_labels()
to color the branches. Pass it three arguments: thehcd
object,c("marvin", "gaye")
, and the color"red"
. Assign tohcd_colored
. plot()
the dendrogramhcd_colored
with the title"Better Dendrogram"
, added using themain
argument.- Add rectangles to the plot using
rect.dendrogram()
. Specifyk = 2
clusters and aborder
color of"grey50"
.
# Create hcd
hcd <- as.dendrogram(hc)
# Print the labels in hcd
labels(hcd)
# Change the branch color to red for "marvin" and "gaye"
hcd_colored <- branches_attr_by_labels(hcd, c("marvin", "gaye"), color = "red")
# Plot hcd
plot(hcd_colored, main = "Better Dendrogram")
# Add cluster rectangles
rect.dendrogram(hcd_colored, k = 2, border = "grey50")
3.7 Using word association
Instruction:
- Create
associations
usingfindAssocs()
ontweets_tdm
to find terms associated with “venti”, which meet a minimum threshold of0.2
. - View the terms associated with “venti” by printing
associations
to the console. - Create
associations_df
, by callinglist_vect2df()
, passingassociations
, then settingcol2
to"word"
andcol3
to"score"
. - Run the
ggplot2
code to make a dot plot of the association values.
# Create associations
associations <- findAssocs(tweets_tdm, "venti", 0.2)
# View the venti associations
associations
# Create associations_df
associations_df <- list_vect2df(associations, col2 = "word", col3 = "score")
# Plot the associations_df values
ggplot(associations_df, aes(score, word)) +
geom_point(size = 3) +
theme_gdocs()
3.8 Getting past single words (video)
3.9 N-gram tokenization
3.10 Changing n-grams
Instruction:
A corpus
has been preprocessed as before using the chardonnay tweets. The resulting object text_corp
is available in your workspace.
- Create a
tokenizer
function like the above which creates 2-word bigrams. - Make
unigram_dtm
by callingDocumentTermMatrix()
ontext_corp
without using thetokenizer()
function. - Make
bigram_dtm
usingDocumentTermMatrix()
ontext_corp
with thetokenizer()
function you just made. - Examine
unigram_dtm
andbigram_dtm
. Which has more terms?
# Make tokenizer function
tokenizer <- function(x)
NGramTokenizer(x, Weka_control(min = 2, max = 2))
# Create unigram_dtm
unigram_dtm <- DocumentTermMatrix(text_corp)
# Create bigram_dtm
bigram_dtm <- DocumentTermMatrix(
text_corp,
control = list(tokenize = tokenizer))
# Print unigram_dtm
unigram_dtm
# Print bigram_dtm
bigram_dtm
3.11 How do bigrams affect word clouds?
Instruction:
The chardonnay tweets have been cleaned and organized into a DTM called bigram_dtm
.
- Create
bigram_dtm_m
by convertingbigram_dtm
to a matrix. - Create an object
freq
consisting of the word frequencies by applyingcolSums()
onbigram_dtm_m
. - Extract the character vector of word combinations with
names(freq)
and assign the result tobi_words
. - Pass
bi_words
tostr_subset()
with the matching pattern"^marvin"
to review all bigrams starting with “marvin”. - Plot a simple
wordcloud()
passingbi_words
,freq
andmax.words = 15
into the function.
# Create bigram_dtm_m
bigram_dtm_m <- as.matrix(bigram_dtm)
# Create freq
freq <- colSums(bigram_dtm_m)
# Create bi_words
bi_words <- names(freq)
# Examine part of bi_words
str_subset(bi_words, pattern = "^marvin")
# Plot a wordcloud
wordcloud(bi_words, freq, max.words = 15)
3.12 Different frequency criteria (video)
3.13 Changing frequency weights
Instruction 1:
- Create
tdm
, a term frequency-basedTermDocumentMatrix()
usingtext_corp
. - Create
tdm_m
by convertingtdm
to matrix form. - Examine the term frequency for “coffee”, “espresso”, and “latte” in a few tweets. Subset
tdm_m
to get rowsc("coffee", "espresso", "latte")
and columns 161 to 166.
# Create a TDM
tdm <- TermDocumentMatrix(text_corp)
# Convert it to a matrix
tdm_m <- as.matrix(tdm)
# Examine part of the matrix
tdm_m[c("coffee", "espresso", "latte"), 161:166]
Instruction 2:
- Edit the
TermDocumentMatrix()
to useTfIdf
weighting. Passcontrol = list(weighting = weightTfIdf)
as an argument to the function. - Run the code and compare the new scores to the first part of the exercise.
# Edit the controls to use Tfidf weighting
tdm <- TermDocumentMatrix(text_corp,
control = list(weighting = weightTfIdf))
# Convert to matrix again
tdm_m <- as.matrix(tdm)
# Examine the same part: how has it changed?
tdm_m[c("coffee", "espresso", "latte"), 161:166]
3.14 Capturing metadata in tm
Instruction:
- Rename the first column of
tweets
to “doc_id”. - Set the document schema with
DataframeSource()
on the smallertweets
data frame. - Make the document collection a volatile corpus nested in the custom
clean_corpus()
function. - Apply
content()
to the first tweet with double brackets such astext_corpus[[1]]
to see the cleaned plain text. - Confirm that all metadata was captured using the
meta()
function on the first document with single brackets.
Remember, when accessing part of a corpus the double or single brackets make a difference! For this exercise you will use double brackets with content()
and single brackets with meta()
.
# Rename columns
names(tweets)[1] <- "doc_id"
# Set the schema: docs
docs <- DataframeSource(tweets)
# Make a clean volatile corpus: text_corpus
text_corpus <- clean_corpus( VCorpus(docs))
# Examine the first doc content
content(text_corpus[[1]])
# Access the first doc metadata
meta(text_corpus[1])
4. Battle of the Tech Giants for Talent
4.1 Amazon vs. Google (video)
4.2 Organizing a text mining project
4.3 Step 1: Problem definition
4.4 Step 2: Identifying the text sources
Instruction:
- View the structure of
amzn
withstr()
to get its dimensions and a preview of the data. - Create
amzn_pros
from the positive reviews columnamzn$pros
. - Create
amzn_cons
from the negative reviews columnamzn$cons
. - Print the structure of
goog
withstr()
to get its dimensions and a preview of the data. - Create
goog_pros
from the positive reviews columngoog$pros
. - Create
goog_cons
from the negative reviews columngoog$cons
.
# Print the structure of amzn
str(amzn)
# Create amzn_pros
amzn_pros <- amzn$pros
# Create amzn_cons
amzn_cons <- amzn$cons
# Print the structure of goog
str(goog)
# Create goog_pros
goog_pros <- goog$pros
# Create goog_cons
goog_cons <- goog$cons
4.5 Step 3:Text organization (video)
4.6 Text organization
Instruction 1:
- Apply
qdap_clean()
toamzn_pros
, assigning toqdap_cleaned_amzn_pros
. - Create a vector source (
VectorSource()
) fromqdap_cleaned_amzn_pros
, then turn it into a volatile corpus (VCorpus()
), assigning toamzn_p_corp
. - Create
amzn_pros_corp
by applyingtm_clean()
toamzn_p_corp
.
# qdap_clean the text
qdap_cleaned_amzn_pros <- qdap_clean(amzn_pros)
# Source and create the corpus
amzn_p_corp <- VCorpus(VectorSource(qdap_cleaned_amzn_pros))
# tm_clean the corpus
amzn_pros_corp <- tm_clean(amzn_p_corp)
Instruction 2:
- Apply
qdap_clean()
toamzn_cons
, assigning toqdap_cleaned_amzn_cons
. - Create a vector source from
qdap_cleaned_amzn_cons
, then turn it into a volatile corpus, assigning toamzn_c_corp
. - Create
amzn_cons_corp
by applyingtm_clean()
toamzn_c_corp
.
# qdap_clean the text
qdap_cleaned_amzn_cons <- qdap_clean(amzn_cons)
# Source and create the corpus
amzn_c_corp <- VCorpus(VectorSource(qdap_cleaned_amzn_cons))
# tm_clean the corpus
amzn_cons_corp <- tm_clean(amzn_c_corp)
4.7 Working with Google reviews
Instruction 1:
- Apply
qdap_clean()
togoog_pros
, assigning toqdap_cleaned_goog_pros
. - Create a vector source (
VectorSource()
) fromqdap_cleaned_goog_pros
, then turn it into a volatile corpus (VCorpus()
), assigning togoog_p_corp
. - Create
goog_pros_corp
by applyingtm_clean()
togoog_p_corp
.
# qdap_clean the text
qdap_cleaned_goog_pros <- qdap_clean(goog_pros)
# Source and create the corpus
goog_p_corp <- VCorpus(VectorSource(qdap_cleaned_goog_pros))
# tm_clean the corpus
goog_pros_corp <- tm_clean(goog_p_corp)
Instruction 2:
- Apply
qdap_clean()
togoog_cons
, assigning toqdap_cleaned_goog_cons
. - Create a vector source from
qdap_cleaned_goog_cons
, then turn it into a volatile corpus, assigning togoog_c_corp
. - Create
goog_cons_corp
by applyingtm_clean()
togoog_c_corp
.
# qdap clean the text
qdap_cleaned_goog_cons <- qdap_clean(goog_cons)
# Source and create the corpus
goog_c_corp <- VCorpus(VectorSource(qdap_cleaned_goog_cons))
# tm clean the corpus
goog_cons_corp <- tm_clean(goog_c_corp)
4.8 Steps 4 & 5: Feature extraction & analysis
4.9 Feature extraction & analysis: amzn_pros
Instruction:
- Create
amzn_p_tdm
as aTermDocumentMatrix
fromamzn_pros_corp
. Make sure to addcontrol = list(tokenize = tokenizer)
so that the terms are bigrams. - Create
amzn_p_tdm_m
fromamzn_p_tdm
by using theas.matrix()
function. - Create
amzn_p_freq
to obtain the term frequencies fromamzn_p_tdm_m
. - Create a
wordcloud()
usingnames(amzn_p_freq)
as the words,amzn_p_freq
as their frequencies, andmax.words = 25
andcolor = "blue"
for aesthetics.
# Create amzn_p_tdm
amzn_p_tdm <- TermDocumentMatrix(
amzn_pros_corp,
control = list(tokenize = tokenizer)
)
# Create amzn_p_tdm_m
amzn_p_tdm_m <- as.matrix(amzn_p_tdm)
# Create amzn_p_freq
amzn_p_freq <- rowSums(amzn_p_tdm_m)
# Plot a wordcloud using amzn_p_freq values
wordcloud(names(amzn_p_freq),
amzn_p_freq,
max.words = 25,
color = "blue")
4.10 Feature extraction & analysis: amzn_cons
Instruction:
Create amzn_c_tdm
by converting amzn_cons_corp
into a TermDocumentMatrix
and incorporating the bigram function control = list(tokenize = tokenizer)
.
Create amzn_c_tdm_m
as a matrix version of amzn_c_tdm
.
Create amzn_c_freq
by using rowSums()
to get term frequencies from amzn_c_tdm_m
.
Create a wordcloud()
using names(amzn_c_freq)
and the values amzn_c_freq
. Use the arguments max.words = 25
and color = "red"
as well.
# Create amzn_c_tdm
amzn_c_tdm <- TermDocumentMatrix(
amzn_cons_corp,
control = list(tokenize = tokenizer))
# Create amzn_c_tdm_m
amzn_c_tdm_m <- as.matrix(amzn_c_tdm)
# Create amzn_c_freq
amzn_c_freq <- rowSums(amzn_c_tdm_m)
# Plot a wordcloud of negative Amazon bigrams
wordcloud(names(amzn_c_freq), amzn_c_freq,
max.words = 25, color = "red")
4.11 amzn_cons dendrogram
Instruction:
- Create
amzn_c_tdm
as aTermDocumentMatrix
usingamzn_cons_corp
withcontrol = list(tokenize = tokenizer)
. - Print
amzn_c_tdm
to the console. - Create
amzn_c_tdm2
by applying theremoveSparseTerms()
function toamzn_c_tdm
with the sparse argument equal to.993
. - Create
hc
, a hierarchical cluster object by nesting the distance matrixdist(amzn_c_tdm2)
inside thehclust()
function. Make sure to also passmethod = "complete"
to thehclust()
function. - Plot
hc
to view the clustered bigrams and see how the concepts in the Amazon cons section may lead you to a conclusion.
# Create amzn_c_tdm
amzn_c_tdm <- TermDocumentMatrix(
amzn_cons_corp,
control = list(tokenize = tokenizer))
# Print amzn_c_tdm to the console
amzn_c_tdm
# Create amzn_c_tdm2 by removing sparse terms
amzn_c_tdm2 <- removeSparseTerms(amzn_c_tdm, sparse = .993)
# Create hc as a cluster of distance values
hc <- hclust(
d = dist(amzn_c_tdm2), method = "complete")
# Produce a plot of hc
plot(hc)
4.12 Word association
Instruction:
The amzn_pros_corp
corpus has been cleaned using the custom functions like before.
Construct a TDM called amzn_p_tdm
from amzn_pros_corp
and control = list(tokenize = tokenizer)
.
Create amzn_p_m
by converting amzn_p_tdm
to a matrix.
Create amzn_p_freq
by applying rowSums()
to amzn_p_m
.
Create term_frequency
using sort()
on amzn_p_freq
along with the argument decreasing = TRUE
.
Examine the first 5 bigrams using term_frequency[1:5]
.
You may be surprised to see “fast paced” as a top term because it could be a negative term related to “long hours”. Look at the terms most associated with “fast paced”. Use findAssocs()
on amzn_p_tdm
to examine "fast paced"
with a 0.2
cutoff.
# Create amzn_p_tdm
amzn_p_tdm <- TermDocumentMatrix(
amzn_pros_corp,
control = list(tokenize = tokenizer)
)
# Create amzn_p_m
amzn_p_m <- as.matrix(amzn_p_tdm)
# Create amzn_p_freq
amzn_p_freq <- rowSums(amzn_p_m)
# Create term_frequency
term_frequency <- sort(amzn_p_freq, decreasing = T)
# Print the 5 most common terms
term_frequency[1:5]
# Find associations with fast paced
associations <- findAssocs(amzn_p_tdm, "fast paced", 0.2)
associations
4.13 Quick review of Google reviews
Instruction:
The all_goog_corpus
object consisting of Google pro and con reviews is loaded in your workspace.
Create all_goog_corp
by cleaning all_goog_corpus
with the predefined tm_clean()
function.
Create all_tdm
by converting all_goog_corp
to a term-document matrix.
Create all_m
by converting all_tdm
to a matrix.
Construct a comparison.cloud()
from all_m
. Set max.words
to 100
. The colors
argument is specified for you.
# Create all_goog_corp
all_goog_corp <- tm_clean(all_goog_corpus)
# Create all_tdm
all_tdm <- TermDocumentMatrix(all_goog_corp)
# Create all_m
all_m <- as.matrix(all_tdm)
# Build a comparison cloud
comparison.cloud(all_m,
colors = c("#F44336", "#2196f3"),
max.words = 100)
4.14 Cage match! Amazon vs. Google pro reviews
Instruction:
- Create
common_words
fromall_tdm_df
usingdplyr
functions.filter()
on theAmazonPro
column for nonzero values.- Likewise filter the
GooglePro
column for nonzero values. - Then
mutate()
a new column,diff
which is theabs
(absolute) difference between the term frequencies columns.
- Although we could have piped again, create
top5_df
by applyingtop_n
tocommon_words
to extract the top5
values in thediff
column. It will print to your console for review. - Create a
pyramid.plot
passing intop5_df$AmazonPro
thentop5_df$GooglePro
and finally add labels withtop5_df$terms
.
# Filter to words in common and create an absolute diff column
common_words <- all_tdm_df %>%
filter(
AmazonPro != 0,
GooglePro != 0
) %>%
mutate(diff = abs(AmazonPro - GooglePro))
# Extract top 5 common bigrams
(top5_df <- top_n(common_words, 5))
# Create the pyramid plot
pyramid.plot(top5_df$AmazonPro, top5_df$GooglePro,
labels = top5_df$terms, gap = 12,
top.labels = c("Amzn", "Pro Words", "Goog"),
main = "Words in Common", unit = NULL)
4.15 Cage match, part 2! Negative reviews
Instruction:
- Using
top_n()
oncommon_words
, obtain the top5
bigrams weighted on thediff
column. The results of the new object will print to your console. - Create a
pyramid.plot()
. Pass intop5_df$AmazonNeg
,top5_df$GoogleNeg
, and labels = top5_df$terms. For better labeling, setgap
to12
.top.labels
toc("Amzn", "Neg Words", "Goog")
The main
and unit
arguments are set for you.
# Extract top 5 common bigrams
(top5_df <- top_n(common_words, 5, wt = diff))
# Create a pyramid plot
pyramid.plot(
# Amazon on the left
top5_df$AmazonNeg,
# Google on the right
top5_df$GoogleNeg,
# Use terms for labels
labels = top5_df$terms,
# Set the gap to 12
gap = 12,
# Set top.labels to "Amzn", "Neg Words" & "Goog"
top.labels = c("Amzn", "Neg Words", "Goog"),
main = "Words in Common",
unit = NULL
)
4.16 Step 6: Reach a conclusion (video)
4.17 Draw conclusions, insights, or recommendations
Instruction:
在这里插入代码片
4.18 Draw another conclusion, insight, or recommendation
Instruction:
在这里插入代码片