Text Mining

http://www.rdatamining.com/examples/text-mining

This page shows an example of text mining of Twitter data with the R packages twitteR, tm and wordcloud. Package twitteR provides access to Twitter data, tm provides functions for text mining, and wordcloud visualizes the result with a word cloud.

If you have no access to Twitter, the tweets can be downloaded as the file "rdmTweets.RData" from the Data page, and you can then skip the first step below.

Retrieving Text from Twitter

The Twitter API has required authentication since March 2013. Please follow the instructions in "Section 3 - Authentication with OAuth" in the twitteR vignettes on CRAN or at this link to complete authentication before running the code below.

> library(twitteR)
> # retrieve the first 100 tweets (or all tweets if fewer than 100)
> # from the user timeline of @rdatamining
> rdmTweets <- userTimeline("rdatamining", n=100)
> n <- length(rdmTweets)
> rdmTweets[1:3]
[[1]]
Text Mining Tutorial http://t.co/jPHHLEGm
[[2]]
R cookbook with examples http://t.co/aVtIaSEg
[[3]]
Access large amounts of Twitter data for data mining and other tasks within 
R via the twitteR package. http://t.co/ApbAbnxs

Transforming Text

The tweets are first converted to a data frame and then to a corpus.

> df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
> dim(df)
[1] 79 10
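
The row-binding idiom above can be tried without Twitter access. The sketch below uses plain lists as hypothetical stand-ins for the status objects returned by twitteR (the field names id and text are assumptions chosen for illustration), so it only shows how do.call() and rbind() stack one-row data frames.

```r
# Hypothetical stand-ins for tweets: each record is a plain named list.
records <- list(
  list(id = 1L, text = "Text Mining Tutorial"),
  list(id = 2L, text = "R cookbook with examples")
)

# Turn every record into a one-row data frame, then stack the rows.
df <- do.call(rbind, lapply(records, as.data.frame, stringsAsFactors = FALSE))

dim(df)   # number of rows (records) and columns (fields)
df$text   # the text column that would be passed to VectorSource()
```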

> library(tm)
> # build a corpus, which is a collection of text documents
> # VectorSource specifies that the source is character vectors.
> myCorpus <- Corpus(VectorSource(df$text))

After that, the corpus needs several transformations: changing letters to lower case, removing punctuation and numbers, and removing stop words. The general English stop-word list is tailored by adding "available" and "via" and removing "r".

> # with tm 0.6 or later, use tm_map(myCorpus, content_transformer(tolower)) instead
> myCorpus <- tm_map(myCorpus, tolower)
> # remove punctuation
> myCorpus <- tm_map(myCorpus, removePunctuation)
> # remove numbers
> myCorpus <- tm_map(myCorpus, removeNumbers)
> # remove stopwords
> # keep "r" by removing it from the stop-word list
> myStopwords <- c(stopwords('english'), "available", "via")
> # setdiff() is safe even if "r" is absent; x[-which(...)] would then drop every word
> myStopwords <- setdiff(myStopwords, "r")
> myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
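
To see what these transformations do to a single tweet, here is a minimal base-R sketch that does not use tm. The function clean_text() and the short stop-word list are hypothetical and only approximate the tm functions used above.

```r
# Hypothetical helper: lower-case, strip punctuation and digits, drop stop words.
clean_text <- function(x, stop_words) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)   # remove numbers
  words <- strsplit(x, "[[:space:]]+")
  vapply(words, function(w) {
    paste(w[nzchar(w) & !(w %in% stop_words)], collapse = " ")
  }, character(1))
}

# A tiny made-up stop-word list, tailored as in the text:
# "available" and "via" are added, "r" is taken out.
stop_words <- setdiff(c("the", "of", "for", "r", "via", "available"), "r")

cleaned <- clean_text("Access 100 examples of R code, available via CRAN!", stop_words)
cleaned
```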

Stemming Words

In many cases, words need to be stemmed to reduce them to their root forms. For instance, "example" and "examples" are both stemmed to "exampl". Afterwards, one may want to complete the stems back to their original forms so that the words look "normal".

> dictCorpus <- myCorpus
> # stem words in a text document with the snowball stemmers,
> # which requires packages Snowball, RWeka, rJava, RWekajars
> myCorpus <- tm_map(myCorpus, stemDocument)
> # inspect the first three "documents"
> inspect(myCorpus[1:3])
(Some details are removed to make it short. Same applies to inspect() below.)
[[1]]
text mine tutori  httptcojphhlegm
[[2]]
r cookbook exampl httptcoavtiaseg
[[3]]
access amount twitter data data mine task r twitter packag httptcoapbabnx

> # stem completion
> myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)

Print the first three documents in the built corpus.

> inspect(myCorpus[1:3])
[[1]]
text miners tutorial httptcojphhlegm
[[2]]
r cookbook examples httptcoavtiaseg
[[3]]
access amounts twitter data data miners task r twitter package httptcoapbabnxs

One unexpected result of the above stemming and stem completion is that the word "mining" is first stemmed to "mine" and then completed to "miners" instead of "mining", even though there are many instances of "mining" in the tweets and only one instance of "miners".
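
One plausible explanation, assuming that stem completion only considers dictionary words that begin with the stem (which is how prefix matching in tm appears to work): "miners" starts with "mine", but "mining" does not, so "mining" can never be chosen for the stem "mine". The base-R sketch below imitates that behavior. The function complete_stem() and the toy dictionary are hypothetical and not part of tm, and picking the most frequent match is a simplification.

```r
# Hypothetical helper: complete a stem to the most frequent dictionary word
# that starts with it; NA if no dictionary word has the stem as a prefix.
complete_stem <- function(stem, dictionary) {
  hits <- dictionary[startsWith(dictionary, stem)]
  if (length(hits) == 0) return(NA_character_)
  names(sort(table(hits), decreasing = TRUE))[1]
}

# Toy dictionary: "mining" is frequent, "miners" occurs once.
dictionary <- c("mining", "mining", "mining", "miners", "data")

complete_stem("mine", dictionary)  # only "miners" begins with "mine"
complete_stem("min", dictionary)   # both match; the frequent "mining" wins
```

Under this assumption, the frequency of "mining" in the tweets is irrelevant for the stem "mine", because "mining" is filtered out before any choice is made.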

Building a Term-Document Matrix

> # newer versions of tm use control = list(wordLengths = c(1, Inf)) instead
> myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
> inspect(myDtm[266:270,31:40])
A term-document matrix (5 terms, 10 documents)
Non-/sparse entries: 9/41
Sparsity : 82%
Maximal term length: 12
Weighting : term frequency (tf)
             Docs
Terms        31 32 33 34 35 36 37 38 39 40
r             0  0  1  1  1  0  1  2  1  0
ramachandran  0  0  0  0  0  0  1  0  0  0
ranked        0  0  0  1  0  0  0  0  0  0
rapidminer    0  0  0  0  0  0  0  0  0  0
rdatamining   0  0  1  0  0  0  0  0  0  0

Based on the above matrix, many data mining tasks can be done, for example, clustering, classification and association analysis.
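
For readers who want to see what such a matrix contains, the following base-R sketch builds a small one by hand. The three documents are made up for illustration, and splitting on single spaces is a simplifying assumption; tm does considerably more.

```r
# Three made-up, already cleaned documents.
docs <- c("r data mining", "data mining examples", "r package")

# Split each document into words and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# One column per document, one row per term; cells hold term frequencies.
tdm <- vapply(tokens, function(tok) {
  as.integer(table(factor(tok, levels = terms)))
}, integer(length(terms)))
dimnames(tdm) <- list(Terms = terms, Docs = as.character(seq_along(docs)))

tdm
rowSums(tdm)  # total frequency of each term across all documents
```

Each row is a term vector and each column a document vector, which is the form that clustering or association analysis would work on.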

Frequent Terms and Associations

> findFreqTerms(myDtm, lowfreq=10)
[1] "analysis" "data" "examples" "miners" "package" "r" "slides"
[8] "tutorial" "users"

> # which words are associated with "r"?
> findAssocs(myDtm, 'r', 0.30)
   r  users examples package canberra cran list
1.00   0.44     0.34    0.31     0.30 0.30 0.30

> # which words are associated with "mining"?

> # Here "miners" is used instead of "mining",
> # because the latter is stemmed and then completed to "miners". :-(
> findAssocs(myDtm, 'miners', 0.30)
miners data classification httptcogbnpv mahout
  1.00 0.56           0.47         0.47   0.47
recommendation sets supports frequent itemset
          0.47 0.47     0.47     0.40    0.39
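
The scores returned by findAssocs() can be read as correlations between the row of the given term and the rows of the other terms. Assuming plain Pearson correlation, the idea can be sketched in base R on a made-up term-document matrix (the counts below are invented for illustration and are not taken from the tweets).

```r
# Made-up term-document matrix: rows are terms, columns are five documents.
tdm <- rbind(
  r     = c(1, 0, 1, 1, 0),
  users = c(1, 0, 1, 0, 0),
  data  = c(0, 1, 0, 0, 1)
)

# cor() correlates columns, so transpose to correlate terms with each other.
assoc <- cor(t(tdm))["r", ]

# Keep the terms whose correlation with "r" reaches the limit of 0.30.
assoc[assoc >= 0.30]
```

In this toy matrix "users" tends to appear in the same documents as "r", while "data" appears only where "r" is absent, so "data" falls below the limit.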

Word Cloud

After building a term-document matrix, we can show the importance of words with a word cloud (also known as a tag cloud). In the code below, the word "miners" is changed back to "mining".

> library(wordcloud)
> m <- as.matrix(myDtm)
> # calculate the frequency of words
> v <- sort(rowSums(m), decreasing=TRUE)
> myNames <- names(v)
> k <- which(names(v)=="miners")
> myNames[k] <- "mining"
> d <- data.frame(word=myNames, freq=v)
> wordcloud(d$word, d$freq, min.freq=3)



The above word cloud clearly shows that "r", "data" and "mining" are the three most important words, which confirms that the @RDataMining tweets present information on R and data mining. The other important words are "analysis", "examples", "slides", "tutorial" and "package", which shows that the account focuses on documents and examples on analysis and R packages.

More examples of text mining with R and other data mining techniques can be found in my book "R and Data Mining: Examples and Case Studies", which is downloadable as a PDF file at the link.
