library("tm") #text mining
library("SnowballC") #word stemming if necessary
library("wordcloud2") #word cloud generation
library("RColorBrewer") #color of word cloud
library("webshot") #save a word cloud as image
library("htmlwidgets") #save the widget as HTML so webshot can capture it
2. clean text
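Because I am working with song lyrics, I need to remove punctuation, digits, and some common special symbols (Unicode characters, $, @, and so on). Note that when removing punctuation, R by default gives no special treatment to contractions such as I’m, I’d, and We’d. Per the project requirements, we turn I’m into I m.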
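A small sketch of how the order of substitutions affects contractions, assuming the lyrics may contain both the straight apostrophe (') and the curly one (’). The helper name space.apostrophes and the sample strings are made up for illustration and are not part of the original script.

```r
# Hypothetical helper: replace straight and curly apostrophes with a space
# so that a contraction splits into two tokens.
space.apostrophes <- function(x) {
  x <- gsub("'", " ", x, fixed = TRUE)       # ASCII apostrophe
  x <- gsub("\u2019", " ", x, fixed = TRUE)  # right single quotation mark
  x
}

# Made-up sample lines, one with each kind of apostrophe
lyrics <- c("I'm here", "We\u2019d go")

# Stripping punctuation first glues the halves together: "I'm" becomes "Im"
glued <- gsub("[[:punct:]]", "", lyrics[1])

# Spacing the apostrophes first keeps the halves apart: "I'm" becomes "I m"
separated <- gsub("[[:punct:]]", "", space.apostrophes(lyrics))

print(glued)
print(separated)
```

Because `[[:punct:]]` matches the ASCII apostrophe, a substitution that targets `'` has an effect only if it runs before the punctuation step.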
clean.text <-function(x){
# remove rt
##x =gsub("rt","", x)
# remove @mentions (an @ followed by word characters)
x =gsub("@\\w+","", x)
# remove punctuation
x =gsub("[[:punct:]]","", x)
# remove numbers
x =gsub("[[:digit:]]","", x)
# turn I'm into I m (replace the apostrophe with a space)
x =gsub("'"," "