Text Mining
junjun
February 4, 2016
Text analysis is finding ever wider application, so today let's talk about review data.
Obtaining the review data: review data is usually scraped from major websites with a web crawler. The data analyzed here comes from the reviews of one hotel on Ctrip (携程网); after successfully crawling that hotel's reviews, I set about analyzing them. (Note: in a data-analysis or data-mining project, this collection step can be handled by dedicated staff.)
1. Loading the data and packages
#1) This post mainly uses three packages: Rwordseg for word segmentation, tmcn for word-frequency counting, and wordcloud for drawing word clouds
library(Rwordseg)
## Loading required package: rJava
## # Version: 0.2-1
library(tmcn)
## # tmcn Version: 0.1-4
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.2.3
## Loading required package: RColorBrewer
#Note: the Rwordseg and tmcn packages need to be installed manually
#install.packages("F:\\R\\R-3.2.2\\library/Rwordseg_0.2-1.tar.gz", repos = NULL, type = "source")
#install.packages("F:\\R\\R-3.2.2\\library/tmcn_0.1-4.zip", repos = NULL, type = "source")
#2) Read in the data and inspect it
#Evaluation <- read.csv(file=file.choose(), encoding = "UTF-8")
Evaluation <- read.csv(file="F:\\R\\Rworkspace\\文本挖掘\\携程网评论数据分析/携程评价信息采集.csv", encoding = "UTF-8")
str(Evaluation)
## 'data.frame': 1500 obs. of 2 variables:
## $ Score : num 4.8 4.3 4.3 4.3 4.5 4.8 4.5 4.8 4.5 5 ...
## $ Evaluation: Factor w/ 1400 levels "1,硬件都很好 干净 舒适 2,办入住的时候,前台服务人员都半死不活的样子,也不知道怎么那么不高兴! 3,隔音不好,隔壁看电视很大声,"| __truncated__,..: 131 1173 645 1150 824 1269 862 395 46 203 ...
#As shown above: 1,500 records with 2 variables, the review score (numeric) and the review text (factor)
2. Data cleaning
#Remove English letters and digits from the review text
text <- gsub("[a-zA-Z0-9]", "", Evaluation$Evaluation)
str(text)
## chr [1:1500] "超级方便,地铁浦东大道站出口出来对面就是,不用走很多路,穿很小的马路。只是周边最近在施工,看起来比较简陋。酒店还是很不错的。床"| __truncated__ ...
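#A quick sanity check of the pattern on a made-up string: only ASCII letters and digits are stripped, so "WiFi很快,room 201不错" should come back as "很快, 不错"
gsub("[a-zA-Z0-9]", "", "WiFi很快,room 201不错")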
3. Word segmentation
Segment each of the 1,500 review records into words.
#1) Segment the raw reviews
segword <- segmentCN(strwords = text)
#str(segword)
#Inspect the segmentation of the first review
segword[1]
## [[1]]
## [1] "超级" "方便" "地铁" "浦东" "大道" "站" "出口"
## [8] "出来" "对面" "就" "是" "不用" "走" "很多路"
## [15] "穿" "很" "小" "的" "马路" "只" "是"
## [22] "周边" "最近" "在" "施工" "看起来" "比较" "简陋"
## [29] "酒店" "还" "是" "很" "不错" "的" "床"
## [36] "也" "很" "舒服" "整个" "都" "比较" "安静"
## [43] "私" "密" "就" "是" "早餐" "自助餐" "稍微"
## [50] "简单" "了" "点" "没什么" "特别" "的" "寿"
## [57] "司" "包饭" "很" "不" "好吃" "鸡蛋" "饼"
## [64] "很" "赞" "儿童" "半价" "八十九" "也" "略"
## [71] "贵" "了" "些" "这个" "价格" "一般" "都"
## [78] "可以" "享用" "晚餐" "了" "毕竟" "早饭" "还"
## [85] "是" "单调" "的" "有点" "感觉" "不" "值"
## [92] "虽然" "有" "双" "早" "也" "不能" "撇下"
## [99] "一个" "出去" "吃"
#As shown above: the segmented words include many meaningless stop words such as 是, 只, 了, and 也, which need to be removed. Stop-word lists can be found online.
#2) Read in the stop words and build the stop-word list: it can be downloaded from the web; here mystopwords.txt has already been downloaded
#mystopwords <- read.table(file=file.choose(), stringsAsFactors = F)
mystopwords <- read.table(file="F:\\R\\Rworkspace\\文本挖掘\\携程网评论数据分析/mystopwords.txt", stringsAsFactors = F)
str(mystopwords)
## 'data.frame': 10331 obs. of 1 variable:
## $ V1: chr "累累" "要" "漏风声" "好些" ...
head(mystopwords)
## V1
## 1 累累
## 2 要
## 3 漏风声
## 4 好些
## 5 认识
## 6 覆
#As shown above: the stop words are read in as a data frame and need to be converted to a vector
mystopwords <- as.vector(mystopwords[, 1])
head(mystopwords)
## [1] "累累" "要" "漏风声" "好些" "认识" "覆"
#3) Define a function that removes stop words: with the stop-word list in hand, compare the segmented words against it and drop every word that appears in the list.
removewords <- function(target_words, stop_words) {
  #keep only the words that do not appear in the stop-word list
  target_words <- target_words[!(target_words %in% stop_words)]
  return(target_words)
}
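#A quick check of the function on a toy example; only the word found in the toy stop-word list is dropped, so this should return "酒店" "干净"
removewords(c("酒店", "很", "干净"), c("很", "的"))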
#4) Remove the stop words from the segmented reviews, i.e. apply the function to the segmented word lists
segword2 <- sapply(segword, removewords, mystopwords)
#str(segword2)
segword2[1]
## [[1]]
## [1] "地铁" "浦东" "大道" "出口" "走" "很多路" "穿"
## [8] "马路" "施工" "简陋" "酒店" "不错" "床" "舒服"
## [15] "安静" "密" "早餐" "自助餐" "简单" "包饭" "好吃"
## [22] "鸡蛋" "饼" "儿童" "半价" "八十九" "享用" "晚餐"
## [29] "早饭" "单调" "感觉" "撇下" "吃"
#As shown above: the meaningless stop words have been removed. Next we draw a word cloud from these cleaner words to get a rough view of the segmentation.
4. Drawing a word cloud
#1) Get the word frequencies of the segmented text
library(tmcn)
word_freq <- getWordFreq(string=unlist(segword2))
str(word_freq)
## 'data.frame': 2113 obs. of 2 variables:
## $ Word: chr "酒店" "不错" "房间" "服务" ...
## $ Freq: int 771 742 375 363 294 251 205 196 195 189 ...
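#For comparison, base R can produce the same counts without tmcn; a minimal equivalent sketch (word_freq_base is just an illustrative name, and the column names differ from getWordFreq's):
word_freq_base <- as.data.frame(sort(table(unlist(segword2)), decreasing = TRUE), stringsAsFactors = FALSE)
head(word_freq_base)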
#2) Draw the word cloud, showing the 50 most frequent words
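#wordcloud() places words at random, so the layout differs from run to run; fixing the seed beforehand (the value 123 is arbitrary) makes the figure reproducible
set.seed(123)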
opar <- par(no.readonly = T)
par(bg="white")
wordcloud(words = word_freq$Word, freq=word_freq$Freq, max.words = 50, random.color = T, colors = rainbow(7))
par(opar)
#From the plot above: the word 不错 ("good / not bad") clearly dominates, but what exactly is good? Let's look at which reviews contain 不错.
#3) Recover the original reviews behind the high-frequency word: trace the frequent word back to the full reviews containing it
index <- NULL
for (i in 1:length(segword)) {
  if (any(segword[[i]] %in% "不错"))
    index <- unique(c(index, i))
}
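#The loop can also be written in one line with sapply(), yielding the same indices:
#index <- which(sapply(segword, function(words) "不错" %in% words))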
#Total number of reviews containing 不错
length(index)
## [1] 658
text[1]
## [1] "超级方便,地铁浦东大道站出口出来对面就是,不用走很多路,穿很小的马路。只是周边最近在施工,看起来比较简陋。酒店还是很不错的。床也很舒服。整个都比较安静,私密。就是早餐自助餐稍微简单了点,没什么特别的,寿司包饭很不好吃,鸡蛋饼很赞。儿童半价八十九也略贵了些,这个价格一般都可以享用晚餐了。毕竟早饭还是单调的,有点感觉不值。虽然有双早,也不能撇下一个出去吃"
#As shown above: 658 reviews contain the word 不错. Manual intervention is needed here: condense these 不错 expressions into a custom dictionary. This is a very laborious process that takes patient reading of how the reviews phrase their sentiment. After roughly 3 hours of manual word selection (reading and re-reading), the phrases were assembled into a dictionary and imported as custom words. (This approach may be clumsy; if there is a better way, suggestions are welcome.)
#4) Define the custom feature words
words <- c('房间干净','服务不错','酒店不错','不错的酒店','不错的地方','卫生不错','设施不错','设备不错','硬件不错','位置不错','地段不错','景色不错','景观不错','环境不错','风景不错','视野不错','夜景不错','口味不错','味道不错','感觉不错','态度不错','态度冷漠','态度冷淡','服务差劲','热情','热心','不热情','态度好','态度差','态度不好','素质差','质量不错','房间不错','浴缸不错','早餐不错','早餐质量差','自助餐不错','下午茶不错','强烈推荐','推荐入住','值得推荐','性价比不错','隔音不错','体验不错','不错的体验','设施陈旧','五星级酒店','性价比不错','交通便利','交通方便','出行方便','房间小','价格不错','前台效率太低','携程','地理位置','陆家嘴')
#5) Insert the custom words into the segmentation dictionary
insertWords(strwords = words)
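#A quick way to check the dictionary took effect (made-up phrase): after insertWords(), segmentCN() should keep an inserted compound intact, i.e. return "服务不错" as one token rather than "服务" "不错"
segmentCN("服务不错")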
#6) Collect the filler and modal words in the raw reviews (to be deleted): the custom words above are condensed forms, whereas a raw review may read 房间很干净, 服务还是蛮不错的, 酒店真心不错, and so on, so the words that interfere with this segmentation (还是, 蛮, 真心, 的, etc.) must be stripped first.
pattern <- c('还是','很也','了','点','可以','还','是','真心','都','相当','大家','确实','挺','非常','应该','蛮','整体','里面','就','实在','总体','听说','有点','比较','质量','都是','够','十分','还算','极其','也算','方面','太','算是')
#Combine the words above into a "regular expression"
pattern2 <- paste("[", paste(pattern, collapse = ","), "]", sep="")
#Note the difference between the three calls below. Also note that the square brackets form a regex character class: it matches any single character listed inside (commas included), not the whole words. The gsub() below therefore strips every individual character occurring in any of these filler words, which is why, for example, 方便 later appears as 便 (方 occurs in 方面).
pattern2
## [1] "[还是,很也,了,点,可以,还,是,真心,都,相当,大家,确实,挺,非常,应该,蛮,整体,里面,就,实在,总体,听说,有点,比较,质量,都是,够,十分,还算,极其,也算,方面,太,算是]"
paste(pattern, collapse = ",")
## [1] "还是,很也,了,点,可以,还,是,真心,都,相当,大家,确实,挺,非常,应该,蛮,整体,里面,就,实在,总体,听说,有点,比较,质量,都是,够,十分,还算,极其,也算,方面,太,算是"
paste(pattern, sep=",")
## [1] "还是" "很也" "了" "点" "可以" "还" "是" "真心" "都" "相当"
## [11] "大家" "确实" "挺" "非常" "应该" "蛮" "整体" "里面" "就" "实在"
## [21] "总体" "听说" "有点" "比较" "质量" "都是" "够" "十分" "还算" "极其"
## [31] "也算" "方面" "太" "算是"
#7) Strip these filler and modal characters from the raw reviews: every character of pattern2 that occurs in text is replaced with ''
text2 <- gsub(pattern = pattern2, replacement = "", x=text)
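#A small check of the character-class behavior on a made-up phrase: 还, 是, and 蛮 are single characters inside the class and vanish, while 不错 and 的 survive, so this should return "服务不错的"
gsub(pattern = pattern2, replacement = "", x = "服务还是蛮不错的")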
#Summary: after this cleaning the raw reviews are fairly concise and clean, so we segment them again. Remember the custom dictionary built earlier: it will produce the intended compounds, e.g. the two words 酒店 and 不错 combined into 酒店不错.
5. Re-segmenting the cleaned reviews
#1) Segment again
segword3 <- segmentCN(text2)
head(segword3, 3)
## [[1]]
## [1] "超级" "便" "地铁" "浦东" "道" "站"
## [7] "出口" "出来" "对" "不用" "走" "多路"
## [13] "穿" "小" "的" "马路" "只" "周边"
## [19] "最近" "施工" "看起来" "简陋" "酒店不错" "的"
## [25] "床" "舒服" "个" "安静" "私" "密"
## [31] "早餐" "自助餐" "稍微" "简单" "没什么" "特别"
## [37] "的" "寿" "司" "包饭" "不" "好吃"
## [43] "鸡蛋" "饼" "赞" "儿童" "半价" "八九"
## [49] "略" "贵" "些" "这个" "价格" "一般"
## [55] "享用" "晚餐" "毕竟" "早饭" "单调" "的"
## [61] "感觉" "不" "值" "虽然" "双" "早"
## [67] "不能" "撇下" "一个" "出去" "吃"
##
## [[2]]
## [1] "先" "好" "的" "酒店" "号" "地铁" "旁边"
## [8] "便" "酒店" "设施" "新" "房间" "拉开" "窗帘"
## [15] "看见" "黄浦江" "晚上" "开门" "回来" "空调" "已经"
## [22] "预先" "打开" "一" "开门" "凉快" "这" "不错"
## [29] "再" "不足" "的" "堂" "服务员" "不热情" "没"
## [36] "笑脸" "不" "没" "清洁" "阿姨" "的" "笑容"
## [43] "多" "前台" "办理" "入住" "和" "退" "房"
## [50] "速度" "慢" "感觉" "业务" "生疏" "我" "办"
## [57] "好" "入住" "拿" "房" "卡" "居然" "开"
## [64] "不" "房" "清洁" "阿姨" "帮忙" "开" "的"
## [71] "后来" "出去" "到" "前台" "重新" "弄" "下"
## [78] "等" "晚上" "回来" "居然" "打" "不" "开"
## [85] "只好" "下" "楼" "又" "重新" "刷" "下"
## [92] "才" "打开" "门" "希望" "下次" "所" "改进"
##
## [[3]]
## [1] "酒店" "号" "线" "出来" "交通" "便" "离"
## [8] "陆" "嘴" "八佰伴" "近" "最近" "修路" "乱"
## [15] "楼层" "高" "靠" "黄埔" "江" "边" "不错"
## [22] "房间" "装修" "新" "的" "打扫" "干净" "附近"
## [29] "没什么" "吃饭" "的" "下次" "再" "来"
#2) Define a new batch of stop words, chosen for this business context
stopwords_v2 <- c('不错','酒店','交通','前台','出差','价','去','免费','入','入住','大道','吃','退','上海','说','床','态度','升级','地理','很好','号','住','服务员','房间','服务','设施','环境','位置')
#3) Append the new stop words to the original stop-word list
mystopwords <- c(mystopwords, stopwords_v2)
#4) Remove the stop words:
segword4 <- sapply(segword3, removewords, mystopwords)
#Inspect the segmentation after removal
length(segword4)
## [1] 1500
segword4[[1]]
## [1] "地铁" "浦东" "出口" "走" "多路" "穿"
## [7] "马路" "施工" "简陋" "酒店不错" "舒服" "安静"
## [13] "密" "早餐" "自助餐" "简单" "包饭" "好吃"
## [19] "鸡蛋" "饼" "儿童" "半价" "八九" "享用"
## [25] "晚餐" "早饭" "单调" "感觉" "撇下"
6. Drawing the word cloud again from the new segmentation
#1) Get the word frequencies
word_freq2 <- getWordFreq(unlist(segword4))
#2) Draw the word cloud
opar <- par(no.readonly = T)
par(bg="white")
wordcloud(words=word_freq2$Word, freq = word_freq2$Freq, scale = c(4, 0.1), max.words = 50, random.color = T, colors = rainbow(7))
par(opar)
#From the plot above: some words, such as 早餐 and 房, still distort the true picture and need to be added to the stop words as well, because they have already been folded into other compound words.
#3) Define another round of stop words and remove them
stopwords_v3 <- c('早餐','嘴','电话','订','楼','人员','钟','修','办理','客人','品种','朋友','带','出门','房','影响','硬件','感觉','想','验','洁','希望','送')
segword5 <- sapply(segword4, removewords, stopwords_v3)
#Inspect the segmentation
segword5[[1]]
## [1] "地铁" "浦东" "出口" "走" "多路" "穿"
## [7] "马路" "施工" "简陋" "酒店不错" "舒服" "安静"
## [13] "密" "自助餐" "简单" "包饭" "好吃" "鸡蛋"
## [19] "饼" "儿童" "半价" "八九" "享用" "晚餐"
## [25] "早饭" "单调" "撇下"
#4) With this round of stop words removed, draw the word cloud once more
#Recompute the word frequencies
word_freq3 <- getWordFreq(unlist(segword5))
#Draw the word cloud
opar <- par(no.readonly = T)
par(bg="white")
wordcloud(words = word_freq3$Word, freq = word_freq3$Freq, scale=c(4, 0.1), max.words = 50, random.color = T, colors = rainbow(7))
par(opar)
#From the plot above: the cloud contains words with the same meaning, such as 推荐 and 值得推荐; these should be merged into a single word
7. Merging synonymous words
#1) Merge synonyms: for example, merge 推荐 into 值得推荐
segword6 <- unlist(segword5)
segword6[segword6=="推荐"] <- "值得推荐"
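#If more synonym pairs accumulate, a named lookup vector scales better than one assignment per pair; a minimal sketch (the second pair is a hypothetical example):
synonym_map <- c("推荐" = "值得推荐", "推荐入住" = "值得推荐")
hits <- segword6 %in% names(synonym_map)
segword6[hits] <- synonym_map[segword6[hits]]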
#2) Redraw the word cloud
#Get the word frequencies
word_freq4 <- getWordFreq(unlist(segword6))
#Draw the word cloud
opar <- par(no.readonly = T)
par(bg="white")
wordcloud(words = word_freq4$Word, freq = word_freq4$Freq, scale = c(4, 0.1), max.words = 50, random.color = T, colors = rainbow(7))
par(opar)
#Note: the word clouds above are drawn with R's wordcloud package; the online tool Tagxedo can also be used, after feeding it the txt file exported below
8. Saving the text-mining results to disk
#Write the 50 most frequent words and their counts to a space-separated text file (an empty sep would fuse the two columns together)
write.table(head(word_freq4, 50), "F:\\R\\Rworkspace\\文本挖掘\\携程网评论数据分析/word_freq.txt", row.names=F, sep=" ", quote=F)
R Markdown script: http://pan.baidu.com/s/1jHk1Jj4
Reference: text-mining notes by Liu Shunxiang (刘顺祥)