1 Text Mining Overview
Text mining is the extraction of implicit, previously unknown, and potentially useful information from large amounts of text data.
Text mining supports tasks such as:
•Associate: association analysis, deriving association rules from co-occurrence frequencies
•Cluster: grouping similar documents (or terms) together
•Categorize: assigning documents to predefined categories
•Summarize: extracting keywords and short, coherent descriptions that accurately reflect the central content of a document
Applications of text mining:
•Intelligent information retrieval: handling synonyms, abbreviations, variant forms, and homophones; removing redundant characters
•Network content security: content monitoring, content filtering
•Content management: automatic classification, topic detection and tracking
•Market monitoring: reputation monitoring, competitive-intelligence systems, market analysis
2 Text Mining Workflow
A text-mining workflow starts from a corpus of documents to analyze (a text corpus), e.g. reports, letters, publications. From this corpus a semi-structured text database is built, and from that a structured term-document matrix containing term frequencies is generated.
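The corpus-to-matrix steps above can be sketched in a few lines with the tm package (a minimal sketch; the two sample sentences are invented stand-ins for real documents):

```r
library(tm)

# Toy stand-ins for reports, letters, publications, etc.
docs <- c("Text mining extracts useful information from text data.",
          "Clustering groups similar documents together.")

corpus <- Corpus(VectorSource(docs))  # the (in-memory) text corpus
dtm <- DocumentTermMatrix(corpus)     # structured term-document matrix
dim(dtm)                              # 2 documents x number of distinct terms
```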
Required packages:
The tm package provides a comprehensive text-mining framework for R. Load it before any processing; the vignette command shows its documentation: library(tm); vignette("tm")
Many tm functions depend on other packages: download rJava, Snowball, zoo, XML, slam, Rz, RWeka and matlab as well, unpack them into the default library, and install the Java runtime environment from Oracle's (formerly Sun's) website.
After installing the Java environment, run the following R code (note the doubled backslashes, which R string literals require):
Sys.setenv(JAVA_HOME='C:\\Program Files (x86)\\Java\\jdk1.7.0_15\\jre')
Load the following packages:
library(rJava)        # Java environment
library(Rwordseg)     # Chinese word segmentation
library(tm)           # the tm package
library(Rcpp)         # wordcloud dependency
library(RColorBrewer) # wordcloud dependency
library(wordcloud)    # word clouds
3 Example in R
1、Reading text (corpus construction)
Text Corpus: a corpus represents a collection of documents. There are two kinds:
•volatile corpus (Volatile Corpus, kept in memory as an R object)
•permanent corpus (Permanent Corpus, stored outside R)
Volatile corpus: Corpus(x, readerControl = list(reader = , language = ))
Permanent corpus: PCorpus(x, readerControl = list(reader = , language = ), dbControl = list(dbName = "", dbType = "DB1"))
x: the data source; related tm functions:
• DirSource: reads a directory, DirSource()
• VectorSource: a vector of documents
• DataframeSource: a data frame, e.g. read from a CSV file
readerControl: controls how the source's documents are read
•reader: the reader matching the source's text type; readDOC, readPDF, readPlain, readReut21578XML, etc.; getReaders() lists all available readers
• language: the language/encoding (e.g. "en", "UTF-8")
dbControl: the third argument of a permanent corpus, declaring the out-of-memory storage (database name and type)
Example: dir<-"C:\\test"  # path to the text documents
txt<-Corpus(DirSource(dir)) # read every document under the path into a corpus
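A self-contained variant of this example (a sketch assuming only the tm package): it writes two throwaway files into a temporary directory instead of reading from C:\\test.

```r
library(tm)

# Create a temporary directory holding two plain-text documents
dir <- file.path(tempdir(), "tmtest")
dir.create(dir, showWarnings = FALSE)
writeLines("This is the first document.", file.path(dir, "doc1.txt"))
writeLines("And this is the second one.", file.path(dir, "doc2.txt"))

txt <- Corpus(DirSource(dir))  # one corpus element per file
length(txt)                    # 2
```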
2、Preprocessing
#mainly via the tm_map function
txt=tm_map(txt,as.PlainTextDocument)
txt=tm_map(txt,stripWhitespace)                   # collapse whitespace
txt=tm_map(txt,tolower)                           # convert to lower case
txt=tm_map(txt, removeWords, stopwords("english")) # remove stopwords
getTransformations() lists all available transformations:
"as.PlainTextDocument" (strip markup), "removeNumbers" (remove digits), "removePunctuation" (remove punctuation), "removeWords" (remove stopwords), "stemDocument" (stemming), "stripWhitespace" (collapse whitespace)
<wbr><span style="word-wrap:normal; word-break:normal; color:rgb(0,176,80)">##设置名字识别</span></wbr>
segment.options(isNameRecognition = TRUE)
<wbr><span style="word-wrap:normal; word-break:normal; color:rgb(0,176,80)">##中文分词</span></wbr>
dm<-segmentCN(as.character(txt))
dtm<-Corpus(VectorSource(dm)) #重新生成语料库
3、Building the term-frequency matrix
Building the matrix: dtm<-DocumentTermMatrix(txt, control=list(dictionary=cnword, removePunctuation = TRUE, stopwords=TRUE, wordLengths = c(1, Inf)))
removePunctuation: whether punctuation is removed; defaults to FALSE
removeNumbers: whether digits are removed; defaults to FALSE
dictionary: the terms to count; if unset, all terms found in the corpus are counted by default
wordLengths: c(min, max); terms whose length falls outside this range are discarded
Removing sparse terms: removeSparseTerms(dtm, sparse=0.9) drops terms absent from more than 90% of documents
Conversion to a data frame: df_dtm2<-as.data.frame(as.matrix(dtm2)) (use as.matrix; in current tm, inspect() only prints the matrix)
Example:
dtm2<-TermDocumentMatrix(dtm, control=list(wordLengths = c(1,5))) # keep only terms between 1 and 5 characters long
dtm3<-removeSparseTerms(dtm2, sparse=0.5)
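The matrix steps, end to end, on an invented three-document corpus (a sketch; the fruit words are placeholders for real terms):

```r
library(tm)

corpus <- Corpus(VectorSource(c("apple banana apple",
                                "banana cherry",
                                "apple cherry cherry")))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- removeSparseTerms(dtm, sparse = 0.9)  # drop terms absent from >90% of docs
df <- as.data.frame(as.matrix(dtm2))          # plain data frame of term counts
df$apple                                      # per-document counts of "apple"
```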
4、Text analysis
##filtering
query="id=='237' & heading == 'INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE'" # the query condition
#keep only the documents that satisfy query
tm_filter(txt, FUN=sFilter, query) # A corpus with 1 text document
##frequency analysis
findFreqTerms(x, lowfreq = 0, highfreq = Inf)
x: a TermDocumentMatrix or DocumentTermMatrix
lowfreq: lower bound on term frequency
highfreq: upper bound on term frequency
Example: findFreqTerms(dtm3, 5) operates on a TermDocumentMatrix or DocumentTermMatrix and returns the terms with an overall frequency of at least 5
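For instance, on an invented mini-corpus where "data" occurs four times in total and every other term fewer:

```r
library(tm)

corpus <- Corpus(VectorSource(c("data data data mining",
                                "data mining text")))
dtm <- DocumentTermMatrix(corpus)
findFreqTerms(dtm, lowfreq = 4)  # only "data" reaches an overall frequency of 4
```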
##association analysis
findAssocs(x, terms, corlimit)
x: a TermDocumentMatrix or DocumentTermMatrix
terms: a character vector of terms
corlimit: the lower correlation limit for each term, a value between 0 and 1
Example: findAssocs(dtm4,"犯规",0.5) # terms correlated with "犯规" ("foul") at 0.5 or above
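A self-contained sketch with invented English documents, standing in for the 犯规/"foul" example above:

```r
library(tm)

corpus <- Corpus(VectorSource(c("foul play penalty",
                                "foul penalty referee",
                                "clean game referee")))
dtm <- DocumentTermMatrix(corpus)
findAssocs(dtm, "foul", 0.5)  # terms whose correlation with "foul" is >= 0.5
```

Here "penalty" appears in exactly the same documents as "foul", so its correlation is 1; terms that never co-occur with "foul" are excluded.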
##word-cloud analysis
wordcloud(words, freq, scale = c(6, 1.5), min.freq = 2, max.words = 1000, random.order = TRUE, random.color = FALSE, rot.per = 0.1, colors = rainbow(100), ordered.colors = FALSE)
words: the terms to plot
freq: their corresponding frequencies
scale: the font-size range, c(largest, smallest)
min.freq: terms below this frequency are not plotted
max.words: maximum number of terms shown in the cloud
random.order: placement order; TRUE: random; FALSE: decreasing frequency from the center outward, so the most frequent terms sit in the middle
random.color: TRUE: colors assigned at random; FALSE: colors assigned by frequency
rot.per: the fraction of terms drawn with 90-degree rotation (a value between 0 and 1; 0 keeps everything horizontal)
colors: the list of colors to use
ordered.colors: TRUE: colors are used in the given order, matched one-to-one to the terms; FALSE: colors assigned freely
Example:
dtm4<-as.matrix(dtm3)
v <- sort(rowSums(dtm4),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
wordcloud(d$word, d$freq, scale = c(6, 1.5), min.freq = 2, max.words = 1000, colors = rainbow(100))