For mining massive amounts of text with unknown content, topic analysis is a common technique. In a topic model, a topic represents a concept or an aspect, expressed as a set of related words together with the conditional probabilities of those words. Intuitively, a topic is a bucket holding words that appear with high probability and are strongly related to that topic.
Today's post covers the LDA (Latent Dirichlet Allocation) topic model.
The basic assumption behind LDA is that each document in a collection consists of a mixture of collection-wide topics. However, in reality we observe only documents and words, not topics – the latter are part of the hidden (or latent) structure of the documents. The aim is to infer the latent topic structure given the words and documents. LDA does this by recreating the documents in the corpus, iteratively adjusting the relative importance of topics in documents and of words in topics.
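As a concrete sketch of this inference in practice, here is a minimal fit on a four-document toy corpus. This assumes the `topicmodels` package (a standard R implementation of LDA, not mentioned in the original); the corpus contents and variable names are made up for illustration:

```r
library(tm)
library(topicmodels)

# Toy corpus with two loose themes: fruit vs. machine learning
docs <- c("apple banana fruit juice apple",
          "banana fruit apple smoothie",
          "model training data neural network",
          "data model network training loss")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Fit LDA with k = 2 topics via Gibbs sampling; fixed seed for repeatability
lda_fit <- LDA(dtm, k = 2, method = "Gibbs",
               control = list(seed = 42, iter = 500))

# Top 3 words per topic, and the most likely topic for each document
terms(lda_fit, 3)
topics(lda_fit)
```

`terms()` surfaces the word distribution of each topic and `topics()` the dominant topic per document, which is exactly the latent structure described above.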
The key intuition to take away is that LDA is a process for assigning topics to a large pile of documents. Today we focus on hands-on practice.
Data Preprocessing
I have written several text-mining posts already, and the preprocessing is largely the same each time, so this post only shows the code; see the earlier posts for details:
library(tm)
files <- readLines('C:/Users/hrd/Desktop/bootcamp/dataset
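After loading the raw files, the usual `tm` cleanup chain (lowercasing, removing punctuation, numbers, and stopwords, then squeezing whitespace) precedes building a document-term matrix. The sketch below uses a made-up English `docs` vector and English stopwords purely for illustration; a Chinese corpus like the one above would first need word segmentation (e.g. with a package such as jiebaR) before these steps apply:

```r
library(tm)

# Placeholder documents standing in for the loaded files
docs <- c("Some RAW text, with Punctuation!",
          "Another 2nd document for the demo.")
corpus <- VCorpus(VectorSource(docs))

# Standard cleanup chain before building a document-term matrix
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# The resulting matrix is the input LDA expects
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```

Each `tm_map` call applies one transformation across the whole corpus, so the chain is easy to extend with corpus-specific steps (custom stopword lists, stemming, and so on).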