For mining massive amounts of text with unknown content, topic analysis is a common technique. In a topic model, a topic represents a concept or an aspect, expressed as a set of related words together with the conditional probabilities of those words. Intuitively, a topic is a bucket holding words that appear with high probability and are strongly related to that topic.
Today's post covers the LDA (Latent Dirichlet Allocation) topic model.
The basic assumption behind LDA is that each document in a collection consists of a mixture of collection-wide topics. However, in reality we observe only documents and words, not topics – the latter are part of the hidden (or latent) structure of the documents. The aim is to infer the latent topic structure given the words and documents. LDA does this by recreating the documents in the corpus, iteratively adjusting the relative importance of topics in documents and of words in topics.
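As a concrete sketch of this inference in practice, here is a minimal fit on a four-document toy corpus. This assumes the `topicmodels` package (a standard R implementation of LDA, not mentioned in the original); the corpus contents and variable names are made up for illustration:

```r
library(tm)
library(topicmodels)

# Toy corpus with two loose themes: fruit vs. machine learning
docs <- c("apple banana fruit juice apple",
          "banana fruit apple smoothie",
          "model training data neural network",
          "data model network training loss")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Fit LDA with k = 2 topics via Gibbs sampling; fixed seed for repeatability
lda_fit <- LDA(dtm, k = 2, method = "Gibbs",
               control = list(seed = 42, iter = 500))

# Top 3 words per topic, and the most likely topic for each document
terms(lda_fit, 3)
topics(lda_fit)
```

`terms()` surfaces the word distribution of each topic and `topics()` the dominant topic per document, which is exactly the latent structure described above.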
The key intuition to take away is that LDA is a process for assigning topics to a large pile of documents. Today we focus on hands-on practice.
Data Preprocessing
I have written several text-mining posts already, and the preprocessing is largely the same each time, so this post only shows the code; see the earlier posts for details:
library(tm)
files <- readLines('C:/Users/hrd/Desktop/bootcamp/dataset
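After loading the raw files, the usual `tm` cleanup chain (lowercasing, removing punctuation, numbers, and stopwords, then squeezing whitespace) precedes building a document-term matrix. The sketch below uses a made-up English `docs` vector and English stopwords purely for illustration; a Chinese corpus like the one above would first need word segmentation (e.g. with a package such as jiebaR) before these steps apply:

```r
library(tm)

# Placeholder documents standing in for the loaded files
docs <- c("Some RAW text, with Punctuation!",
          "Another 2nd document for the demo.")
corpus <- VCorpus(VectorSource(docs))

# Standard cleanup chain before building a document-term matrix
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# The resulting matrix is the input LDA expects
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```

Each `tm_map` call applies one transformation across the whole corpus, so the chain is easy to extend with corpus-specific steps (custom stopword lists, stemming, and so on).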