Clustering Four Novels with LDA in R

R语言中文社区 (R Language Chinese Community)

Article link: https://blog.csdn.net/kMD8d5R/article/details/85888642

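This article shows how to use LDA in R to build a topic model of four novels. It preprocesses the text, draws word clouds, and runs a cluster analysis to demonstrate LDA applied to text classification. By clustering the novels' unlabeled chapters, it checks how well the clusters correspond to the original novels and analyzes possible misclassifications.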

Author: 汪喵行, columnist for R语言中文社区 (R Language Chinese Community)

Zhihu ID: https://www.zhihu.com/people/yhannahwang

Preface

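In text mining, besides sentiment analysis, another important subject is topic modeling. When documents need to be sorted into categories, topic modeling can be far more efficient than classifying them by hand. The most commonly used topic modeling method is LDA (Latent Dirichlet Allocation). Put simply, the method rests on two ideas: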

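1. Treat every document as a mixture of topics. For example, in a two-topic model we might say that document 1 is 90% topic 1 and 10% topic 2;

2. Treat every topic as a mixture of words (a bag of words). For example, the topic "politics" would contain words such as "government" and "congress", while the topic "entertainment" might contain words such as "movie".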
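The two ideas above can be made concrete with a small toy calculation. This is only a sketch: the matrices `doc_topic` and `topic_word` and all the numbers in them are made up for illustration, not estimated by LDA. It shows how per-document topic proportions and per-topic word probabilities combine into a word distribution for each document.

```r
# Toy illustration only: every number below is invented, not fitted by LDA.
# Document-topic proportions: each row is a document and sums to 1.
doc_topic <- matrix(
  c(0.9, 0.1,
    0.2, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("politics", "entertainment"))
)

# Topic-word probabilities: each row is a topic and sums to 1.
topic_word <- matrix(
  c(0.5, 0.4, 0.1,
    0.1, 0.1, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("politics", "entertainment"),
                  c("government", "congress", "movie"))
)

# P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)
doc_word <- doc_topic %*% topic_word
print(doc_word)
```

Fitting LDA goes in the opposite direction: it only observes word counts per document and estimates both matrices from them.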

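We pick four novels: "Twenty Thousand Leagues under the Sea", "The War of the Worlds", "Pride and Prejudice", and "Great Expectations". We shuffle all of their chapters together and use those chapters to form 4 topics. If the clustering works well, the 4 topics should correspond to the four novels. The steps are: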

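1. Split the four novels into chapters, shuffle them, and strip the book names (so that we are left with a jumble of unlabeled chapters), then preprocess the text

2. Draw a word cloud for the four novels

3. Use the chapters to cluster 4 topics

4. Assign every chapter, together with its novel's name, to one of the 4 topics and see how well the clustering did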
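As a rough preview of steps 3 and 4, here is a compressed sketch on toy data. Everything in `toy_chapters` is made up (hypothetical document names and text, two "books" instead of four), and the seed value is arbitrary; it assumes the dplyr, tidytext and topicmodels packages are installed. It is meant only to show the shape of the workflow, not to reproduce the article's results.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy "chapters" (invented text, not taken from the novels).
toy_chapters <- tibble(
  document = c("sea_1", "sea_2", "mars_1", "mars_2"),
  text = c(
    "captain nautilus sea ocean submarine sea",
    "ocean captain sea nautilus water",
    "martians heat ray cylinder martians",
    "cylinder martians ray london heat"
  )
)

# One row per (document, word) with its count.
word_counts <- toy_chapters %>%
  unnest_tokens(word, text) %>%
  count(document, word, sort = TRUE)

# Cast to a document-term matrix, the input format LDA() expects.
chapters_dtm <- word_counts %>%
  cast_dtm(document, word, n)

# Fit a 2-topic model (k would be 4 for the four novels).
toy_lda <- LDA(chapters_dtm, k = 2, control = list(seed = 1234))

# gamma: estimated proportion of each topic within each document.
chapter_topics <- tidy(toy_lda, matrix = "gamma")
print(chapter_topics)
```

Because the document names here still carry the book label, the gamma table can be compared against those labels to see which topic each book's chapters fall into.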

Packages needed: gutenbergr / topicmodels / stringr / dplyr / wordcloud2 / ggplot2 / tidytext / tidyr

Load them all first:

library(gutenbergr)    # for loading books
library(topicmodels)   # for modeling topics
library(stringr)       # deal with string
library(dplyr)         # do operations on table or dataframe (can do multiple operations using "%>%")
library(wordcloud2)    # draw word cloud
library(ggplot2)       # draw pictures 
library(tidytext)      # tidying model objects, extract topic-related probabilities
library(tidyr)         # reshaping data frames, e.g. unite()

1. Load the four books and split them into chapters
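The splitting code in this section numbers chapters with `cumsum(str_detect(...))`. A minimal sketch on made-up text may make that trick easier to follow; the `toy_book` tibble and its lines are hypothetical and stand in for a downloaded book.

```r
library(dplyr)
library(stringr)

# Hypothetical toy text (invented lines, not from the novels).
toy_book <- tibble(
  title = "Toy Book",
  text = c("A Toy Book", "by Nobody",
           "CHAPTER I", "It was a dark night.",
           "Chapter II", "The sun rose.", "Nothing else happened.")
)

toy_chapters <- toy_book %>%
  group_by(title) %>%
  # str_detect() is TRUE on chapter headings; cumsum() turns that into a
  # running chapter number shared by every line up to the next heading.
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  # Lines before the first heading have chapter == 0 (front matter).
  filter(chapter > 0)

print(toy_chapters)
```

Note that the pattern requires a space after "chapter", so a heading written in another style would not be counted by this pattern.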

# load the four books
titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
            "Pride and Prejudice", "Great Expectations")
books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")

# split into chapters: the running count of lines starting with "chapter "
# becomes the chapter number; lines before the first heading are dropped
chapters <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)   # document = "<title>_<chapter>"

Output: