In this article, we will improve on the model from the previous article by using the Mallet version of the LDA algorithm, and then we will focus on how to find the optimal number of topics for any given large text corpus.

16. Building the LDA Mallet Model

So far you have seen Gensim's built-in version of the LDA algorithm. Mallet's version, however, often gives higher-quality topics.

Gensim provides a wrapper, gensim.models.wrappers.LdaMallet, to run Mallet's LDA from within Gensim. You only need to download the zip file, unzip it, and provide the path to the mallet binary inside the unzipped directory. See how I do this below.
# Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = 'path/to/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
[(13,
[('god', 0.022175351915726671),
('christian', 0.017560827817656381),
('people', 0.0088794630371958616),
('bible', 0.008215251235200895),
('word', 0.0077491376899412696),
('church', 0.0074112053696280414),
('religion', 0.0071198844038407759),
('man', 0.0067936049221590383),
('faith', 0.0067469935676330757),
('love', 0.0064556726018458093)]),
(1,
[('organization', 0.10977647987951586),
('line', 0.10182379194445974),
('write', 0.097397469098389255),
('article', 0.082483883409554246),
('nntp_post', 0.079894209047330425),
('host', 0.069737542931658306),
('university', 0.066303010266865026),
('reply', 0.02255404338163719),
('distribution_world', 0.014362591143681011),
('usa', 0.010928058478887726)]),
(8,
[('file', 0.02816690014008405),
('line', 0.021396171035954908),
('problem', 0.013508104862917751),
('program', 0.013157894736842105),
('read', 0.012607564538723234),
('follow', 0.01110666399839904),
('number', 0.011056633980388232),
('set', 0.010522980454939631),
('error', 0.0101
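As the output above shows, show_topics(formatted=False) returns a list of (topic_id, [(word, weight), ...]) pairs. A minimal pure-Python sketch of how to pull the top words out of such a structure (the top_words helper is my own illustration, not part of Gensim, and the sample data is a tiny excerpt of the output above):

```python
def top_words(topics, n=3):
    """Return {topic_id: [word, ...]} for the n highest-weighted words.

    `topics` mirrors the structure returned by show_topics(formatted=False):
    a list of (topic_id, [(word, weight), ...]) pairs.
    """
    return {tid: [word for word, _ in sorted(pairs, key=lambda p: -p[1])[:n]]
            for tid, pairs in topics}

# A tiny excerpt of the topics shown above, used as sample data
sample = [(13, [('god', 0.0222), ('christian', 0.0176), ('people', 0.0089)]),
          (1, [('organization', 0.1098), ('line', 0.1018), ('write', 0.0974)])]

print(top_words(sample, n=2))
# → {13: ['god', 'christian'], 1: ['organization', 'line']}
```

Inspecting the top words like this is a quick sanity check on topic quality before comparing models by coherence score.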