In this article, we will improve on the model from the previous article by using the Mallet version of the LDA algorithm, and then we will focus on how to find the optimal number of topics for any given large text corpus.

16. Building the LDA Mallet Model

So far you have seen Gensim's built-in version of the LDA algorithm. Mallet's version, however, often gives higher-quality topics.

Gensim provides a wrapper, gensim.models.wrappers.LdaMallet, to run Mallet's LDA from within Gensim. You only need to download the zip file, unzip it, and provide the path to the mallet binary inside the unzipped directory. See how I do this below.
# Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = 'path/to/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
[(13,
[('god', 0.022175351915726671),
('christian', 0.017560827817656381),
('people', 0.0088794630371958616),
('bible', 0.008215251235200895),
('word', 0.0077491376899412696),
('church', 0.0074112053696280414),
('religion', 0.0071198844038407759),
('man', 0.0067936049221590383),
('faith', 0.0067469935676330757),
('love', 0.0064556726018458093)]),
(1,
[('organization', 0.10977647987951586),
('line', 0.10182379194445974),
('write', 0.097397469098389255),
('article', 0.082483883409554246),
('nntp_post', 0.079894209047330425),
('host', 0.069737542931658306),
('university', 0.066303010266865026),
('reply', 0.02255404338163719),
('distribution_world', 0.014362591143681011),
('usa', 0.010928058478887726)]),
(8,
[('file', 0.02816690014008405),
('line', 0.021396171035954908),
('problem', 0.013508104862917751),
('program', 0.013157894736842105),
('read', 0.012607564538723234),
('follow', 0.01110666399839904),
('number', 0.011056633980388232),
('set', 0.010522980454939631),
('error', 0.0101
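As the output above shows, show_topics(formatted=False) returns a list of (topic_id, [(word, weight), ...]) pairs. A minimal pure-Python sketch of how to pull the top words out of such a structure (the top_words helper is my own illustration, not part of Gensim, and the sample data is a tiny excerpt of the output above):

```python
def top_words(topics, n=3):
    """Return {topic_id: [word, ...]} for the n highest-weighted words.

    `topics` mirrors the structure returned by show_topics(formatted=False):
    a list of (topic_id, [(word, weight), ...]) pairs.
    """
    return {tid: [word for word, _ in sorted(pairs, key=lambda p: -p[1])[:n]]
            for tid, pairs in topics}

# A tiny excerpt of the topics shown above, used as sample data
sample = [(13, [('god', 0.0222), ('christian', 0.0176), ('people', 0.0089)]),
          (1, [('organization', 0.1098), ('line', 0.1018), ('write', 0.0974)])]

print(top_words(sample, n=2))
# → {13: ['god', 'christian'], 1: ['organization', 'line']}
```

Inspecting the top words like this is a quick sanity check on topic quality before comparing models by coherence score.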