Extracting topic words with LDA (Latent Dirichlet Allocation) using gensim

Latest recommended article published 2024-07-18 19:41:34

limengmingx

Category: LDA. Tags: gensim, python, nlp, LDA, topic-word extraction

Article link: https://blog.csdn.net/limengmingx/article/details/82900781

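This post shows how to implement an LDA topic model in Python with the gensim, nltk and spacy libraries. It covers the text preprocessing steps (tokenization, lemmatization, stop-word removal) and the use of LDA to extract topic words from text. It also mentions errors related to spacy and the gensim dictionary that came up in practice, together with their fixes.

Summary generated automatically by CSDN.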

Latent Dirichlet Allocation (LDA) is currently among the most popular and widely used topic-modeling algorithms. LDA turns a collection of texts into a set of topics, each expressed as words weighted by probability. Note that LDA derives its topic words purely by statistical means: it is a bag-of-words method, so word order is ignored. For example:
Input:

Paragraph 1: "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. It is altogether fitting and proper that we should do this."
Paragraph 2: "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
Paragraph 3: "We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that nation might live."

Output:

(0, u'0.032*"conceive" + 0.032*"dedicate" + 0.032*"nation" + 0.032*"life"')
(1, u'0.059*"conceive" + 0.059*"score" + 0.059*"seven" + 0.059*"proposition"')
(2, u'0.103*"nation" + 0.071*"dedicate" + 0.071*"great" + 0.071*"field"')
(3, u'0.032*"conceive" + 0.032*"nation" + 0.032*"dedicate" + 0.032*"rest"')
(4, u'0.032*"conceive" + 0.032*"nation" + 0.032*"dedicate" + 0.032*"battle"')
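Each output line pairs a topic id with a string of weight*"word" terms, where the weight is that word's probability within the topic. As a minimal sketch for reading such a line programmatically, the string can be split into (word, weight) pairs with the standard library alone. The helper name parse_topic is hypothetical and is not part of gensim.

```python
import re

# Hypothetical helper (not a gensim function): matches terms such as 0.103*"nation".
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Split a topic string into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]

# Topic 2 from the output above.
topic = '0.103*"nation" + 0.071*"dedicate" + 0.071*"great" + 0.071*"field"'
pairs = parse_topic(topic)
print(pairs)
# [('nation', 0.103), ('dedicate', 0.071), ('great', 0.071), ('field', 0.071)]
```

The pairs keep the order of the original string, which lists the heaviest words first.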

This post briefly shows how to use the Python packages nltk, spacy and gensim to implement LDA, including the preprocessing pipeline.
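Before going into preprocessing, it may help to see what "bag of words" means concretely: only the count of each word in a document matters, not the order in which the words appear. The sketch below illustrates this with the standard library only. The function name bag_of_words and the two token lists are illustrative assumptions; gensim builds a comparable representation of (token id, count) pairs with Dictionary.doc2bow.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs; token order is discarded."""
    return Counter(tokens)

# Two illustrative token lists with the same words in a different order.
doc_a = ["nation", "dedicate", "great", "nation"]
doc_b = ["great", "nation", "nation", "dedicate"]

counts_a = bag_of_words(doc_a)
counts_b = bag_of_words(doc_b)

print(counts_a["nation"])    # 2
print(counts_a == counts_b)  # True
```

Because both lists reduce to the same counts, a bag-of-words model such as LDA cannot tell them apart.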

1. Preprocessing

1.1 Tokenization

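# Before first use, download the English model:
# python -m spacy download en_core_web_sm
import spacy
# Loading the model fails early if it has not been downloaded yet.
spacy.load('en_core_web_sm')
from spacy.lang.en import English
# A blank English pipeline is enough for tokenization.
parser = English()
# Clean the text and lowercase every token.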
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url