[python] LDA model workflow and code

Latest recommended article published on 2025-03-12 08:26:15

alwaysluc




Category: Small Experiments. Tags: python, natural language processing, programming languages

Article link: https://blog.csdn.net/alwaysluc/article/details/124673115


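This article walks through building a topic model with LDA in Python: data preprocessing, stop-word removal, building the LDA model, visualizing it with pyLDAvis, and choosing the best number of topics using perplexity and coherence scores.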


Data preprocessing

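Handle this step however you like, in Excel or in Python, as long as the text to be analyzed ends up stored as a csv or txt file. Note: each text must occupy exactly one line.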

For example, 感想.txt:

我喜欢吃汉堡

小明喜欢吃螺蛳粉

螺蛳粉外卖好贵

以上句子来源于吃完一个汉堡还想再点碗螺蛳粉，但外卖好贵从而选择放弃的我
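If the raw texts live in a csv file, the "one text per line" requirement can be met with a few lines of Python. The following is only a minimal sketch: the function name `csv_column_to_txt` and the idea that the texts sit in a single named column are assumptions for illustration, not something the original post specifies.

```python
import csv


def csv_column_to_txt(csv_path, txt_path, column):
    """Copy one column of a csv file into a txt file, one text per line.

    Returns the number of texts written. `column` is the header name of
    the column holding the texts (a hypothetical name chosen by the caller).
    """
    count = 0
    with open(csv_path, 'r', encoding='utf-8', newline='') as src, \
            open(txt_path, 'w', encoding='utf-8') as dst:
        for row in csv.DictReader(src):
            # Collapse line breaks and repeated whitespace inside a cell
            # so that each text stays on a single line.
            text = ' '.join((row.get(column) or '').split())
            if text:  # skip empty cells
                dst.write(text + '\n')
                count += 1
    return count


# Example call (file and column names are placeholders):
# csv_column_to_txt('感想.csv', '感想.txt', 'text')
```

Collapsing line breaks inside a cell matters because the later steps treat every line of the txt file as a separate document.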

Removing stop words

import re
import jieba as jb


def stopwordslist(filepath):
    # Load the stop-word list (one word per line) into a set for fast lookup
    with open(filepath, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f)


# Load the stop words once, instead of re-reading the file for every sentence
stopwords = stopwordslist('自己搜来的停用词表.txt')  # path to your own stop-word list


# Segment a sentence and drop stop words
def seg_sentence(sentence):
    sentence = re.sub(r'[0-9.]+', '', sentence)  # strip digits and dots
    # jb.add_word('词汇')  # add custom words here to supplement the jieba dictionary
    sentence_seged = jb.cut(sentence.strip())
    # Keep only words that are not stop words and are longer than one character
    words = [word for word in sentence_seged
             if word not in stopwords and len(word) > 1]
    return ' '.join(words)


with open('感想.txt', 'r', encoding='utf-8') as inputs, \
        open('感想分词.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        line_seg = seg_sentence(line)  # the return value is a string
        outputs.write(line_seg + '\n')

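This step produces 感想分词.txt. In the actual file the tokens on each line are separated by spaces and single-character tokens are filtered out, so the real output is shorter than this illustration: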

我喜欢吃汉堡

小明喜欢吃螺蛳粉

螺蛳粉外卖好贵

句子来源于吃完一个汉堡再点碗螺蛳粉外卖好贵选择放弃
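Before modeling, it can help to check which tokens dominate the segmented file, since very frequent but uninformative words are candidates for the stop-word list. The helper below is a rough sketch for that check; the name `top_tokens` is made up for illustration and is not part of the original workflow.

```python
from collections import Counter


def top_tokens(lines, n=10):
    """Return the n most frequent tokens in space-separated lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # tokens are separated by whitespace
    return counts.most_common(n)


# Example call on the segmented file:
# with open('感想分词.txt', 'r', encoding='utf-8') as f:
#     print(top_tokens(f, n=20))
```

Any token that shows up near the top without carrying topical meaning can be appended to the stop-word file, after which the segmentation step is rerun.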

Building the LDA model

Assume the number of topics is set to 4 (the num_topics parameter).

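import codecs
from gensim import corpora
from gensim.models import LdaModel
from gensim.corpora import Dictionary


# Read the segmented file: each line becomes one document (a list of tokens)
train = []

with codecs.open('感想分词.txt', 'r', encoding='utf8') as fp:
    for line in fp:
        tokens = line.split()
        if tokens:  # skip blank lines, which would otherwise become empty documents
            train.append(tokens)
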
dictionary = corpora.Dictiona
