Implementing LDA-Based Topic Analysis and Sentiment Analysis of Text in Python

Original article · Last modified 2025-06-15 15:30:33

License: CC 4.0 BY-SA

Tags:

#information-visualization #deep-learning #natural-language-processing #data-analysis

First published 2025-06-15 07:00:00

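Project Background

The volume of text data keeps growing rapidly. Extracting useful information from large text collections, discovering the latent topics in them, and analyzing their sentiment polarity has become an important research problem. This project is a complete text analysis system written in Python. It combines an LDA topic model with sentiment analysis so that users can quickly run topic mining and sentiment analysis on their own text.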

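System Architecture

The system is modular and consists of the following core modules:

Data preprocessing module

Text cleaning
Chinese word segmentation
Stopword filtering

Topic analysis module

LDA topic model
Topic keyword extraction
Topic visualization

Sentiment analysis module

SnowNLP-based sentiment analysis
Per-topic sentiment score calculation

Visualization module

Word cloud generation
Topic distribution visualization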

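Features

Text preprocessing

Handles Chinese text
Automatically removes punctuation and special characters
Supports a custom stopword list
Uses jieba for Chinese word segmentation

Topic analysis

LDA-based topic modeling
Configurable number of topics
Extracts the keywords of each topic
Computes the topic distribution

Sentiment analysis

SnowNLP-based sentiment analysis
Sentence-level sentiment scoring
Computes a sentiment score per topic
Visualizes the results

Data visualization

Generates word clouds
Visualizes the topic distribution
Displays sentiment scores
Exports results to Excel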

Technical Implementation

Development Environment

# Core dependencies
pandas==1.3.0
jieba==0.42.1
wordcloud==1.8.1
matplotlib==3.4.3
scikit-learn==0.24.2
pyLDAvis==2.1.2
snownlp==0.12.3

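Core Code

Text Preprocessing

import jieba

def pre_process(texts, stopwords):
    processed_texts = []
    for text in texts:
        # Clean the text
        cleaned = clean_text(text)
        # Segment with jieba
        word_list = list(jieba.cut(cleaned))
        # Remove stopwords
        filtered_words = remove_stopwords(word_list, stopwords)
        # Keep one space-joined string per document so that CountVectorizer
        # produces one row per input text (extending a flat token list would
        # turn every token into its own document).
        processed_texts.append(' '.join(filtered_words))
    return processed_texts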
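pre_process calls clean_text and remove_stopwords, and the run example later calls load_stopwords, but none of these helpers appear in this article. The following is a minimal sketch of what they might look like, not the project's actual code. The character whitelist (CJK ideographs, ASCII letters and digits), the one-word-per-line stopword file format, and the choice to drop whitespace-only tokens are all assumptions.

```python
import re

# Assumption: keep CJK ideographs, ASCII letters and digits;
# every other run of characters is replaced by a single space.
_NON_TEXT = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9]+")

def clean_text(text):
    """Replace punctuation and special characters with spaces."""
    return _NON_TEXT.sub(" ", str(text)).strip()

def load_stopwords(path, encoding="utf-8"):
    """Read a stopword file with one word per line (assumed format) into a set."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}

def remove_stopwords(words, stopwords):
    """Drop stopwords and whitespace-only tokens, preserving order."""
    return [w for w in words if w.strip() and w not in stopwords]
```

Storing the stopwords in a set keeps each membership check constant-time, which matters when the list has thousands of entries.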

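LDA Topic Model

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the text (processed_texts holds the preprocessed documents).
# Note: the default token_pattern ignores single-character tokens.
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(processed_texts)

# Train the LDA model with 3 topics
lda_model = LatentDirichletAllocation(n_components=3, random_state=42)
lda_model.fit(doc_term_matrix)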
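Once the model is fitted, the keywords of each topic can be read from its topic-word weight matrix. The sketch below shows one way to do that. It is written against plain sequences so it runs without scikit-learn. The function name top_words_per_topic and the toy weights are illustrative and do not come from the project.

```python
def top_words_per_topic(components, feature_names, n_top=10):
    """Return the n_top highest-weighted words for each topic.

    components: one row of word weights per topic.
    feature_names: vocabulary in the same column order as the weights.
    """
    topics = []
    for weights in components:
        # Sort column indices by descending weight and keep the first n_top.
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        topics.append([feature_names[j] for j in ranked[:n_top]])
    return topics

# Toy weights for 2 topics over a 4-word vocabulary (made-up numbers).
vocab = ["手机", "屏幕", "物流", "客服"]
weights = [[5.0, 3.0, 0.1, 0.2],
           [0.1, 0.3, 4.0, 2.5]]
print(top_words_per_topic(weights, vocab, n_top=2))
# → [['手机', '屏幕'], ['物流', '客服']]
```

With a fitted model, pass lda_model.components_ as the weights. For the vocabulary, use vectorizer.get_feature_names() on the pinned scikit-learn 0.24.2, or vectorizer.get_feature_names_out() on scikit-learn 1.0 and later.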

Sentiment Analysis

from snownlp import SnowNLP

def get_topic_sentence_score(texts, doc_term_matrix, lda_model):
    # Assign each document to its most probable topic.
    # doc_term_matrix must have one row per entry in texts.
    topic_assignments = lda_model.transform(doc_term_matrix).argmax(axis=1)

    # Collect SnowNLP sentiment scores (0 = negative, 1 = positive) per topic.
    topic_sentiments = {i: [] for i in range(lda_model.n_components)}
    for topic, sentence in zip(topic_assignments, texts):
        topic_sentiments[topic].append(SnowNLP(sentence).sentiments)

    # Mean score per topic, rounded to 4 places; 0 for topics with no documents.
    return [round(sum(scores) / len(scores), 4) if scores else 0
            for scores in topic_sentiments.values()]
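SnowNLP's sentiments value lies between 0 and 1, and values closer to 1 indicate more positive text. The per-topic means returned above are easier to read with a coarse polarity label attached. The sketch below shows one way to do that. The function name label_topic_scores and the 0.6 / 0.4 cut-offs are illustrative assumptions, not values from the project.

```python
def label_topic_scores(scores, positive=0.6, negative=0.4):
    """Attach a coarse polarity label to each per-topic mean score.

    The default cut-offs are arbitrary examples; tune them for your data.
    """
    rows = []
    for topic_id, score in enumerate(scores):
        if score >= positive:
            label = "positive"
        elif score <= negative:
            label = "negative"
        else:
            label = "neutral"
        rows.append({"topic": topic_id, "score": score, "label": label})
    return rows

# Example with made-up scores for three topics.
for row in label_topic_scores([0.8123, 0.5021, 0.2310]):
    print(row)
```

A topic with no assigned sentences receives a score of 0 from get_topic_sentence_score and would be labelled negative here, so empty topics may need separate handling.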

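Usage

Environment Setup

# Install dependencies
pip install -r requirements.txt

Data Preparation

Prepare the text data in Excel format
Prepare a stopword list (stopword.txt)
Make sure a Chinese font (SimHei.ttf) is installed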
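The run example below loads the Excel file through a read_excel helper that this article does not show. The sketch below is one possible version, not the project's code. The column name "content" is a placeholder for whatever header your spreadsheet uses, and reading .xlsx files with pandas also requires the openpyxl package.

```python
def column_to_texts(values):
    """Convert raw cell values to a list of non-empty, stripped strings."""
    texts = []
    for value in values:
        # Skip missing cells: None, or NaN (the only value not equal to itself).
        if value is None or value != value:
            continue
        text = str(value).strip()
        if text:
            texts.append(text)
    return texts

def read_excel(path, column="content"):
    """Load one text column from an Excel file.

    The column name "content" is a placeholder; use the header in your own file.
    """
    import pandas as pd  # imported here so column_to_texts works without pandas
    df = pd.read_excel(path)
    return column_to_texts(df[column].tolist())
```

Dropping empty cells at load time keeps blank rows out of both the topic model and the sentiment scores.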

Run Example

# 1. Load the stopwords
stopwords = load_stopwords('stopword.txt')

# 2. Read the text data
texts = read_excel('input.xlsx')

# 3. Preprocess the text
processed_texts = pre_process(texts, stopwords)

# 4. Generate the word cloud
plot_wordcloud(processed_texts, 'wordcloud.png')

# 5. LDA topic analysis
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(processed_texts)
lda_model = LatentDirichletAllocation(n_components=3, random_state=42)
lda_model.fit(doc_term_matrix)

# 6. Save the topic keywords
save_top_words(vectorizer, lda_model, 'top_words.xlsx')

# 7. Sentiment analysis
score_sentence = get_topic_sentence_score(texts, doc_term_matrix, lda_model)
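Step 4 calls plot_wordcloud, which is not shown in this article. A minimal sketch of such a helper follows. It assumes each entry of processed_texts is a string of space-separated tokens. The image size, the background color and the default font_path are illustrative choices, not the project's settings.

```python
from collections import Counter

def word_frequencies(processed_texts):
    """Count tokens, assuming each entry is a string of space-separated tokens."""
    counts = Counter()
    for text in processed_texts:
        counts.update(text.split())
    return counts

def plot_wordcloud(processed_texts, out_path, font_path="SimHei.ttf"):
    """Render a word cloud image; font_path must point to a font with Chinese glyphs."""
    from wordcloud import WordCloud  # imported here so word_frequencies has no extra dependency
    cloud = WordCloud(font_path=font_path, width=800, height=600,
                      background_color="white")
    cloud.generate_from_frequencies(word_frequencies(processed_texts))
    cloud.to_file(out_path)
```

The font path matters because wordcloud's bundled default font has no Chinese glyphs, so Chinese words would otherwise render as empty boxes. That is why the data preparation step asks for SimHei.ttf.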