[Natural Language Processing (NLP)] Sentiment Analysis and Topic Modeling

云博士的AI课堂

Published 2025-03-09 10:57:49

Article link: https://blog.csdn.net/l35633/article/details/146128840

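This article walks through two NLP techniques, sentiment analysis and topic modeling, and shows how to extract sentiment polarity and latent topics from text. Each part comes with annotated example code that runs in a standard Python environment once scikit-learn, Gensim, and NLTK (plus the NLTK stopword and tokenizer data) are installed.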

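1. Sentiment Analysis
1.1 Concepts and Overview of Methods
1.2 Traditional Machine Learning Methods
1.3 Deep Learning and Pretrained Models
1.4 Code Example: Sentiment Classification with Machine Learning
2. Topic Modeling
2.1 Concepts and the Basic Principle of LDA
2.2 Topic Modeling Methods Beyond LDA
2.3 Code Example: LDA Topic Modeling with Gensim
3. Summary and Extensions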

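1. Sentiment Analysis

1.1 Concepts and Overview of Methods

Sentiment analysis aims to determine the emotional polarity of a text, for example whether a product review is positive, negative, or neutral.

Common classification granularities

Binary classification (positive/negative)
Multi-class classification (positive/neutral/negative)
Finer-grained emotion labels (such as anger, joy, sadness)

Main methods

Rule-based: relies on sentiment lexicons or hand-written rules; suitable for simple scenarios, but costly to maintain.
Machine learning: feeds text features (Bag-of-Words, TF-IDF, etc.) into a classifier (logistic regression, naive Bayes, SVM, etc.) trained with supervision.
Deep learning:
- Networks such as CNN/RNN/LSTM can capture contextual information and improve accuracy.
- Pretrained large models (BERT, GPT) perform strongly on sentiment analysis and typically need only light fine-tuning.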
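
As a rough illustration of the rule-based approach listed above, the sketch below scores a text against a tiny hand-written lexicon and applies a one-token negation rule. The word lists, the function name `lexicon_sentiment`, and the negation handling are all assumptions made for this example; they are not a real sentiment dictionary.

```python
import re

# Hypothetical toy lexicon; a real system would use a curated dictionary.
POSITIVE = {"love", "fantastic", "good", "beautiful", "satisfied", "great"}
NEGATIVE = {"hate", "terrible", "awful", "horrible", "disappointed", "bad"}
NEGATORS = {"not", "never", "no"}


def lexicon_sentiment(text):
    """Return 'positive', 'negative', or 'neutral' based on lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        # Flip polarity when the previous token is a negator ("not good").
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sample in ["I love this movie. It's fantastic!",
                   "Not good, I am disappointed.",
                   "The package arrived on Tuesday."]:
        print(sample, "->", lexicon_sentiment(sample))
```

Even this small example hints at the maintenance cost: every new domain word, intensifier, or negation pattern needs another rule.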

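1.2 Traditional Machine Learning Methods

Classic workflow:

1. Text preprocessing: tokenization, stopword removal, and stemming or lemmatization where needed
2. Feature extraction: for example Bag-of-Words or TF-IDF
3. Classifier training: for example logistic regression, SVM, naive Bayes, or random forest
4. Prediction: vectorize the new text with the same fitted vectorizer, then output a sentiment label

Advantages: simple to implement and easy to interpret
Drawbacks: struggles to capture deeper semantics such as word order, negation, and context; performance is bounded by the quality of feature engineering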
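
To make the feature-extraction step more concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes whitespace tokenization, raw term counts, the smoothed IDF ln((1 + N) / (1 + df)) + 1, and L2 normalization, which mirrors the default behavior of scikit-learn's `TfidfVectorizer` only approximately (the real class tokenizes differently). The helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute L2-normalized TF-IDF vectors for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


if __name__ == "__main__":
    vocab, vectors = tfidf(["good movie", "bad movie", "good good plot"])
    for term, weight in zip(vocab, vectors[0]):
        print(f"{term}: {weight:.3f}")
```

Terms that appear in fewer documents receive a larger IDF, so a rare word contributes more to the vector than a common one with the same count.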

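1.3 Deep Learning and Pretrained Models

RNN/LSTM/CNN

The text is tokenized and represented with word embeddings, then an RNN or CNN captures sequential or local features. With enough labeled data, this often outperforms traditional bag-of-words models.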
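
The data flow described above (embedding lookup, convolution over token windows, pooling, classification) can be sketched with NumPy. This is only a forward pass with randomly initialized weights, so the output probability is meaningless until the parameters are trained; the vocabulary, dimensions, and the function name `cnn_forward` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny vocabulary; index 0 is reserved for unknown tokens.
vocab = {"<unk>": 0, "i": 1, "love": 2, "this": 3, "movie": 4}
embed_dim, n_filters, kernel = 8, 4, 2

# Randomly initialized parameters; in practice these are learned.
embeddings = rng.normal(size=(len(vocab), embed_dim))
filters = rng.normal(size=(n_filters, kernel, embed_dim))
w_out = rng.normal(size=n_filters)


def cnn_forward(tokens):
    """Embed tokens, convolve, apply ReLU and max-pooling, then a sigmoid.

    Expects at least `kernel` tokens.
    """
    ids = [vocab.get(t, 0) for t in tokens]
    x = embeddings[ids]                        # (seq_len, embed_dim)
    windows = np.stack([x[i:i + kernel]        # (n_windows, kernel, embed_dim)
                        for i in range(len(ids) - kernel + 1)])
    # Each filter produces one activation per window.
    conv = np.einsum("wke,fke->wf", windows, filters)
    pooled = np.maximum(conv, 0).max(axis=0)   # max over time: (n_filters,)
    logit = pooled @ w_out
    return 1.0 / (1.0 + np.exp(-logit))        # score in [0, 1]


if __name__ == "__main__":
    print(cnn_forward(["i", "love", "this", "movie"]))
```

In a real project the same structure would be written in a deep learning framework so that the embeddings and filters can be trained by backpropagation.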

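Pretrained language models (BERT, GPT, etc.)

BERT learns rich semantic representations through large-scale pretraining, so a downstream sentiment analysis task usually reaches high performance with only a small amount of fine-tuning.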

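1.4 Code Example: Sentiment Classification with Machine Learning

The following example uses sklearn to show a simplified workflow:

1. Build toy data
2. Vectorize with TF-IDF
3. Train a logistic regression model
4. Predict and evaluate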

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 1) Toy data: (text, label) pairs
corpus = [
    ("I love this movie. It's fantastic!", "positive"),
    ("Absolutely terrible. Waste of time.", "negative"),
    ("Pretty good overall, but not the best.", "positive"),
    ("I hate this product, it's awful!", "negative"),
    ("The design is beautiful and I am satisfied.", "positive"),
    ("It's okay, not too bad, not too good.", "positive"),  # neutral text, labeled positive to keep the task binary
    ("Horrible experience, I'm disappointed.", "negative"),
    ("Could be better, I'm not fully happy with it.", "negative")
]

texts = [item[0] for item in corpus]
labels = [item[1] for item in corpus]

# 2) Train/test split; stratify so both classes appear in the 2-sample test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# 3) TF-IDF vectorization: fit on the training set only, then reuse on the test set
# Note: the built-in English stop list removes negations such as "not",
# which can hurt sentiment models; consider a custom list for real data.
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 4) Train logistic regression
clf = LogisticRegression()
clf.fit(X_train_vec, y_train)

# 5) Test and evaluate
y_pred = clf.predict(X_test_vec)
print("Predictions:", list(y_pred))
print("True labels:", y_test)

print("\nClassification Report:")
# zero_division=0 avoids warnings when a class is never predicted
print(classification_report(y_test, y_pred, zero_division=0))

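Running the script prints classification metrics such as accuracy, precision, recall, and F1. With only eight samples, the numbers merely demonstrate the workflow and say little about real-world performance.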
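
For readers who want to see what those metrics mean, the sketch below computes accuracy, precision, recall, and F1 by hand for one class. The function name `binary_metrics` and the sample labels are made up for illustration and are not output from the model above.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy plus precision/recall/F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical labels, chosen only to exercise the formulas.
    y_true = ["positive", "negative", "positive", "negative"]
    y_pred = ["positive", "positive", "negative", "negative"]
    print(binary_metrics(y_true, y_pred))
```

Precision answers "of the texts predicted positive, how many really are?", while recall answers "of the truly positive texts, how many were found?".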

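2. Topic Modeling

2.1 Concepts and the Basic Principle of LDA

Topic modeling aims to discover latent topics in large collections of unlabeled text.

LDA (Latent Dirichlet Allocation): the most classic probabilistic topic model
- Assumes each document is generated from a mixture of several topics, and each topic is a probability distribution over words
- Uses word co-occurrence statistics across documents to infer the document-topic and topic-word distributions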
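
The generative assumption behind LDA can be illustrated by sampling a few synthetic documents. In this sketch the topic-word distributions are fixed by hand for readability (full LDA also draws them from a Dirichlet prior), and the vocabulary, the `alpha` value, and the function name `sample_lda_corpus` are assumptions for this example only.

```python
import numpy as np

VOCAB = ["football", "basketball", "game", "bank", "interest", "rate"]

# Hand-picked topic-word distributions; each row sums to 1.
TOPIC_WORD = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "sports" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "finance" topic
])


def sample_lda_corpus(n_docs=3, doc_len=8, alpha=0.5, seed=0):
    """Sample documents following the LDA generative story."""
    rng = np.random.default_rng(seed)
    n_topics = len(TOPIC_WORD)
    docs, mixtures = [], []
    for _ in range(n_docs):
        # 1) Draw the document's topic mixture from a Dirichlet prior.
        theta = rng.dirichlet([alpha] * n_topics)
        words = []
        for _ in range(doc_len):
            # 2) Pick a topic for this word position.
            z = rng.choice(n_topics, p=theta)
            # 3) Pick a word from that topic's word distribution.
            w = rng.choice(len(VOCAB), p=TOPIC_WORD[z])
            words.append(VOCAB[w])
        docs.append(words)
        mixtures.append(theta)
    return docs, mixtures


if __name__ == "__main__":
    docs, mixtures = sample_lda_corpus()
    for words, theta in zip(docs, mixtures):
        print(np.round(theta, 2), " ".join(words))
```

Training LDA runs this story in reverse: given only the words, it estimates the mixtures and the topic-word distributions that most plausibly produced them.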

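2.2 Topic Modeling Methods Beyond LDA

PLSA: the predecessor of LDA; it places no Dirichlet prior on the document-topic distributions, so it tends to overfit and does not naturally generalize to unseen documents
HDP: hierarchical Dirichlet process, a nonparametric model that infers the number of topics from the data
Neural topic models: combine deep learning with topic discovery, for example VAE-based models or clustering of BERT embeddings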

2.3 Code Example: LDA Topic Modeling with Gensim

The following example trains a simple LDA model with the Gensim library to demonstrate the workflow.

# pip install gensim nltk
import nltk
from gensim import corpora
from gensim.models.ldamodel import LdaModel
# Download the NLTK resources on first use
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('punkt_tab')  # required by newer NLTK releases
from nltk.corpus import stopwords

documents = [
    "I love to watch football games. Football is a great sport!",
    "The bank is closing soon, check your bank account quickly.",
    "I prefer basketball to football, it is more dynamic.",
    "The investment bank raised interest rates yesterday.",
    "He watches basketball and football every weekend.",
    "Financial institutions are impacted by interest rate changes."
]

stop_words = set(stopwords.words('english'))

def tokenize_and_clean(text):
    """Lowercase, tokenize, and keep alphabetic non-stopword tokens."""
    tokens = nltk.word_tokenize(text.lower())
    filtered = [w for w in tokens if w.isalpha() and w not in stop_words]
    return filtered

processed_docs = [tokenize_and_clean(doc) for doc in documents]

# Build the dictionary (token -> integer id)
dictionary = corpora.Dictionary(processed_docs)
# Drop tokens that appear in more than 90% of documents
dictionary.filter_extremes(no_below=1, no_above=0.9)

# Convert each document to a bag-of-words: a list of (token_id, count) pairs
corpus_bow = [dictionary.doc2bow(doc) for doc in processed_docs]

num_topics = 2
lda_model = LdaModel(
    corpus=corpus_bow,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,   # fixed seed for reproducible topics
    passes=10,         # number of passes over the corpus
    alpha='auto'       # learn the document-topic prior from the data
)

# Print the top words of each topic
for i in range(num_topics):
    print(f"Topic {i}:")
    print(lda_model.print_topic(i))
    print("------")

# Infer the topic distribution of a new document
new_doc = "The interest rate for bank deposits is increasing."
bow_new_doc = dictionary.doc2bow(tokenize_and_clean(new_doc))
topic_probs = lda_model.get_document_topics(bow_new_doc)
print("\nTopic distribution of the new document:", topic_probs)
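
The inferred distribution is a list of `(topic_id, probability)` pairs. A small helper can pick the dominant topic from such a list; the name `dominant_topic` and the sample values below are hypothetical and are not output from the model above.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability.

    Returns None when the list is empty, for example when every topic
    falls below the model's minimum probability threshold.
    """
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical values shaped like the list of (topic_id, probability) pairs.
    example = [(0, 0.18), (1, 0.82)]
    print(dominant_topic(example))
```

On a six-document corpus the topics themselves are unstable, so the dominant topic should be read as a demonstration of the API rather than a reliable label.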