Topic Model Analysis

rose~Fxl

Category: Scikit-learn

Article link: https://blog.csdn.net/weixin_61083660/article/details/126294542

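Video link: [python-sklearn] Chinese text | Topic model analysis: LDA (Latent Dirichlet Allocation), on bilibili

Detailed code: "How to extract topics from massive amounts of text with Python?"

Concepts

Topic modeling: automatically encoding a text corpus into a set of substantively meaningful categories.

The typical representative of topic analysis is Latent Dirichlet Allocation (LDA).

LDA

Most distinctive feature: it automatically groups a collection of documents into a given number of topics.

The number of topics must be set manually.

Principle

A model is judged by comparing new documents against the old ones it was trained on, and the best model is then picked from many models fitted with different parameters.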
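One way to read this principle, as a minimal sketch: fit several models on "old" (training) documents and compare their perplexity on "new" (held-out) documents. The toy corpus, the train/test split and the candidate topic counts below are assumptions made for illustration and are not part of the original tutorial.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: pre-segmented documents, tokens separated by spaces.
docs = [
    "cat dog pet animal", "dog pet food animal", "cat pet toy animal",
    "stock market price trade", "market trade bank price", "bank stock fund price",
] * 5

train_docs, test_docs = train_test_split(docs, test_size=0.3, random_state=0)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # vocabulary comes from the "old" documents only
X_test = vectorizer.transform(test_docs)        # "new" documents reuse that vocabulary

# Fit one model per candidate topic count and score it on the held-out documents.
heldout_perplexity = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, max_iter=20, random_state=0)
    model.fit(X_train)
    heldout_perplexity[k] = model.perplexity(X_test)

# Lower held-out perplexity is taken as the better model in this sketch.
best_k = min(heldout_perplexity, key=heldout_perplexity.get)
print(heldout_perplexity, best_k)
```

On a corpus this small the numbers carry little meaning; the point is only the shape of the comparison.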

Code

Import the sklearn modules:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Define the function print_top_words:

def print_top_words(model, feature_names, n_top_words):
    tword = []
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        topic_w = " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        tword.append(topic_w)
        print(topic_w)
    return tword
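The slice `topic.argsort()[:-n_top_words - 1:-1]` in the function above is easy to misread, so here is a small standalone sketch of what it selects. The vocabulary and weights are made up for illustration.

```python
import numpy as np

# Hypothetical weights of one topic over a five-word vocabulary.
feature_names = ["apple", "bank", "cat", "dog", "egg"]
topic = np.array([0.1, 2.5, 0.3, 1.7, 0.9])
n_top_words = 3

# argsort() returns indices ordered by ascending weight; the slice
# [:-n_top_words - 1:-1] walks backwards from the end and stops after
# n_top_words indices, i.e. the heaviest words first.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)  # ['bank', 'dog', 'egg']
```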

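Transform the data:

CountVectorizer counts term frequencies. The code below keeps only terms that appear in at least 10 documents (min_df = 10) and in no more than 50% of the documents (max_df = 0.5).

n_features = 1000  # keep at most the 1000 most frequent terms as features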
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                max_features=n_features,
                                stop_words='english',
                                max_df = 0.5,
                                min_df = 10)
tf = tf_vectorizer.fit_transform(data.content_cutted)
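To make the effect of min_df and max_df concrete, here is a minimal sketch on a made-up four-document corpus. The documents and thresholds are assumptions chosen so that the filtering is easy to follow by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-segmented documents (tokens separated by spaces).
docs = [
    "data model topic",
    "data model text",
    "data topic word",
    "data rare",
]

# min_df=2: keep terms found in at least 2 documents.
# max_df=0.75: drop terms found in more than 75% of the documents.
vec = CountVectorizer(min_df=2, max_df=0.75)
X = vec.fit_transform(docs)

# "data" is in every document (dropped by max_df); "text", "word" and
# "rare" are each in one document (dropped by min_df).
vocab = sorted(vec.vocabulary_)
print(vocab)  # ['model', 'topic']
print(X.toarray())
```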

Training the topic model

n_topics = 8  # number of topics, chosen manually
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                learning_method='batch',
                                learning_offset=50,
#                                 doc_topic_prior=0.1,
#                                 topic_word_prior=0.01,
                               random_state=0)
lda.fit(tf)

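Parameters of LatentDirichletAllocation:

n_components=n_topics is the number of latent topics K.

max_iter=50 sets the maximum number of iterations of the EM algorithm to 50.
learning_method='batch' selects the 'batch' solver, i.e. variational-inference EM that uses all training data in each update.
learning_offset=50 only has an effect when learning_method is 'online'. It down-weights the early batches of training samples so that they influence the final model less.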

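α corresponds to doc_topic_prior and β to topic_word_prior; the former is commonly set to 0.1 and the latter to 0.01.

If α and β are not set, both default to 1/n_components (here 1/n_topics).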
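The default can be checked on a fitted model through its doc_topic_prior_ and topic_word_prior_ attributes. The sketch below uses a made-up corpus and an assumed topic count of 4; only the printed priors matter, not the quality of the fit.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only so that the models can be fitted.
docs = ["cat dog pet", "dog pet food", "stock market price", "market bank price"]
X = CountVectorizer().fit_transform(docs)

n_topics = 4

# Priors left unset: both fall back to 1 / n_components.
default_lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)

# Priors set explicitly to the values mentioned in the text.
custom_lda = LatentDirichletAllocation(n_components=n_topics,
                                       doc_topic_prior=0.1,
                                       topic_word_prior=0.01,
                                       random_state=0).fit(X)

print(default_lda.doc_topic_prior_, default_lda.topic_word_prior_)
print(custom_lda.doc_topic_prior_, custom_lda.topic_word_prior_)
```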

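Methods:

fit(X[, y]): trains the model on the training data; the input X is the document-term count matrix.

n_top_words = 25  # print the top 25 words of each topic
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
tf_feature_names = tf_vectorizer.get_feature_names_out()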
topic_word = print_top_words(lda, tf_feature_names, n_top_words)

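Result (the topics have to be named manually):

Determining the optimal number of topics

The number of topics should not be too large; otherwise the documents are split too finely.

Visualization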

import pyLDAvis
import pyLDAvis.sklearn  # in pyLDAvis >= 3.4.0 this module is named pyLDAvis.lda_model

pyLDAvis.enable_notebook()
pic = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
pyLDAvis.save_html(pic, 'lda_pass' + str(n_topics) + '.html')  # saved in the working directory
pyLDAvis.display(pic)
# Unlike in the video, this code currently does not need to be interrupted manually and returns quickly.

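Result:

Each circle represents a topic. The farther apart two circles are, the less similar the topics are and the better they are separated.

Topic perplexity (more rigorous)

Generally, the number of topics with the lowest perplexity is taken as the best one.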

import matplotlib.pyplot as plt

plexs = []
scores = []
n_max_topics = 16
for i in range(1,n_max_topics):
    print(i)
    lda = LatentDirichletAllocation(n_components=i, max_iter=50,
                                    learning_method='batch',
                                    learning_offset=50,random_state=0)
    lda.fit(tf)
    plexs.append(lda.perplexity(tf))
    scores.append(lda.score(tf))

n_t = 15  # rightmost number of topics to plot; must be less than n_max_topics
x=list(range(1,n_t+1))
plt.plot(x,plexs[0:n_t])
plt.xlabel("number of topics")
plt.ylabel("perplexity")
plt.show()
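The two lists collected in the loop are closely related. As far as I understand scikit-learn's implementation, score returns an approximate log-likelihood (a variational bound) and perplexity is the exponential of the negative per-word bound. The sketch below checks that relation on a made-up corpus; the corpus and the name lda_demo are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
docs = [
    "cat dog pet animal", "dog pet food animal", "cat pet toy animal",
    "stock market price trade", "market trade bank price", "bank stock fund price",
]
X = CountVectorizer().fit_transform(docs)

lda_demo = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0).fit(X)

log_likelihood = lda_demo.score(X)  # approximate log-likelihood of X
n_words = X.sum()                   # total number of word occurrences in X

# Perplexity as exp(-log-likelihood per word); lower is better.
manual_perplexity = np.exp(-log_likelihood / n_words)
print(manual_perplexity, lda_demo.perplexity(X))
```

This is why the perplexity curve and the score curve tell the same story: a higher score on the same matrix means a lower perplexity.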

Result:

Applying the elbow method and narrowing the range of the x-axis shows a dip at x = 8, which I take to be a local minimum.
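Reading the dip off the plot can be cross-checked in code. The sketch below finds the global minimum and all local minima in a list of perplexity values; the numbers are invented for illustration and are not the results of this article.

```python
# Hypothetical perplexity values for 1..10 topics (not real results).
plexs = [950.0, 870.0, 820.0, 790.0, 770.0, 765.0, 760.0, 700.0, 755.0, 758.0]
topic_counts = list(range(1, len(plexs) + 1))

# Global minimum: the topic count with the lowest perplexity overall.
best_n = topic_counts[plexs.index(min(plexs))]

# Local minima: values strictly lower than both neighbours.
local_minima = [
    topic_counts[i]
    for i in range(1, len(plexs) - 1)
    if plexs[i] < plexs[i - 1] and plexs[i] < plexs[i + 1]
]
print(best_n, local_minima)  # 8 [8]
```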

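For more on perplexity, see: "Elbow method + perplexity for determining the number of topics of an LDA topic model" on 巴基海贼王's CSDN blog.

Exporting the results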

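import numpy as np

# lda must be the model fitted with the chosen number of topics;
# refit it first if the perplexity loop above has overwritten it.
topics = lda.transform(tf)  # document-topic probabilities, one row per document

topic = []
for t in topics:
    topic.append("Topic #" + str(np.argmax(t)))  # index of the most probable topic
data['概率最大的主题序号'] = topic  # column: most probable topic
data['每个主题对应概率'] = list(topics)  # column: probability of every topic
data.to_excel("data_topic.xlsx", index=False)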
