Machine Learning: The LDA Model

广小辉

Article link: https://blog.csdn.net/Galbraith_/article/details/104577253

1. Model parameters

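n_components: the number of topics. A larger value gives more topics and usually a lower perplexity on the training data, but it also makes overfitting more likely. One way to choose it is to plot perplexity against n_components;
doc_topic_prior: the prior of the document-topic distribution theta; defaults to $\frac{1}{n\_components}$;
topic_word_prior: the prior of the topic-word distribution beta; defaults to $\frac{1}{n\_components}$;
learning_method: how the topic-word distribution is updated:
1. batch: batch variational Bayes; every EM update uses all of the data.
2. online: for larger datasets, the topic-word distribution is updated from mini-batches. The weight of the update at mini-batch step t is $(learning\_offset + t)^{-learning\_decay}$, so it depends on both learning_decay and learning_offset;
learning_decay: controls how quickly the learning rate decays in online learning; it should lie in (0.5, 1.0] to guarantee asymptotic convergence.
learning_offset: a positive value that down-weights the early mini-batches in online learning, which limits how much the first batches influence the final model.
max_iter: the maximum number of iterations;
batch_size: the number of documents used in each EM iteration in online learning;
evaluate_every: how often perplexity is evaluated during training (0 or a negative value disables it); perplexity is a convenient way to monitor convergence.
perp_tol: the perplexity tolerance, used as the stopping criterion when perplexity is evaluated (evaluate_every > 0).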
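The suggestion to compare n_components against perplexity can be sketched as follows. This is a minimal illustration rather than a tuning recipe: the toy count matrix, the candidate values of k, and the train/held-out split are all assumptions made here, and perplexity is measured on held-out documents so that overfitting is not rewarded.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus: 60 documents over a 30-word vocabulary (random counts).
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(60, 30))
X_train, X_heldout = X[:45], X[45:]

# Fit one model per candidate topic count and record held-out perplexity.
perplexity_by_k = {}
for k in (2, 5, 10):
    lda = LatentDirichletAllocation(n_components=k, max_iter=10,
                                    learning_method='batch', random_state=0)
    lda.fit(X_train)
    perplexity_by_k[k] = lda.perplexity(X_heldout)

for k, p in perplexity_by_k.items():
    print(k, round(p, 2))
```

The resulting dictionary can be passed to any plotting library to draw the curve; on real data one would typically look for the point where held-out perplexity stops improving.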

2. Code

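from sklearn.decomposition import LatentDirichletAllocation

# `tfidf` (document-term matrix) and `tfidf_vectorizer` are assumed to have been built earlier
lda = LatentDirichletAllocation(n_components=200, max_iter=10,
                                learning_method='online')
lda.fit(tfidf)
# Topic-word distribution
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidf_vectorizer.get_feature_names_out()
# components_ holds unnormalized pseudo-counts; normalize each row to get probabilities
topic_term = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
for j, topic in enumerate(topic_term):
    # Format every term of the topic as "weight%*word"
    top_str = ' + '.join('{}%*{}'.format(round(topic[i] * 100, 3), feature_names[i])
                         for i in range(len(feature_names)))
    if j != 0 and j % 150 == 0:  # print only a sample of the topics (here: topic 150)
        print(top_str)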


3. Convergence (perplexity)

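Calling lda.perplexity(X) returns the perplexity of the fitted model on X. sklearn defines perplexity as exp(-1. * log-likelihood per word), where the log-likelihood is an approximation obtained from the variational bound.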
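To make the definition concrete, the sketch below recomputes perplexity by hand from lda.score(X), which returns the approximate log-likelihood of X, divided by the total word count. The random toy count matrix and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus: 40 documents over a 20-word vocabulary (random counts).
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(40, 20))

lda = LatentDirichletAllocation(n_components=3, max_iter=10, random_state=0).fit(X)

reported = lda.perplexity(X)

# score(X) is the approximate (variational bound) log-likelihood of the whole matrix;
# dividing by the total number of word occurrences gives the per-word value.
log_likelihood = lda.score(X)
manual = np.exp(-1.0 * log_likelihood / X.sum())

print(reported, manual)
```

Lower values indicate a better fit to X; as with any likelihood-based measure, values computed on the training data alone say little about generalization.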
