Chapters 20 and 21: Latent Dirichlet Allocation (LDA) and the PageRank Algorithm

白鸟坠入密林

Published 2024-07-19 16:05:31

Category: Machine Learning. Tags: latent Dirichlet allocation, LDA, topic analysis, 微词云, PageRank algorithm, gensim, pyLDAvis

Article link: https://blog.csdn.net/m0_56676945/article/details/140552381

I. Latent Dirichlet Allocation (LDA)

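A note before we start:

I really could not follow all of the mathematical derivations in this chapter of the book, especially the parts on Gibbs sampling and the variational EM algorithm. They are brutal, and everything fell apart for me! Am I supposed to be satisfied with just knowing how to call a library?

For an overview of this chapter, see the article: 30. Latent Dirichlet Allocation

Other articles:

A Detailed Explanation of Latent Dirichlet Allocation (LDA)

Text Topic Models: LDA (1), LDA Basics

An Introduction to the LDA Topic Model and Its Python Implementation

Chapter summary: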

1. The probability density function of the Dirichlet distribution is $p ( \theta | \alpha ) = \frac { \Gamma ( \sum _ { i = 1 } ^ { k } \alpha _ { i } ) } { \prod _ { i = 1 } ^ { k } \Gamma ( \alpha _ { i } ) } \prod _ { i = 1 } ^ { k } \theta _ { i } ^ { \alpha _ { i } - 1 }$
where $\sum _ { i = 1 } ^ { k } \theta _ { i } = 1 , \theta _ { i } \geq 0 , \alpha = ( \alpha _ { 1 } , \alpha _ { 2 } , \cdots , \alpha _ { k } ) , \alpha _ { i } > 0 , i = 1 , 2 , \cdots , k$. The Dirichlet distribution is the conjugate prior of the multinomial distribution.

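2. Latent Dirichlet allocation (LDA) is a generative probabilistic model of a collection of texts. The model assumes that each topic is represented by a multinomial distribution over words and each text by a multinomial distribution over topics, and that the prior distributions of both the word distributions and the topic distributions are Dirichlet distributions. The LDA model is a probabilistic graphical model and can be represented in plate notation. In the LDA model, the word distribution of each topic, the topic distribution of each text, and the topic at each position of a text are latent variables; the word at each position of a text is an observed variable.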

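3. The process by which LDA generates a collection of texts is as follows:

(1) Word distributions of the topics: randomly generate the word distribution of every topic. The word distribution of a topic is a multinomial distribution whose prior is a Dirichlet distribution.

(2) Topic distributions of the texts: randomly generate the topic distribution of every text. The topic distribution of a text is a multinomial distribution whose prior is a Dirichlet distribution.

(3) Content of the texts: randomly generate the content of every text. At each position of each text, randomly generate a topic according to the topic distribution of the text, then randomly generate a word according to the word distribution of that topic.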
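The three generation steps above can be sketched in a few lines of NumPy. This is only a toy illustration, not code from the book: the function name `generate_corpus`, the fixed text length `doc_len`, and the hyperparameter values are all assumptions made for the example.

```python
import numpy as np


def generate_corpus(M, K, V, doc_len, alpha, beta, seed=0):
    """Sample a toy corpus following generation steps (1)-(3).

    M: number of texts, K: number of topics, V: vocabulary size,
    doc_len: words per text (fixed here for simplicity, an assumption).
    """
    rng = np.random.default_rng(seed)
    # (1) word distribution of each topic: varphi_k ~ Dir(beta), shape (K, V)
    varphi = rng.dirichlet(beta, size=K)
    # (2) topic distribution of each text: theta_m ~ Dir(alpha), shape (M, K)
    theta = rng.dirichlet(alpha, size=M)
    # (3) content: at each position draw a topic, then a word from that topic
    topics, docs = [], []
    for m in range(M):
        z_m = rng.choice(K, size=doc_len, p=theta[m])
        w_m = np.array([rng.choice(V, p=varphi[k]) for k in z_m])
        topics.append(z_m)
        docs.append(w_m)
    return varphi, theta, topics, docs


# Hypothetical sizes and hyperparameters, chosen only for illustration
K, V = 3, 11
varphi, theta, topics, docs = generate_corpus(
    M=9, K=K, V=V, doc_len=5, alpha=np.full(K, 0.5), beta=np.full(V, 0.5))
print(docs[0], topics[0])
```

Learning runs this process in reverse: only the words in `docs` are observed, and `varphi`, `theta` and `topics` have to be inferred.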

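4. Learning and inference for the LDA model cannot be solved directly. The methods usually used are the Gibbs sampling algorithm and the variational EM algorithm; the former is a Monte Carlo method and the latter is an approximation algorithm.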

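5. The basic idea of the collapsed Gibbs sampling algorithm for LDA is as follows. The goal is to estimate the joint probability distribution $p ( w , z , \theta , \varphi | \alpha , \beta )$. The latent variables $\theta$ and $\varphi$ are eliminated by integration, which gives the marginal probability distribution $p ( w , z | \alpha , \beta )$; Gibbs sampling is applied to the posterior distribution $p ( z | w , \alpha , \beta )$ to obtain random samples from it; the samples are then used to estimate the variables $z$, $\theta$ and $\varphi$, which finally gives the parameter estimates of the LDA model $p ( w , z , \theta , \varphi | \alpha , \beta )$. The concrete algorithm is as follows. For the given word sequence of the texts, randomly assign a topic to each position; together these assignments form the topic sequence. Then repeat the following: scan the whole text sequence and, at each position, compute the full conditional probability distribution of the topic at that position, draw a random sample from it to get a new topic, and assign that topic to the position.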

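6. The basic idea of variational inference is as follows. Suppose the model is the joint probability distribution $p ( x , z )$, where $x$ is the observed variable (the data) and $z$ is the latent variable. The goal is to learn the posterior probability distribution $p ( z | x )$ of the model. A variational distribution $q ( z )$ is used to approximate the conditional distribution $p ( z | x )$, with the KL divergence measuring how close the two are: find the $q ^ { * } ( z )$ that is closest to $p ( z | x )$ in the sense of KL divergence, and use this distribution to approximate $p ( z | x )$. All components of $z$ in $q ( z )$ are assumed to be mutually independent. Using Jensen's inequality, one finds that minimizing the KL divergence can be achieved by maximizing the evidence lower bound. Variational inference therefore becomes the following evidence lower bound maximization problem:
$L ( q , \theta ) = E _ { q } [ \operatorname { log } p ( x , z | \theta ) ] - E _ { q } [ \operatorname { log } q ( z ) ]$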

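7. The variational EM algorithm for LDA is as follows. Define a variational distribution for the LDA model and apply the variational EM algorithm. The goal is to maximize the evidence lower bound $L ( \gamma , \eta , \alpha , \varphi )$, where $\alpha$ and $\varphi$ are the model parameters and $\gamma$ and $\eta$ are the variational parameters. The E-step and the M-step are iterated alternately until convergence.

(1) E-step: fix the model parameters $\alpha$, $\varphi$ and estimate the variational parameters $\gamma$, $\eta$ by maximizing the evidence lower bound with respect to $\gamma$, $\eta$.
(2) M-step: fix the variational parameters $\gamma$, $\eta$ and estimate the model parameters $\alpha$, $\varphi$ by maximizing the evidence lower bound with respect to $\alpha$, $\varphi$.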
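To make the link between the KL divergence and the evidence lower bound in item 6 less abstract, here is a small numerical check on a toy discrete model. It relies on the standard identity log p(x) = L(q) + KL(q || p(z|x)); the joint probabilities in `p_xz` and the trial distribution `q` are made-up numbers, and the model has no parameter $\theta$, so this is a sketch of the idea rather than anything specific to LDA.

```python
import numpy as np

# Hypothetical joint probabilities p(x, z = k) for one fixed observation x
# and a latent variable z with three states (illustrative numbers only)
p_xz = np.array([0.10, 0.25, 0.05])
p_x = p_xz.sum()          # evidence p(x)
posterior = p_xz / p_x    # exact posterior p(z | x)


def elbo(q):
    """Evidence lower bound E_q[log p(x, z)] - E_q[log q(z)]."""
    return float(np.sum(q * (np.log(p_xz) - np.log(q))))


def kl(q, p):
    """KL divergence KL(q || p) between two discrete distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


q = np.array([0.5, 0.3, 0.2])  # an arbitrary variational distribution

print("log p(x)       :", np.log(p_x))
print("ELBO(q) + KL   :", elbo(q) + kl(q, posterior))
print("ELBO(posterior):", elbo(posterior))
```

Since log p(x) does not depend on q, pushing the ELBO up is the same as pushing the KL divergence down, and the bound is tight exactly when q equals the posterior.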

Exercise 20.2

Apply the LDA model to the text example of Section 17.2.2 to carry out topic analysis.

Hand-written implementation

import numpy as np


class GibbsSamplingLDA:

    def __init__(self, iter_max=1000):
        self.iter_max = iter_max
        self.weights_ = []

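    def fit(self, words, K):
        """
        :param words: word-text matrix (rows are words, columns are texts)
        :param K: number of topics
        :return: topic sequence z of the texts, and the count matrices n_kv and n_mk
        """
        # M is the number of texts and Nm is the number of words
        words = words.T
        M, Nm = words.shape

        # Initialize the hyperparameters alpha and beta: alpha is the parameter of the
        # topic distribution of the texts, beta that of the word distribution of the topics
        alpha = np.array([1 / K] * K)
        beta = np.array([1 / Nm] * Nm)

        # Initialize the parameters theta and varphi: theta holds the multinomial parameters
        # of each text over topics, varphi the multinomial parameters of each topic over words
        theta = np.zeros([M, K])
        varphi = np.zeros([K, Nm])

        # Topic sequence z of the texts (output)
        z = np.zeros(words.shape, dtype='int')

        # (1) Set all elements of the count matrices n_mk, n_kv and the count vectors n_m, n_k to 0
        n_mk = np.zeros([M, K])
        n_kv = np.zeros([K, Nm])
        n_m = np.zeros(M)
        n_k = np.zeros(K)

        # (2) Loop over all words in all M texts
        for m in range(M):
            for v in range(Nm):
                # If word v occurs in text m (repeated occurrences are counted once)
                if words[m, v] != 0:
                    # (2.a) Sample a topic uniformly at random
                    z[m, v] = np.random.choice(list(range(K)))
                    # Increment the text-topic count
                    n_mk[m, z[m, v]] += 1
                    # Increment the text-topic total
                    n_m[m] += 1
                    # Increment the topic-word count
                    n_kv[z[m, v], v] += 1
                    # Increment the topic-word total
                    n_k[z[m, v]] += 1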

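        # (3) Loop over all words in all M texts for iter_max sweeps
        # (intended to run until the burn-in period has been reached)
        zi = 0
        for i in range(self.iter_max):
            for m in range(M):
                for v in range(Nm):
                    # (3.a) If word v occurs in text m, the current word is the v-th word
                    # and its topic assignment z_mv is the k-th topic
                    if words[m, v] != 0:
                        # Decrease the counts
                        n_mk[m, z[m, v]] -= 1
                        n_m[m] -= 1
                        n_kv[z[m, v], v] -= 1
                        n_k[z[m, v]] -= 1

                        # (3.b) Evaluate the full conditional distribution for each topic k
                        max_zi_value, max_zi_index = -float('inf'), z[m, v]
                        for k in range(K):
                            zi = ((n_kv[k, v] + beta[v]) / (n_kv[k, :].sum() + beta.sum())) * \
                                 ((n_mk[m, k] + alpha[k]) / (n_mk[m, :].sum() + alpha.sum()))

                        # Assign the new topic k' to z_mv.
                        # NOTE: as written, this comparison runs once, after the loop over k
                        # has finished, so zi and k always refer to the last topic (k = K - 1)
                        # and that topic is always chosen. No random draw from the full
                        # conditional takes place, which is why every word in the output
                        # of this run ends up with topic 2.
                        if max_zi_value < zi:
                            max_zi_value, max_zi_index = zi, k
                            z[m, v] = max_zi_index

                        # (3.c) (3.d) Increase the counts to get the two updated count matrices n_kv and n_mk
                        n_mk[m, z[m, v]] += 1
                        n_m[m] += 1
                        n_kv[z[m, v], v] += 1
                        n_k[z[m, v]] += 1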

        # (4) Compute the model parameters from the sample counts
        for m in range(M):
            for k in range(K):
                theta[m, k] = (n_mk[m, k] + alpha[k]) / (n_mk[m, :].sum() +
                                                         alpha.sum())

        for k in range(K):
            for v in range(Nm):
                varphi[k, v] = (n_kv[k, v] + beta[v]) / (n_kv[k, :].sum() +
                                                         beta.sum())

        self.weights_ = [varphi, theta]
        return z.T, n_kv, n_mk

gibbs_sampling_lda = GibbsSamplingLDA(iter_max=1000)

# Input word-text matrix: 11 words (rows) and 9 texts (columns)
words = np.array([[0, 0, 1, 1, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 1, 0, 0, 1],
                  [0, 1, 0, 0, 0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 1],
                  [1, 0, 0, 0, 0, 1, 0, 0, 0],
                  [1, 1, 1, 1, 1, 1, 1, 1, 1],
                  [1, 0, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 0, 0, 2, 0, 0, 1],
                  [1, 0, 1, 0, 0, 0, 0, 1, 0],
                  [0, 0, 0, 1, 1, 0, 0, 0, 0]])

K = 3  # assume 3 topics

# Print with 3 decimal places
np.set_printoptions(precision=3, suppress=True)

z, n_kv, n_mk = gibbs_sampling_lda.fit(words, K)
varphi = gibbs_sampling_lda.weights_[0]
theta = gibbs_sampling_lda.weights_[1]

print("Topic sequence z of the texts:")
print(z)
print("Sample count matrix N_KV:")
print(n_kv)
print("Sample count matrix N_MK:")
print(n_mk)
print("Model parameter varphi:")
print(varphi)
print("Model parameter theta:")
print(theta)

Topic sequence z of the texts:
[[0 0 2 2 0 0 0 0 0]
 [0 0 0 0 0 2 0 0 2]
 [0 2 0 0 0 0 0 2 0]
 [0 0 0 0 0 0 2 0 2]
 [2 0 0 0 0 2 0 0 0]
 [2 2 2 2 2 2 2 2 2]
 [2 0 2 0 0 0 0 0 0]
 [0 0 0 0 0 0 2 0 2]
 [0 0 0 0 0 2 0 0 2]
 [2 0 2 0 0 0 0 2 0]
 [0 0 0 2 2 0 0 0 0]]
Sample count matrix N_KV:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [2. 2. 2. 2. 2. 9. 2. 2. 2. 3. 2.]]
Sample count matrix N_MK:
[[0. 0. 4.]
 [0. 0. 2.]
 [0. 0. 4.]
 [0. 0. 3.]
 [0. 0. 2.]
 [0. 0. 4.]
 [0. 0. 3.]
 [0. 0. 3.]
 [0. 0. 5.]]
Model parameter varphi:
[[0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091]
 [0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091]
 [0.067 0.067 0.067 0.067 0.067 0.293 0.067 0.067 0.067 0.1   0.067]]
Model parameter theta:
[[0.067 0.067 0.867]
 [0.111 0.111 0.778]
 [0.067 0.067 0.867]
 [0.083 0.083 0.833]
 [0.111 0.111 0.778]
 [0.067 0.067 0.867]
 [0.083 0.083 0.833]
 [0.083 0.083 0.833]
 [0.056 0.056 0.889]]
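In the output above every word present in a text is assigned topic 2 and the first two rows of N_KV stay at zero, so the sampler has collapsed onto the last topic instead of drawing from the full conditional distribution. A minimal sketch of what a single sampling step could look like when the topic is actually drawn at random is given below. The helper name `sample_topic` and the toy count matrices are assumptions for illustration; the formula is the same product of the two smoothed count ratios used in the listing.

```python
import numpy as np


def sample_topic(m, v, n_mk, n_kv, alpha, beta, rng):
    """Draw a topic for word v in text m from the full conditional.

    The counts n_mk and n_kv must already exclude the current assignment
    of this position (the "decrease the counts" step).
    """
    word_part = (n_kv[:, v] + beta[v]) / (n_kv.sum(axis=1) + beta.sum())
    topic_part = (n_mk[m, :] + alpha) / (n_mk[m, :].sum() + alpha.sum())
    p = word_part * topic_part
    p = p / p.sum()  # normalize over the K topics
    return int(rng.choice(len(p), p=p)), p


# Toy counts (hypothetical): K = 3 topics, V = 4 words, M = 2 texts
n_kv = np.array([[2., 0., 1., 0.],
                 [0., 3., 0., 1.],
                 [1., 0., 0., 2.]])
n_mk = np.array([[3., 2., 1.],
                 [0., 2., 2.]])
alpha = np.full(3, 1 / 3)
beta = np.full(4, 1 / 4)

rng = np.random.default_rng(0)
k_new, p = sample_topic(0, 0, n_mk, n_kv, alpha, beta, rng)
draws = [sample_topic(0, 0, n_mk, n_kv, alpha, beta, rng)[0] for _ in range(200)]
print(p, k_new, np.bincount(draws, minlength=3))
```

With a draw like this in step (3.b), different topics can be selected on different sweeps, so the counts no longer collapse onto a single topic by construction.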

Using sklearn

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.preprocessing import normalize
from sklearn.metrics import pairwise_distances

# Input word-text matrix: 11 words (rows), 9 documents (columns)
words = np.array([
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 2, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0]
])

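# Transpose the matrix so that rows are documents and columns are words
words = words.T

# Set the number of topics
K = 3

# Create the LDA model
lda = LatentDirichletAllocation(n_components=K, random_state=0)

# Fit the model
lda.fit(words)

# Normalize the word distribution of each topic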
normalized_components = normalize(lda.components_, norm='l1', axis=1)

# Print the word distribution of each topic
print("Word distribution of each topic:")
for topic_idx, topic in enumerate(normalized_components):
    print(f"Topic #{topic_idx + 1}:")
    print(" ".join([f"Word {word_idx}: {prob:.4f}" for word_idx, prob in enumerate(topic)]))

# Print the topic distribution of each document
doc_topic_distr = lda.transform(words)
print("\nTopic distribution of each document:")
for doc_idx, topic_dist in enumerate(doc_topic_distr):
    print(f"Document #{doc_idx + 1}:")
    print(" ".join([f"Topic {topic_idx + 1}: {prob:.4f}" for topic_idx, prob in enumerate(topic_dist)]))

# Print the perplexity of the model
print("\nModel perplexity:")
print(lda.perplexity(words))

# Exponentiate each document's topic distribution (for further analysis).
# transform() already returns normalized probabilities, so the exponentiated
# values below are not probabilities themselves.
exp_doc_topic_distr = np.exp(doc_topic_distr)
print("\nExponentiated topic distribution of each document:")
for doc_idx, topic_dist in enumerate(exp_doc_topic_distr):
    print(f"Document #{doc_idx + 1}:")
    print(" ".join([f"Topic {topic_idx + 1}: {prob:.4f}" for topic_idx, prob in enumerate(topic_dist)]))

# Compute and print the similarity between topics
similarity = 1 - pairwise_distances(normalized_components, metric='cosine')
print("\nSimilarity between topics:")
print(similarity)

Word distribution of each topic:
Topic #1:
Word 0: 0.0974 Word 1: 0.0972 Word 2: 0.0245 Word 3: 0.0244 Word 4: 0.0981 Word 5: 0.2434 Word 6: 0.0244 Word 7: 0.0244 Word 8: 0.1714 Word 9: 0.0244 Word 10: 0.1703
Topic #2:
Word 0: 0.0800 Word 1: 0.0201 Word 2: 0.1400 Word 3: 0.0201 Word 4: 0.0794 Word 5: 0.2600 Word 6: 0.1401 Word 7: 0.0201 Word 8: 0.0201 Word 9: 0.2001 Word 10: 0.0201
Topic #3:
Word 0: 0.0287 Word 1: 0.1144 Word 2: 0.0287 Word 3: 0.1998 Word 4: 0.0287 Word 5: 0.2007 Word 6: 0.0286 Word 7: 0.1998 Word 8: 0.1131 Word 9: 0.0286 Word 10: 0.0287

Topic distribution of each document:
Document #1:
Topic 1: 0.0735 Topic 2: 0.8578 Topic 3: 0.0687
Document #2:
Topic 1: 0.1206 Topic 2: 0.7614 Topic 3: 0.1180
Document #3:
Topic 1: 0.0734 Topic 2: 0.8579 Topic 3: 0.0687
Document #4:
Topic 1: 0.8186 Topic 2: 0.0943 Topic 3: 0.0871
Document #5:
Topic 1: 0.7582 Topic 2: 0.1231 Topic 3: 0.1187
Document #6:
Topic 1: 0.8784 Topic 2: 0.0592 Topic 3: 0.0624
Document #7:
Topic 1: 0.0896 Topic 2: 0.0904 Topic 3: 0.8200
Document #8:
Topic 1: 0.0876 Topic 2: 0.8258 Topic 3: 0.0866
Document #9:
Topic 1: 0.0658 Topic 2: 0.0582 Topic 3: 0.8760

Model perplexity:
19.391093970211013

Exponentiated topic distribution of each document:
Document #1:
Topic 1: 1.0763 Topic 2: 2.3579 Topic 3: 1.0712
Document #2:
Topic 1: 1.1281 Topic 2: 2.1413 Topic 3: 1.1253
Document #3:
Topic 1: 1.0761 Topic 2: 2.3582 Topic 3: 1.0712
Document #4:
Topic 1: 2.2673 Topic 2: 1.0989 Topic 3: 1.0910
Document #5:
Topic 1: 2.1344 Topic 2: 1.1310 Topic 3: 1.1261
Document #6:
Topic 1: 2.4071 Topic 2: 1.0610 Topic 3: 1.0644
Document #7:
Topic 1: 1.0938 Topic 2: 1.0946 Topic 3: 2.2704
Document #8:
Topic 1: 1.0915 Topic 2: 2.2837 Topic 3: 1.0904
Document #9:
Topic 1: 1.0680 Topic 2: 1.0599 Topic 3: 2.4014

Similarity between topics:
[[1.    0.647 0.678]
 [0.647 1.    0.536]
 [0.678 0.536 1.   ]]

Two case studies

Example 1

from gensim import corpora, similarities
from gensim.models import LdaModel, TfidfModel, CoherenceModel
# from gensim.parsing.preprocessing import preprocess_string  # preprocessing helper for English text (tokenization, stop-word removal, etc.)
import pyLDAvis.gensim_models  # visualization tool
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# Read the documents, one article per line
# 1. Preprocess the text, e.g. tokenize and remove stop words. For Chinese, use the jieba library for word segmentation, or the 微词云 website
with open('./data/LDA_test.txt') as f:
    stop_list = set('for a of the and to in'.split())  # normally you would load an existing stop-word list
    text = [[
        word for word in line.strip().lower().split() if word not in stop_list
    ] for line in f]
    print('Tokenized texts:')
    pprint(text)

Tokenized texts:
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

# 2. Build the dictionary and vectorize the corpus
dictionary = corpora.Dictionary(text)
corpus = [dictionary.doc2bow(doc) for doc in text]
print('Original bag-of-words representation:')
for c in corpus:  # each pair is (word id, number of occurrences)
    print(c)

# Compute TF-IDF
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
print('Converted to TF-IDF:')
for doc in corpus_tfidf:
    print(doc)

Original bag-of-words representation:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
[(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
[(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)]
[(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)]
[(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
[(24, 1), (26, 1), (27, 1), (28, 1)]
[(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
[(9, 1), (26, 1), (30, 1)]
Converted to TF-IDF:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.2944198962221451), (3, 0.2944198962221451), (4, 0.2944198962221451), (5, 0.4301019571350565), (6, 0.4301019571350565)]
[(2, 0.3726494271826947), (7, 0.5443832091958983), (8, 0.3726494271826947), (9, 0.3726494271826947), (10, 0.27219160459794917), (11, 0.3726494271826947), (12, 0.27219160459794917)]
[(4, 0.438482464916089), (10, 0.32027755044706185), (12, 0.32027755044706185), (13, 0.438482464916089), (14, 0.6405551008941237)]
[(3, 0.3449874408519962), (10, 0.5039733231394895), (13, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(8, 0.30055933182961736), (11, 0.30055933182961736), (12, 0.21953536176370683), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.24253562503633297), (25, 0.48507125007266594)]
[(24, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(24, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.40932115139771735), (30, 0.2801947048062438), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(9, 0.6282580468670046), (26, 0.45889394536615247), (30, 0.6282580468670046)]
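To see where these TF-IDF numbers come from, the first document can be recomputed by hand. The sketch below assumes gensim's default `TfidfModel` settings as I understand them (raw term frequency, idf = log2(N / df), then L2 normalization of each document vector); the helper `tfidf_weights` is my own name, not part of gensim.

```python
import math
from collections import Counter

# The nine tokenized documents printed earlier
texts = [
    ['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
    ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'management', 'system'],
    ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
    ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
    ['generation', 'random', 'binary', 'unordered', 'trees'],
    ['intersection', 'graph', 'paths', 'trees'],
    ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
    ['graph', 'minors', 'survey'],
]

N = len(texts)
# Document frequency: in how many documents each word occurs
df = Counter(word for doc in texts for word in set(doc))


def tfidf_weights(doc):
    """tf * log2(N / df), followed by L2 normalization (assumed defaults)."""
    tf = Counter(doc)
    raw = {w: tf[w] * math.log2(N / df[w]) for w in tf}
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {w: x / norm for w, x in raw.items()}


weights = tfidf_weights(texts[0])
for word in sorted(weights):
    print(f"{word}: {weights[word]:.4f}")
```

Words that occur in only one document (such as 'abc') get the larger weight, while words shared with another document (such as 'computer') get the smaller one, which matches the two distinct values in the first TF-IDF row above.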

# 3. Train the LDA model
num_topics = 3  # number of topics to find
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    alpha='auto',
    eta='auto',
    minimum_probability=0.001,
    passes=30,
    random_state=42)

# Print the topic distribution of each document
# (the model above is trained on the bag-of-words corpus, while the TF-IDF vectors are passed in here)
topic_result = [a for a in lda_model[corpus_tfidf]]
print('Topic distribution of each document:')
pprint(topic_result)

# Print the word distribution of each topic
print('Word distribution of each topic:')
for topic_id, topic in lda_model.print_topics(num_topics=num_topics,num_words=7):
    print(f"Topic {topic_id}: {topic}")

Topic distribution of each document:
[[(0, 0.012036153), (1, 0.023576388), (2, 0.9643874)],
 [(0, 0.012141197), (1, 0.9502056), (2, 0.03765317)],
 [(0, 0.01430587), (1, 0.94132924), (2, 0.04436493)],
 [(0, 0.014045403), (1, 0.94239825), (2, 0.043556374)],
 [(0, 0.9385156), (1, 0.023802863), (2, 0.037681542)],
 [(0, 0.014157631), (1, 0.027731901), (2, 0.9581105)],
 [(0, 0.016080055), (1, 0.031497534), (2, 0.95242244)],
 [(0, 0.011494063), (1, 0.022514513), (2, 0.96599144)],
 [(0, 0.017603736), (1, 0.034482174), (2, 0.94791406)]]
Word distribution of each topic:
Topic 0: 0.071*"user" + 0.071*"response" + 0.071*"relation" + 0.071*"measurement" + 0.071*"error" + 0.071*"time" + 0.071*"perceived"
Topic 1: 0.146*"system" + 0.079*"user" + 0.079*"eps" + 0.045*"response" + 0.045*"time" + 0.045*"opinion" + 0.045*"testing"
Topic 2: 0.086*"graph" + 0.086*"trees" + 0.060*"minors" + 0.034*"interface" + 0.034*"human" + 0.034*"survey" + 0.034*"computer"

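The above builds an LDA model for a fixed number of topics. In general, metrics can be used to evaluate how good a model is, and the same metrics can be used to choose the best number of topics. The metrics commonly used to evaluate LDA topic models are perplexity and topic coherence: the lower the perplexity or the higher the coherence, the better the model. Some studies suggest that perplexity is not a good metric, but the code below includes both approaches.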

# Compute perplexity
# (gensim's log_perplexity returns a per-word likelihood bound rather than the perplexity itself)
def perplexity(num_topics):
    ldamodel = LdaModel(corpus,
                        num_topics=num_topics,
                        id2word=dictionary,
                        passes=30)
#     print(ldamodel.print_topics(num_topics=num_topics, num_words=7))
    print(ldamodel.log_perplexity(corpus))
    return ldamodel.log_perplexity(corpus)


# Compute coherence
def coherence(num_topics):
    ldamodel = LdaModel(corpus,
                        num_topics=num_topics,
                        id2word=dictionary,
                        passes=30,
                        random_state=1)
#     print(ldamodel.print_topics(num_topics=num_topics, num_words=7))
    ldacm = CoherenceModel(model=ldamodel,
                           texts=text,
                           dictionary=dictionary,
                           coherence='c_v')
    print(ldacm.get_coherence())
    return ldacm.get_coherence()

x = range(3, 7)
# z = [perplexity(i) for i in x]  # use this line instead to work with perplexity
y = [coherence(i) for i in x]
plt.plot(x, y)
plt.xlabel('Number of topics')
plt.ylabel('Coherence')
plt.rcParams['font.sans-serif'] = ['SimHei']  # only needed when the labels contain Chinese characters
matplotlib.rcParams['axes.unicode_minus'] = False
plt.title('Coherence versus number of topics')
plt.show()  # coherence is highest with 3 topics, so we stay with three topics

0.4917412262984759
0.4752187209409696
0.47555284837020473
0.4802759060122252
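Reading the best topic number off the plot can also be done in code. The snippet below simply takes the four coherence values printed above (for 3 to 6 topics) and picks the topic count with the highest score; the variable names are mine, and in practice the scores would come straight from the `coherence` function.

```python
# Coherence scores printed above, for num_topics = 3, 4, 5, 6
topic_counts = list(range(3, 7))
coherence_scores = [0.4917412262984759, 0.4752187209409696,
                    0.47555284837020473, 0.4802759060122252]

# Pick the number of topics with the highest coherence
best_k, best_score = max(zip(topic_counts, coherence_scores), key=lambda pair: pair[1])
print(f"best number of topics: {best_k} (coherence {best_score:.4f})")
```

The differences between the four scores are small on a corpus of nine short documents, so this choice should be read as a rough guide rather than a firm result.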

[Figure: coherence versus number of topics]

lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary, 
    num_topics=3,
    alpha='auto',
    eta='auto',
    minimum_probability=0.001,
    passes=30,
    random_state=42)
# Compute document similarity
similarity = similarities.MatrixSimilarity(lda_model[corpus])
print('Similarity:')
pprint(list(similarity))

# Visualize the topic model
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
# Save to an HTML file
pyLDAvis.save_html(lda_display, 'lda_vis_1.html')

Similarity:
[array([1.   , 0.024, 0.03 , 0.027, 0.02 , 1.   , 1.   , 1.   , 1.   ],
      dtype=float32),
 array([0.024, 1.   , 1.   , 1.   , 0.014, 0.028, 0.031, 0.023, 0.036],
      dtype=float32),
 array([0.03 , 1.   , 1.   , 1.   , 0.016, 0.034, 0.037, 0.029, 0.042],
      dtype=float32),
 array([0.027, 1.   , 1.   , 1.   , 0.015, 0.03 , 0.033, 0.025, 0.039],
      dtype=float32),
 array([0.02 , 0.014, 0.016, 0.015, 1.   , 0.022, 0.023, 0.019, 0.026],
      dtype=float32),
 array([1.   , 0.028, 0.034, 0.03 , 0.022, 1.   , 1.   , 1.   , 1.   ],
      dtype=float32),
 array([1.   , 0.031, 0.037, 0.033, 0.023, 1.   , 1.   , 1.   , 1.   ],
      dtype=float32),
 array([1.   , 0.023, 0.029, 0.025, 0.019, 1.   , 1.   , 1.   , 1.   ],
      dtype=float32),
 array([1.   , 0.036, 0.042, 0.039, 0.026, 1.   , 1.   , 1.   , 1.   ],
      dtype=float32)]
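The blocks of 1.000 in this matrix are easier to understand once you recall that `MatrixSimilarity` compares documents by cosine similarity, here applied to their topic distribution vectors. The sketch below computes the cosine similarity by hand for three made-up topic distributions (`doc_a`, `doc_b`, `doc_c` are hypothetical values that only resemble the shape of the model output, not numbers taken from it).

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical topic distributions over 3 topics
doc_a = [0.012, 0.024, 0.964]  # dominated by topic 2
doc_b = [0.014, 0.028, 0.958]  # also dominated by topic 2
doc_c = [0.012, 0.950, 0.038]  # dominated by topic 1

print(round(cosine_similarity(doc_a, doc_b), 3))
print(round(cosine_similarity(doc_a, doc_c), 3))
```

Two documents dominated by the same topic point in almost the same direction, so their similarity rounds to 1, while documents dominated by different topics are close to orthogonal. With only three topics, the similarity matrix therefore mostly tells you which dominant topic each document shares.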

[Figure: pyLDAvis visualization of the topic model]

Example 2

A topic analysis case study using the reuters dataset from the nltk library.

import nltk
from nltk.corpus import reuters
from gensim import corpora, models
import pyLDAvis.gensim_models
import pyLDAvis

# Download the nltk datasets
nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')

# Stop-word list
stop_words = set(nltk.corpus.stopwords.words('english'))

# Get the document IDs
documents = reuters.fileids()
print(f"Number of documents: {len(documents)}")

# Text preprocessing function
def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())
    return [word for word in tokens if word.isalpha() and word not in stop_words]

# Tokenize and remove stop words
texts = [preprocess(reuters.raw(doc_id)) for doc_id in documents]

# Print the first two preprocessed documents
print("Example of preprocessed texts:")
print(texts[:2])

# Create the dictionary
dictionary = corpora.Dictionary(texts)

# Create the bag-of-words corpus (word counts per document)
corpus = [dictionary.doc2bow(text) for text in texts]

# Print the bag-of-words representation of the first two documents
print("Example of bag-of-words representation:")
print(corpus[:2])

# Set the parameters of the LDA model
num_topics = 10  # number of topics to find

# Train the LDA model
lda_model = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    update_every=1,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

# Print the word distribution of each topic
print("Word distribution of each topic:")
for idx, topic in lda_model.print_topics(num_topics=num_topics, num_words=5):
    print(f"Topic {idx}: {topic}")

# Visualize the topic model
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, sort_topics=False)

# Display in a Jupyter Notebook
pyLDAvis.display(lda_display)

# Save to an HTML file
pyLDAvis.save_html(lda_display, 'lda_reuters.html')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of documents: 10788
Example of preprocessed texts:
[['asian', 'exporters', 'fear', 'damage', 'rift', 'mounting', 'trade', 'friction', 'japan', 'raised', 'fears', 'among', 'many', 'asia', 'exporting', 'nations', 'row', 'could', 'inflict', 'economic', 'damage', 'businessmen', 'officials', 'said', 'told', 'reuter', 'correspondents', 'asian', 'capitals', 'move', 'japan', 'might', 'boost', 'protectionist', 'sentiment', 'lead', 'curbs', 'american', 'imports', 'products', 'exporters', 'said', 'conflict', 'would', 'hurt', 'tokyo', 'loss', 'might', 'gain', 'said', 'impose', 'mln', 'dlrs', 'tariffs', 'imports', 'japanese', 'electronics', 'goods', 'april', 'retaliation', 'japan', 'alleged', 'failure', 'stick', 'pact', 'sell', 'semiconductors', 'world', 'markets', 'cost', 'unofficial', 'japanese', 'estimates', 'put', 'impact', 'tariffs', 'billion', 'dlrs', 'spokesmen', 'major', 'electronics', 'firms', 'said', 'would', 'virtually', 'halt', 'exports', 'products', 'hit', 'new', 'taxes', 'would', 'able', 'business', 'said', 'spokesman', 'leading', 'japanese', 'electronics', 'firm', 'matsushita', 'electric', 'industrial', 'co', 'ltd', 'lt', 'tariffs', 'remain', 'place', 'length', 'time', 'beyond', 'months', 'mean', 'complete', 'erosion', 'exports', 'goods', 'subject', 'tariffs', 'said', 'tom', 'murtha', 'stock', 'analyst', 'tokyo', 'office', 'broker', 'lt', 'james', 'capel', 'co', 'taiwan', 'businessmen', 'officials', 'also', 'worried', 'aware', 'seriousness', 'threat', 'japan', 'serves', 'warning', 'us', 'said', 'senior', 'taiwanese', 'trade', 'official', 'asked', 'named', 'taiwan', 'trade', 'trade', 'surplus', 'billion', 'dlrs', 'last', 'year', 'pct', 'surplus', 'helped', 'swell', 'taiwan', 'foreign', 'exchange', 'reserves', 'billion', 'dlrs', 'among', 'world', 'largest', 'must', 'quickly', 'open', 'markets', 'remove', 'trade', 'barriers', 'cut', 'import', 'tariffs', 'allow', 'imports', 'products', 'want', 'defuse', 'problems', 'possible', 'retaliation', 'said', 'paul', 'sheen', 'chairman', 'textile', 'exporters', 'lt', 'taiwan', 
'safe', 'group', 'senior', 'official', 'south', 'korea', 'trade', 'promotion', 'association', 'said', 'trade', 'dispute', 'japan', 'might', 'also', 'lead', 'pressure', 'south', 'korea', 'whose', 'chief', 'exports', 'similar', 'japan', 'last', 'year', 'south', 'korea', 'trade', 'surplus', 'billion', 'dlrs', 'billion', 'dlrs', 'malaysia', 'trade', 'officers', 'businessmen', 'said', 'tough', 'curbs', 'japan', 'might', 'allow', 'producers', 'semiconductors', 'third', 'countries', 'expand', 'sales', 'hong', 'kong', 'newspapers', 'alleged', 'japan', 'selling', 'semiconductors', 'electronics', 'manufacturers', 'share', 'view', 'businessmen', 'said', 'commercial', 'advantage', 'would', 'outweighed', 'pressure', 'block', 'imports', 'view', 'said', 'lawrence', 'mills', 'federation', 'hong', 'kong', 'industry', 'whole', 'purpose', 'prevent', 'imports', 'one', 'day', 'extended', 'sources', 'much', 'serious', 'hong', 'kong', 'disadvantage', 'action', 'restraining', 'trade', 'said', 'last', 'year', 'hong', 'kong', 'biggest', 'export', 'market', 'accounting', 'pct', 'domestically', 'produced', 'exports', 'australian', 'government', 'awaiting', 'outcome', 'trade', 'talks', 'japan', 'interest', 'concern', 'industry', 'minister', 'john', 'button', 'said', 'canberra', 'last', 'friday', 'kind', 'deterioration', 'trade', 'relations', 'two', 'countries', 'major', 'trading', 'partners', 'serious', 'matter', 'button', 'said', 'said', 'australia', 'concerns', 'centred', 'coal', 'beef', 'australia', 'two', 'largest', 'exports', 'japan', 'also', 'significant', 'exports', 'country', 'meanwhile', 'diplomatic', 'manoeuvres', 'solve', 'trade', 'continue', 'japan', 'ruling', 'liberal', 'democratic', 'party', 'yesterday', 'outlined', 'package', 'economic', 'measures', 'boost', 'japanese', 'economy', 'measures', 'proposed', 'include', 'large', 'supplementary', 'budget', 'record', 'public', 'works', 'spending', 'first', 'half', 'financial', 'year', 'also', 'call', 'spending', 'emergency', 'measure', 
'stimulate', 'economy', 'despite', 'prime', 'minister', 'yasuhiro', 'nakasone', 'avowed', 'fiscal', 'reform', 'program', 'deputy', 'trade', 'representative', 'michael', 'smith', 'makoto', 'kuroda', 'japan', 'deputy', 'minister', 'international', 'trade', 'industry', 'miti', 'due', 'meet', 'washington', 'week', 'effort', 'end', 'dispute'], ['china', 'daily', 'says', 'vermin', 'eat', 'pct', 'grain', 'stocks', 'survey', 'provinces', 'seven', 'cities', 'showed', 'vermin', 'consume', 'seven', 'pct', 'china', 'grain', 'stocks', 'china', 'daily', 'said', 'also', 'said', 'year', 'mln', 'tonnes', 'pct', 'china', 'fruit', 'output', 'left', 'rot', 'mln', 'tonnes', 'pct', 'vegetables', 'paper', 'blamed', 'waste', 'inadequate', 'storage', 'bad', 'preservation', 'methods', 'said', 'government', 'launched', 'national', 'programme', 'reduce', 'waste', 'calling', 'improved', 'technology', 'storage', 'preservation', 'greater', 'production', 'additives', 'paper', 'gave', 'details']]
Example of bag-of-words representation:
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 4), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 5), (25, 1), (26, 2), (27, 1), (28, 1), (29, 1), (30, 4), (31, 2), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 2), (53, 1), (54, 2), (55, 1), (56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 1), (62, 1), (63, 2), (64, 6), (65, 1), (66, 1), (67, 2), (68, 2), (69, 1), (70, 1), (71, 4), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 3), (80, 1), (81, 6), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 2), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 4), (104, 1), (105, 1), (106, 1), (107, 5), (108, 1), (109, 1), (110, 1), (111, 3), (112, 1), (113, 1), (114, 1), (115, 1), (116, 12), (117, 4), (118, 1), (119, 1), (120, 4), (121, 3), (122, 1), (123, 1), (124, 2), (125, 4), (126, 1), (127, 2), (128, 1), (129, 1), (130, 1), (131, 1), (132, 3), (133, 1), (134, 2), (135, 1), (136, 1), (137, 1), (138, 1), (139, 1), (140, 1), (141, 2), (142, 1), (143, 1), (144, 1), (145, 1), (146, 1), (147, 2), (148, 1), (149, 1), (150, 4), (151, 1), (152, 3), (153, 1), (154, 1), (155, 1), (156, 1), (157, 1), (158, 1), (159, 1), (160, 1), (161, 1), (162, 1), (163, 1), (164, 1), (165, 1), (166, 1), (167, 1), (168, 2), (169, 2), (170, 1), (171, 1), (172, 1), (173, 1), (174, 1), (175, 1), (176, 1), (177, 1), (178, 1), (179, 1), (180, 2), (181, 1), (182, 1), (183, 2), (184, 1), (185, 1), (186, 1), (187, 1), (188, 1), (189, 3), (190, 1), (191, 1), (192, 1), (193, 1), (194, 1), (195, 1), (196, 1), (197, 1), (198, 1), (199, 1), (200, 1), (201, 1), (202, 1), (203, 1), (204, 1), (205, 1), (206, 1), (207, 2), (208, 1), (209, 1), (210, 
1), (211, 1), (212, 1), (213, 16), (214, 1), (215, 1), (216, 1), (217, 3), (218, 2), (219, 1), (220, 2), (221, 1), (222, 1), (223, 1), (224, 1), (225, 1), (226, 1), (227, 1), (228, 1), (229, 1), (230, 3), (231, 2), (232, 1), (233, 1), (234, 1), (235, 1), (236, 1), (237, 1), (238, 1), (239, 3), (240, 1), (241, 4), (242, 1), (243, 1), (244, 5), (245, 1), (246, 1), (247, 1), (248, 1), (249, 1), (250, 2), (251, 1), (252, 1), (253, 1), (254, 15), (255, 1), (256, 2), (257, 1), (258, 1), (259, 2), (260, 1), (261, 1), (262, 1), (263, 1), (264, 1), (265, 1), (266, 1), (267, 1), (268, 2), (269, 1), (270, 4), (271, 1), (272, 4), (273, 1)], [(6, 1), (97, 1), (154, 2), (180, 4), (213, 3), (272, 1), (274, 1), (275, 1), (276, 1), (277, 1), (278, 4), (279, 1), (280, 1), (281, 2), (282, 1), (283, 1), (284, 1), (285, 1), (286, 2), (287, 1), (288, 1), (289, 1), (290, 1), (291, 1), (292, 1), (293, 1), (294, 1), (295, 2), (296, 2), (297, 1), (298, 1), (299, 1), (300, 1), (301, 1), (302, 1), (303, 2), (304, 1), (305, 2), (306, 2), (307, 1), (308, 1), (309, 2), (310, 1), (311, 2), (312, 2)]]
Word distribution of each topic:
Topic 0: 0.037*"said" + 0.033*"oil" + 0.015*"lt" + 0.010*"gas" + 0.010*"company"
Topic 1: 0.035*"said" + 0.016*"would" + 0.016*"trade" + 0.007*"japan" + 0.006*"government"
Topic 2: 0.035*"said" + 0.014*"wheat" + 0.010*"grain" + 0.009*"gulf" + 0.009*"corn"
Topic 3: 0.053*"billion" + 0.036*"mln" + 0.034*"dlrs" + 0.033*"pct" + 0.031*"year"
Topic 4: 0.143*"vs" + 0.109*"mln" + 0.067*"cts" + 0.066*"net" + 0.054*"loss"
Topic 5: 0.050*"said" + 0.034*"dlrs" + 0.030*"lt" + 0.026*"mln" + 0.024*"company"
Topic 6: 0.048*"pct" + 0.035*"tonnes" + 0.032*"mln" + 0.024*"january" + 0.023*"february"
Topic 7: 0.073*"cts" + 0.046*"april" + 0.042*"record" + 0.039*"dividend" + 0.036*"lt"
Topic 8: 0.043*"said" + 0.030*"shares" + 0.023*"lt" + 0.023*"stock" + 0.021*"pct"
Topic 9: 0.033*"said" + 0.028*"bank" + 0.016*"market" + 0.015*"dollar" + 0.014*"pct"

[Figure: pyLDAvis visualization of the Reuters topic model]

II. The PageRank Algorithm

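PageRank is a method for computing the importance of web pages on the Internet, and its definition extends to computing the importance of the nodes of any directed graph. The basic idea is to define a random walk model, i.e. a first-order Markov chain, on the directed graph, describing how a walker randomly visits the nodes along the directed edges. Under certain conditions, the probability of visiting each node converges in the limit to the stationary distribution, and the probability value of each node is then its PageRank value, which expresses the relative importance of the node.
A random walk model, i.e. a first-order Markov chain, can be defined on a directed graph, where nodes represent states and directed edges represent transitions between states, assuming that the transition probabilities from a node to all the nodes it links to are equal. The transition probabilities are represented by the transition matrix $M$:
$M = [ m _ { i j } ] _ { n \times n }$
where the element $m _ { i j }$ in row $i$ and column $j$ is the probability of jumping from node $j$ to node $i$.
When a directed graph with $n$ nodes is strongly connected and aperiodic, the random walk model defined on it, i.e. a first-order Markov chain, has a stationary distribution, and the stationary distribution vector $R$ is called the PageRank of this directed graph. If the matrix $M$ is the transition matrix of the Markov chain, the vector $R$ satisfies $MR = R$. The components of $R$ are called the PageRank values of the nodes.
$R = \left[ \begin{array} { c } { P R ( v _ { 1 } ) } \\ { P R ( v _ { 2 } ) } \\ { \vdots } \\ { P R ( v _ { n } ) } \end{array} \right]$
where $P R ( v _ { i } ) , i = 1 , 2 , \cdots , n$, denotes the PageRank value of node $v_i$. This is the basic definition of PageRank.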
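As a quick illustration of the basic definition, the stationary vector can be found by repeatedly multiplying a starting distribution by $M$ (power iteration). The four-node graph in `links` below is a hypothetical example that I chose to be strongly connected and aperiodic; the iteration cap and tolerance are arbitrary choices.

```python
import numpy as np

# Hypothetical directed graph on 4 nodes: {source node: [target nodes]}
links = {0: [1, 2, 3], 1: [0, 3], 2: [0], 3: [1, 2]}
n = len(links)

# m_ij is the probability of jumping from node j to node i,
# so every column of M sums to 1
M = np.zeros((n, n))
for j, targets in links.items():
    for i in targets:
        M[i, j] = 1 / len(targets)

# Power iteration: start from the uniform distribution and apply M
# until the vector stops changing
R = np.full(n, 1 / n)
for _ in range(1000):
    R_next = M @ R
    converged = np.abs(R_next - R).sum() < 1e-12
    R = R_next
    if converged:
        break

print(R)  # approximately [1/3, 2/9, 2/9, 2/9]
```

Solving $MR = R$ by hand for this graph gives $R = (1/3, 2/9, 2/9, 2/9)$, so node 0, which receives links from three different nodes, ends up with the largest PageRank value.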
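The conditions of the basic definition of PageRank often cannot be met in practice, so it is extended to give the general definition of PageRank. On any directed graph with $n$ nodes, a random walk model, i.e. a first-order Markov chain, can be defined whose transition matrix is a linear combination of two parts: one part follows the transition matrix $M$, with equal transition probabilities from a node to all the nodes it links to; the other part follows a completely random transition matrix, with transition probability $1 / n$ from any node to any node. This Markov chain has a stationary distribution, and the stationary distribution vector $R$ is called the general PageRank of this directed graph. It satisfies
$R = d M R + \frac { 1 - d } { n } 1$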