Machine Learning Basic Algorithms 34: Topic Models and Practice

qq_42749341

Category: Machine Learning - Fundamentals

Article link: https://blog.csdn.net/qq_42749341/article/details/108440587

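Topic Models

Definition

A topic model is a statistical model that clusters the latent semantic structure of a corpus in an unsupervised way.

Topic models are mainly used for semantic analysis and text mining in natural language processing (NLP), for example collecting, classifying, and reducing the dimensionality of texts by topic; they are also used in bioinformatics research [2]. Latent Dirichlet Allocation (LDA) is a common topic model.

History of Topic Models

[Figure]

A Simple Introductory Example

Build a model to judge whether someone is an algorithm expert.
Features:

Wears a striped shirt
Has worked at BAT (Baidu, Alibaba, Tencent)
Has worked on large projects
Comparison with a topic model:
[Figure]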

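Background: SVD (Singular Value Decomposition)

1. Eigenvalues

[Figure]

2. SVD

[Figure]
Working through an SVD by hand on an example

Step 2: solve for the eigenvalues

Step 3: solve for the eigenvectors

Python implementation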

import numpy as np 
A = np.array([[2, 4], [1, 3], [0, 0], [0, 0]])
print(np.linalg.svd(A))
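
As a cross-check on the hand calculation, the sketch below uses the same matrix A and compares the singular values returned by `np.linalg.svd` with the square roots of the eigenvalues of A^T·A, then multiplies the factors back together. The variable names are illustrative choices, not taken from the original post.

```python
import numpy as np

# Same matrix as in the snippet above.
A = np.array([[2, 4], [1, 3], [0, 0], [0, 0]], dtype=float)

# full_matrices=False returns the compact form: U is 4x2, s has 2 entries, Vt is 2x2.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigvalsh returns ascending order, so reverse it
sqrt_eigvals = np.sqrt(eigvals)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt

print("singular values:   ", s)
print("sqrt(eig(A^T A)):  ", sqrt_eigvals)
print("reconstruction ok: ", np.allclose(A, A_rebuilt))
```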

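3. SVD and PCA

Understanding PCA

PCA is essentially a change of basis chosen so that the transformed data has maximal variance.
Data used for machine learning (mainly training data) is only informative when its variance is large. If every input were the same point, the variance would be 0 and the many inputs would be equivalent to a single one.
In the original space, we successively look for a set of mutually orthogonal axes: the first axis maximizes the variance; the second maximizes the variance within the subspace orthogonal to the first; the third maximizes the variance within the subspace orthogonal to the first two; and so on. In an N-dimensional space we can find N such axes. Keeping only the first r of them approximates the space and compresses it from N dimensions to r, and this choice of r axes minimizes the information lost in the compression.

Deriving PCA from SVD
SVD is effectively PCA in two directions.
PCA is the eigendecomposition of A·A^T.
In the SVD, the columns of U are the eigenvectors of A·A^T and the columns of V are the eigenvectors of A^T·A.
U can also be obtained from the computed V through the SVD formula A = U·Σ·V^T, which gives u_i = A·v_i / σ_i (as in the hand-worked example above).
[Figure]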
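
The relationship between the two can be checked numerically. The sketch below is a minimal illustration on made-up random data (the data, the sample-by-feature layout, and all variable names are assumptions): it computes the principal axes once from the covariance matrix and once from the SVD of the centered data, then keeps the first r axes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples (rows), 3 features (columns), stretched along the first feature.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
Xc = X - X.mean(axis=0)  # PCA works on centered data

# Route 1: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = s ** 2 / (len(Xc) - 1)     # squared singular values give the variances

# Keep the first r axes to compress from 3 dimensions to r.
r = 2
X_reduced = Xc @ Vt[:r].T

print("variances from covariance:", eigvals)
print("variances from SVD:       ", svd_variances)
```

The rows of `Vt` match the covariance eigenvectors up to sign, which is why either route can be used to obtain the principal axes.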

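PLSA: Probabilistic Latent Semantic Analysis

1. SVD

Reference blog: https://blog.csdn.net/chlele0105/article/details/12983833

A large matrix A describes the association between one million documents and five hundred thousand words. Each row of the matrix corresponds to a document and each column to a word.
[Figure]

The three matrices have a very clear physical meaning. The first matrix X is the result of classifying the documents: each column represents a topic, and each nonzero element gives the relevance between a topic and a document; the larger the value, the stronger the relevance. The last matrix Y describes 100 semantic classes (word classes) and the relevance of each class to the 500,000 words. The middle matrix gives the relevance between the document topics and the semantic classes. A single singular value decomposition of the association matrix A therefore performs near-synonym classification and document classification at the same time (and also yields the relevance between each class of documents and each class of words).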

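2. LSA

Unlike section 1 above, the rows and columns are transposed: each row corresponds to a word and holds its counts across the documents, and each column corresponds to a document and holds the distribution of words in it.
[Figure]
By property 1 of the singular value decomposition, matrix A can be decomposed into n singular values; by property 2, we keep the r largest ones in sorted order, so that U·S·V^T approximates A. Each column of U represents a latent semantic dimension whose meaning is a weighted combination of the m words. Because the columns of U are mutually orthogonal, the r latent dimensions span a semantic space. Each singular value in S indicates the importance of the corresponding latent dimension. Each column of V^T is still a document, but the document is now mapped into the semantic space. V^T is much smaller than A.
With V^T we effectively have another representation of A, and we can use V^T in place of A in subsequent work.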

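Procedure
(1) Analyze the document collection and build the term-document matrix A.
(2) Apply singular value decomposition to the term-document matrix.
(3) Reduce the dimensionality of the decomposed matrices.
(4) Use the reduced matrices to construct the latent semantic space.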
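
The four steps above can be sketched with NumPy alone. This is a minimal illustration, not the article's example: the toy corpus, the choice r = 2, and the helper `cosine` are all assumptions made for the sketch.

```python
import numpy as np

# Hypothetical toy corpus (already tokenized); not taken from the original article.
docs = [
    ["stock", "market", "invest"],
    ["stock", "market", "rich"],
    ["real", "estate", "invest"],
    ["real", "estate", "market"],
]

# (1) Build the term-document matrix A: rows are words, columns are documents.
vocab = sorted({w for d in docs for w in d})
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        A[vocab.index(w), j] += 1

# (2) Singular value decomposition.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# (3) Dimensionality reduction: keep the r largest singular values.
r = 2
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r]

# (4) Latent semantic space: each column of doc_vecs is one document in r dimensions.
doc_vecs = np.diag(s_r) @ Vt_r

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("doc 0 vs doc 1:", cosine(doc_vecs[:, 0], doc_vecs[:, 1]))
print("doc 0 vs doc 2:", cosine(doc_vecs[:, 0], doc_vecs[:, 2]))
```

Comparing documents by cosine similarity in the reduced space is one common way to use the result; other distance measures would work with the same `doc_vecs`.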

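LSA example:
[Figure]
Singular value decomposition:

This means the documents have been mapped into a 3-dimensional semantic space, in which the first latent dimension can be expressed as

In the plot, each red point represents a word and each blue point represents a title, so we can cluster the words and the titles. For example, stock and market fall into one cluster, which matches the intuition that they often appear together; real and estate fall into another; words such as dads and guide look isolated, so we do not merge them. Among the titles, T1 and T3 form one cluster and T2, T4, T5, and T8 form another, so T1 and T3 are similar to each other, as are T2, T4, T5, and T8. Clustering in this way extracts sets of near-synonyms from the document collection, so that retrieval works at the semantic level (sets of near-synonyms) rather than at the level of individual words. This has two benefits. First, it reduces retrieval and storage costs, because compressing the document collection this way works on the same principle as PCA. Second, it improves the user experience: when the user enters a word, we can search across that word's set of near-synonyms, which a traditional index cannot do.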

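Advantages and disadvantages of LSA
Advantages

1) The low-dimensional representation captures synonyms: synonyms map to the same or similar topics.
2) Dimensionality reduction removes part of the noise and makes the features more robust.
3) It makes full use of redundant data.
4) It is unsupervised and fully automatic.
5) It is language-independent.

Disadvantages

1) LSA handles synonymy (several words with one meaning), which the vector space model cannot, but it does not handle polysemy (one word with several meanings). LSA maps each word to a single point in the latent semantic space, so the different senses of a word all correspond to the same point and are not distinguished.
2) The optimization objective of SVD is based on the L2 norm (Frobenius norm), which implicitly assumes Gaussian-distributed data. Term counts are non-negative, which clearly violates the Gaussian assumption and is closer to a multinomial distribution. (Why this is so needs further study.)
3) The directions of the singular vectors have no corresponding physical interpretation.
4) SVD is computationally expensive, and when new documents arrive the model has to be retrained to be updated.
5) There is no probabilistic model of term counts.
6) For count vectors, Euclidean distance is not an appropriate measure (reconstruction can produce negative values).
7) The choice of the number of dimensions is ad hoc.
8) LSA shares the weakness of the bag-of-words model: it ignores the order of words within a document or sentence.
9) The implicit probabilistic model of LSA assumes that documents and words follow a joint normal distribution, whereas observed data look closer to a Poisson distribution. PLSA, an improvement on LSA, therefore uses a multinomial distribution and performs better than LSA.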
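
Disadvantages 2) and 6) can be made concrete with a small experiment. The sketch below uses a made-up count matrix (an assumption for illustration) and shows that the Frobenius error of the rank-r reconstruction equals the root of the sum of the squared discarded singular values, which is the quantity truncated SVD minimizes. It also prints the smallest reconstructed entry so you can see whether the reconstruction stays non-negative for this particular matrix.

```python
import numpy as np

# Hypothetical count matrix (rows: terms, columns: documents).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 2],
    [1, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]   # rank-r reconstruction

# Truncated SVD is optimal in the Frobenius norm; the error is determined
# entirely by the discarded singular values.
frobenius_error = np.linalg.norm(A - A_r, ord="fro")
discarded = np.sqrt(np.sum(s[r:] ** 2))

print("Frobenius error:                ", frobenius_error)
print("sqrt(sum of discarded sigma^2): ", discarded)
print("smallest reconstructed entry:   ", A_r.min())
```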

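LSA: Python implementation
gensim

class gensim.models.lsimodel.LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=numpy.float64)

Key parameters:

corpus: the text corpus
num_topics: the number of latent dimensions to keep
id2word: the mapping from IDs to words

The object provides the following attributes and methods:

LsiModel.projection.u: the left singular vectors;
LsiModel.projection.s: the singular values;
add_documents(): update the model with new documents;
get_topics(): get the vector representation of all latent topics;
save(): save the model to disk;
load(): load a model from disk;
print_topic(topicno, topn=10): return, as a string, the top topn words of latent topic number topicno;
print_topics(num_topics=20, num_words=10): return, as strings, the first num_topics latent topics, each described by num_words words;
show_topic(topicno, topn=10): get the words that define latent topic number topicno together with their contributions;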

from gensim.test.utils import common_corpus,common_dictionary,get_tmpfile
from gensim.models import LsiModel
# Build the model
print(common_corpus)
'''
#[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1),......
The output is the bag-of-words representation: tuples of (id, count)
# The corpus is: common_texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]
'''

model = LsiModel(common_corpus[:3], id2word=common_dictionary, num_topics=3)
# Map a document into the semantic space
vector = model[common_corpus[4]]
# Update the model with new documents
model.add_documents(common_corpus[4:])

tmp_fname = get_tmpfile("lsi.model")
model.save(tmp_fname)  # save model
loaded_model = LsiModel.load(tmp_fname)  # load model

umatri = loaded_model.projection.u
print(umatri)
print(umatri.shape)
ss = loaded_model.projection.s
print(ss)
allt = loaded_model.get_topics()
print(allt)
t1 = loaded_model.print_topic(1,topn=12)
print(t1)
s1 = loaded_model.show_topic(1,topn=12)
print(s1)

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
[[ 0.31308271 0.04475507 0.19130952]
[ 0.06218628 0.02061632 0.33478657]
[ 0.19775074 0.06519182 0.65889099]
[ 0.39382732 0.05923975 -0.3405499 ]
[ 0.29531217 -0.18289329 -0.18223293]
[ 0.38646088 0.06871426 0.18062738]
[ 0.39382732 0.05923975 -0.3405499 ]
[ 0.52939178 0.10381526 -0.01644547]
[ 0.13556445 0.04457551 0.32410443]
[ 0.0363079 -0.53458542 0.11367757]
[ 0.08395235 -0.65620128 0.0321324 ]
[ 0.07242846 -0.46795875 -0.01133593]]

(12, 3)
[3.03555067 2.51654597 1.88136386]
[[ 0.31308271 0.06218628 0.19775074 0.39382732 0.29531217 0.38646088
0.39382732 0.52939178 0.13556445 0.0363079 0.08395235 0.07242846]
[ 0.04475507 0.02061632 0.06519182 0.05923975 -0.18289329 0.06871426
0.05923975 0.10381526 0.04457551 -0.53458542 -0.65620128 -0.46795875]
[ 0.19130952 0.33478657 0.65889099 -0.3405499 -0.18223293 0.18062738
-0.3405499 -0.01644547 0.32410443 0.11367757 0.0321324 -0.01133593]]

-0.656*"graph" + -0.535*"trees" + -0.468*"minors" + -0.183*"survey" + 0.104*"user" + 0.069*"system" + 0.065*"interface" + 0.059*"time" + 0.059*"response" + 0.045*"computer" + 0.045*"eps" + 0.021*"human"

[('graph', -0.6562012837041503), ('trees', -0.5345854212634996), ('minors', -0.46795875376068946), ('survey', -0.18289328901588417), ('user', 0.10381525918751365), ('system', 0.06871425877803433), ('interface', 0.06519182403581872), ('time', 0.05923975195143014), ('response', 0.059239751951430136), ('computer', 0.044755068341685994), ('eps', 0.044575507236083514), ('human', 0.02061631679973524)]

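3. PLSA

Reference links: https://www.cnblogs.com/Determined22/p/7237111.html

https://blog.csdn.net/pipisorry/article/details/42560693

The pLSA model is a directed graphical model. It treats the topic as a latent variable, builds a simple Bayesian network, and estimates the model parameters with the EM algorithm.
Because PLSA is a transitional model between LSA and LDA, it is rarely used in practice and does not need to be studied in as much depth.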

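PLSA Principles

Applications

1. PLSA as a document generation model

This resembles a dice-rolling game. Once the number of topics is fixed, we build a "document-topic" die, which is the topic distribution, and then "topic-word" dice, which are the word distributions. Both kinds of distribution are multinomial.

Choosing a topic and choosing a word are both random processes: first draw a topic from the topic distribution {education: 0.5, economy: 0.3, transport: 0.2}, say education, then draw a word from that topic's word distribution {university: 0.5, teacher: 0.3, course: 0.2}, say university.
[Figure]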
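
The two-stage draw can be simulated directly. In the sketch below only the topic distribution and the education word distribution come from the example above; the word distributions for the other two topics, the function name `generate_document`, and the random seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Document-topic "die": the topic distribution from the example above.
topics = ["education", "economy", "transport"]
p_topic_given_doc = np.array([0.5, 0.3, 0.2])

# Topic-word "dice". Only the education row comes from the text;
# the other two rows are made-up values for illustration.
words = {
    "education": (["university", "teacher", "course"], [0.5, 0.3, 0.2]),
    "economy": (["market", "company", "finance"], [0.6, 0.2, 0.2]),
    "transport": (["highway", "bus", "subway"], [0.4, 0.3, 0.3]),
}

def generate_document(n_words):
    """Draw a topic for every word position, then draw the word from that topic."""
    doc = []
    for _ in range(n_words):
        z = str(rng.choice(topics, p=p_topic_given_doc))  # roll the document-topic die
        vocab, probs = words[z]
        w = str(rng.choice(vocab, p=probs))               # roll that topic's word die
        doc.append((z, w))
    return doc

doc = generate_document(10)
for z, w in doc:
    print(f"{z:10s} -> {w}")
```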

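2. Inferring the topic distribution from documents

The goal of topic modeling is to automatically discover the topics (topic distributions) in a document collection.

The document d and the word w are observable, but the topic z is hidden.
[Figure]
The figure means that every word in a document is produced by first choosing a topic and then choosing a word from that topic; the words of one document do not all have to come from the same topic z (z sits inside the inner plate).
d and w are obtained from the sample, so for any document its p(w|d) is known. Under the model, p(w|d) = Σ_z p(w|z)·p(z|d), which is what allows the hidden p(z|d) and p(w|z) to be estimated from the observed p(w|d).
[Figure]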

3. EM derivation for the PLSA algorithm

Expectation maximization
Iterate until convergence
[Figure]
General steps of the EM algorithm:
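
A minimal sketch of EM for PLSA follows, assuming the standard formulation: the E-step computes the posterior P(z|d,w) ∝ P(z|d)·P(w|z), and the M-step re-estimates P(w|z) and P(z|d) from the expected counts n(d,w)·P(z|d,w). The function name `plsa`, the toy count matrix, and the small constant used to avoid division by zero are assumptions of this sketch and are not taken from the original figures.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM. counts[d, w] is how often word w occurs in document d."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization, normalized so every row is a probability distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w), shape (n_docs, n_words, n_topics).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        p_w_d = joint.sum(axis=2)                        # P(w | d)
        posterior = joint / np.maximum(p_w_d[:, :, None], 1e-12)

        # Log-likelihood of the data under the current parameters.
        log_likelihoods.append(
            float(np.sum(counts * np.log(np.maximum(p_w_d, 1e-12))))
        )

        # M-step: re-estimate both multinomials from the expected counts.
        expected = counts[:, :, None] * posterior        # n(d, w) * P(z | d, w)
        p_w_z = expected.sum(axis=0).T
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_d = expected.sum(axis=1)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
    return p_z_d, p_w_z, log_likelihoods

# Toy document-word counts: 4 documents, 4 words.
counts = np.array([
    [4, 3, 0, 0],
    [5, 2, 1, 0],
    [0, 0, 3, 4],
    [0, 1, 2, 5],
], dtype=float)
p_z_d, p_w_z, ll = plsa(counts, n_topics=2)
print("P(z|d):\n", p_z_d.round(3))
print("P(w|z):\n", p_w_z.round(3))
```

The log-likelihood list is returned so that convergence can be inspected; in EM it should never decrease from one iteration to the next.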

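LDA

Latent Dirichlet Allocation
Main applications: text topic identification, text classification, and text similarity computation.
It is unsupervised and requires a document collection and the number of topics.
Each topic can be represented by a set of words.

Model diagram:

[Figure]

Compared with PLSA, LDA adds two Dirichlet priors: one on the document-topic distributions and one on the topic-word distributions.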

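Case study: topic prediction with gensim

1. Steps:

1. Inspect the data

2. Tokenize: jieba, df.apply(fun)

3. Preprocess: remove punctuation, English text, and stop words (a stop-word list is required)

4. Build the dictionary:

  from gensim import corpora
  dictionary = corpora.Dictionary(tokenized_corpus)

#Dictionary(615 unique tokens: ['高血压', '减轻', '周身', '咳嗽', '流涕']...)
Here tokenized_corpus is the list of documents after tokenization and preprocessing.
5. Build the sparse matrix

  doc_term_matrix = [dictionary.doc2bow(doc) for doc in tokenized_corpus]

#[[(0, 1)], [(1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 2), (15, 1), (16, 1)],…

6. Train the model

   lda = LdaModel(doc_term_matrix, num_topics=20, id2word=dictionary)

7. Save the model

    lda.save(fname="yiyaoLdaModel")

8. Predict the topic distribution of a given document

Preprocess the text and convert it to a bag-of-words ("word + count") representation
Tokenize
Remove stop words
dictionary.doc2bow(list(text_list))  # convert to bag-of-words
Predict with the LDA model: lda_vector = lda[dictionary.doc2bow(list(text_list))]  # topic + weight
print(max(lda_vector, key=lambda item: item[1]))  # id of the highest-weight topic + its weight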
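
To make the (id, count) format of steps 4, 5, and 8 concrete, here is a simplified pure-Python stand-in. It is not gensim's implementation (gensim may assign ids in a different order), and `build_dictionary`, this `doc2bow`, the toy tokens, and the `lda_vector` values are all hypothetical.

```python
from collections import Counter

def build_dictionary(tokenized_corpus):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_corpus:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(token2id, tokens):
    """Return sorted (token_id, count) pairs; tokens missing from the dictionary are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical tokenized corpus for illustration.
tokenized_corpus = [["cough", "fever"], ["fever", "headache", "fever"]]
token2id = build_dictionary(tokenized_corpus)
bow = doc2bow(token2id, ["fever", "fever", "cough", "unknown"])
print(bow)

# A made-up topic distribution in the (topic_id, weight) format used in step 8.
lda_vector = [(0, 0.12), (3, 0.71), (7, 0.17)]
best_topic = max(lda_vector, key=lambda item: item[1])  # compare by weight, not by id
print(best_topic)
```

Words that never appeared during dictionary construction are dropped at prediction time, which is why the preprocessing for new text has to match the preprocessing used for training.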

2. Code

"""
@version: 3.7
@author: jiayalu
@file: ldaModel.py
@time: 09/08/2019 09:36
@description: Extract topics with an LDA model trained with gensim
Steps:
    1. Inspect the data
    2. Tokenize: jieba, df.apply(fun)
    3. Preprocess: remove punctuation, English text, and stop words
    4. Build the dictionary: from gensim import corpora        dictionary = corpora.Dictionary(tokenized_corpus)
    5. Build the sparse matrix
    6. Train the model
    7. Predict the topic distribution of a given document


"""
# Inspect the data, select the useful columns, and rename the headers
import pandas as pd
# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"E:\pythoncode\drugdescrib1.csv",encoding="utf-8")

print(df.head())

df.rename(columns = {"【药品名称】":"name","【适应症】":"illness"},inplace=True)
df1 = df[["name","illness"]]
print(df1.head())
# Tokenize
import jieba
def chinese_word_cut(mytext):
    return " ".join(jieba.cut(str(mytext)))
print(df1.illness.head())
df["content_citted"] = df1.illness.apply(chinese_word_cut)
print(df.content_citted.head())

# Preprocess: remove stop words and punctuation
import re
from nltk.corpus import stopwords

def load_punctuations():
    pun_list = []
    with open("./pun_list.txt", encoding="utf-8") as fr:  # punctuation list, one symbol per line
        for line in fr:
            line = line.strip()
            pun_list.append(line)
    return pun_list

# English stop words
# english_stopwords = stopwords.words("english")
with open("chinese_stopwords.txt", encoding="utf-8") as f:  # Chinese stop-word file
    chinese_stopwords = {line.strip() for line in f}
pun_list = load_punctuations()

def clean_text(text):
    text = text.strip()
    for pun in pun_list:
        text = text.replace(pun , " ")
    # new_text = ' '.join([w for w in text.split() if w not in english_stopwords and w not in chinese_stopwords and len(w)>1])
    new_text = ' '.join([w for w in text.split() if  w not in chinese_stopwords and len(w) > 1])
    return new_text


import gensim
from gensim.models.ldamodel import LdaModel
from gensim import corpora
from nltk import wordpunct_tokenize


class Token_Corpus(object):
    def __init__(self, corpus):
        self.corpus = corpus

    def __iter__(self):
        for text in self.corpus:
            text = text.strip()
            text = clean_text(text)
            yield self.tokenize(text)

    def tokenize(self, text):
        token = wordpunct_tokenize(text)
        return token

document = list(df["content_citted"])
print(document)
tokenized_corpus = Token_Corpus(document)
print(tokenized_corpus)
# for i in tokenized_corpus:
#     print(i)

# Build the dictionary
dictionary = corpora.Dictionary(tokenized_corpus)
dictionary.filter_extremes(no_below=10,no_above=0.10)
print(dictionary)#Dictionary(615 unique tokens: ['高血压', '减轻', '周身', '咳嗽', '流涕']...)
# for i in dictionary:
#     print(i)  # integer ids 0, 1, 2, 3, ...


# Build the sparse matrix
class MyCorpus(object):
    def __init__(self, token_list, dictionary):
        self.token_list = token_list
        self.dictionary = dictionary

    def __iter__(self):
        for tokens in self.token_list:
            yield self.dictionary.doc2bow(tokens)

doc_term_matrix = [dictionary.doc2bow(doc) for doc in tokenized_corpus]
print(doc_term_matrix)
# corpus = MyCorpus(tokenized_corpus,dictionary)
# mm_corpus = gensim.corpora.MmCorpus('data_science.mm')
# print(mm_corpus)

# Train the model

# lda  = LdaModel(doc_term_matrix,num_topics=20,id2word=dictionary,)
# print(lda.print_topics(num_topics=5,num_words=5))
# lda.save(fname= "yiyaoLdaModel")
text = """
1  肾动脉狭窄有报道称：ACE抑制剂可能使单侧或者双侧肾动脉狭窄患者的血肌酐或者血尿素氮（BUN）升高，但还没有在此类患者中长期使用本品的经验，但是可能会出现类似的结果。
 2  肾功能损害在那些肾功能依赖于肾素-血管紧张素-醛固酮系统活性的患者中（如严重的充分性心力衰竭患者）使用ACE抑制剂和AT1 受体拮抗剂,可能出现少尿和/或进行性氮质免疫,
 性肾功能衰竭和/或死亡(罕见)，在此类患者中使用奥美沙坦酯治疗。3  胎儿/新生儿发病和死亡3.胎儿/新生儿发病和死亡对D类妊娠（第Ⅱ期和第Ⅲ期），直接作用于RAS的药物与胎儿
 和新生儿的损伤有关。一旦发现妊娠，应当尽快停止使用本品。如果必须用药，应当告知这些孕妇关于药物对他们胎儿的潜在危害，并进行系列超声波检查来评估羊膜内的情况。曾经在
 子宫内与血管紧张素Ⅱ受体拮抗剂接触过的婴儿应密切监测其血压过低，少尿和高血钾的情况，必要时做适当的治疗。 4  血容量不足或者低钠患者的低血压血容量不足或者低钠患者
 （例如那些使用大剂量利尿剂治疗的患者），在首次服用本品后可能会发生症状性低血压，必须在周密的医疗监护下使用该药治疗。如果发生低血压，患者应仰卧，必要时静脉注射生理盐水。
 一旦血压稳定，可继续用本品治疗。"""
text1 ="""
适用于敏感菌（不产β内酰胺酶菌株）所致的下列感染：1溶血链球菌、肺炎链球菌、葡萄球菌或流感嗜血杆菌所致中耳炎、鼻窦炎、咽炎、扁桃体炎等上呼吸道感染。
2大肠埃希菌、奇异变形杆菌或粪肠球菌所致的泌尿生殖道感染。
3溶血链球菌、葡萄球菌或大肠埃希菌所致的皮肤软组织感染。
4溶血链球菌、肺炎链球菌、葡萄球菌或流感嗜血杆菌所致急性支气管炎、肺炎等下呼吸道感染。5急性单纯性淋病。7本品尚可用于治疗伤寒"""
lda = LdaModel.load("yiyaoLdaModel")
text_list = clean_text(chinese_word_cut(text)).split(" ")  # token list after segmentation, e.g. ['适用', '敏感', '不产', '内酰胺酶' ..
print(text_list)
print("bow___________________________")
print(dictionary.doc2bow(list(text_list)))  # bag-of-words (word id + count), e.g. [(29, 1), (30, 1), (31, 1), (32, 2), (33, 2), (34, 1), (35, 1), (36, 1), (37, 3), (38, 1), (39, 1), (40, 1), (41, 3), (114, 1), (137, 1), (145, 1)]
lda_vector = lda[dictionary.doc2bow(list(text_list))]
print(lda_vector)  # topic id + weight
# Prints the most likely Topic. Performs Max based on the 2nd element in the tuple
print(max(lda_vector, key=lambda item: item[1]))  # id of the highest-weight topic + its weight
print(lda.print_topic(max(lda_vector, key=lambda item: item[1])[0]))
# print()

3. Partial results

[Figure]

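Case study: topic prediction with sklearn

1. Steps

1. Load the data
Read, tokenize, write out
The data looks like "沙瑞金向毛娅打听他们家在京州的别墅，毛娅笑着说，王大路事业有成之后，要给欧阳菁和她公司的股权，她们没有要，王大路就在京州帝豪园买了三套别墅，可是李达康和易学习都不要，这些房子都在王大路的名下，欧阳菁好像去住过，毛娅不想去，她觉得房子太大很浪费，自己家住得就很踏实。"

2. Load the stop words
Read them into a list

3. Count word frequencies in the training corpus

from sklearn.feature_extraction.text import CountVectorizer
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases
# Training corpus; res1, res2, res3, ... are the texts that were read in
corpus = [res1, res2, res3]
# Build the vectorizer and fit it
cntVector = CountVectorizer(stop_words=stopwrd_list)
cntTf = cntVector.fit_transform(corpus)
# Save it so it can be loaded at test time
joblib.dump(cntVector, "cntVector")

cntTf: (document index, word index) count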

4. Train the model

from sklearn.decomposition import LatentDirichletAllocation
import pickle
lda = LatentDirichletAllocation(n_components=2, max_iter=5,learning_method='online',learning_offset=50.,random_state=0)
docres = lda.fit(cntTf)
with open("lda.pickle", "wb") as fw:
    pickle.dump(docres,fw)
print(docres)

5. Inspect the topic words

def print_top_words(lda, features_names, n_top_words):
    for topic_idx, topic in enumerate(lda.components_):
        print("Topic#%d"%topic_idx)
        print(" ".join([features_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]))
n_top_words = 50
tf_fearures_names = cntVector.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
print_top_words(lda,tf_fearures_names,n_top_words)

6. Predict
Load the data, load the fitted word-count vectorizer, convert the format, load the model, predict

cut_content("nlp_test2.txt")
res2 = get_corpus("cut_nlp_test2.txt")
corpus2 = [res2]
cntVectorload = joblib.load("cntVector")
print(cntVectorload)
cntTf2 = cntVectorload.transform(corpus2)
print(cntTf2)
cntTf2.toarray()
with open("lda.pickle", "rb") as fr:
    lda = pickle.load(fr)
topic_dist = lda.transform(cntTf2)
print(topic_dist)

[Figure]
Note: use fit to train the model and transform to predict with it.
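
The slice `topic.argsort()[:-n_top_words-1:-1]` used in step 5 is compact but easy to misread. The sketch below replays it on a made-up weight matrix (the values and the names `components` and `feature_names` are assumptions standing in for `lda.components_` and the vectorizer's vocabulary).

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics over a 5-word vocabulary.
components = np.array([
    [0.1, 2.0, 0.5, 3.0, 0.2],
    [1.5, 0.1, 0.3, 0.2, 4.0],
])
feature_names = ["w0", "w1", "w2", "w3", "w4"]
n_top_words = 3

top_words = []
for topic in components:
    # argsort is ascending; [:-n_top_words-1:-1] walks backwards from the end,
    # giving the indices of the n_top_words largest weights, largest first.
    top_idx = topic.argsort()[:-n_top_words - 1:-1]
    top_words.append([feature_names[i] for i in top_idx])

# Normalizing each row turns the weights into a topic-word distribution.
topic_word = components / components.sum(axis=1, keepdims=True)

print(top_words)  # [['w3', 'w1', 'w2'], ['w4', 'w0', 'w2']]
```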

2. Code

"""
@version: 3.7
@author: jiayalu
@file: ldaOfsklearn.py
@time: 13/08/2019 09:45
@description: Train an LDA model with the sklearn library
Steps:
1. Load the data and the stop-word list, converting both to lists
2. Count word frequencies with CountVectorizer and save the vectorizer so new texts can be transformed for prediction
3. Train the LDA model
4. Predict
"""

import jieba
# Load the data and tokenize
def cut_content(filename):
    with open(filename, "r", encoding="utf-8") as f:
        document = f.read()
    document_cut = jieba.cut(document)
    result = " ".join(document_cut)
    # The with blocks close both files, so no explicit close() is needed
    with open("cut_" + filename, "w", encoding="utf-8") as f2:
        f2.write(result)
# Load the stop-word list
with open("stopword.txt", "r", encoding="utf-8") as stopwrd_dic:
    stopwrd_list = stopwrd_dic.read().splitlines()

# Read the preprocessed data
def get_corpus(filename):
    with open(filename, "r", encoding="utf-8")as f:
        resf = f.read()
    return resf

cut_content("nlp_test1.txt")
res1 = get_corpus("cut_nlp_test1.txt")

# LDA is usually trained on raw word counts; tf-idf is rarely used
# from sklearn.feature_extraction.text import TfidfVectorizer
# corpus = [res1]
# vector = TfidfVectorizer(stop_words=stopwrd_list)
# tfidf = vector.fit_transform(corpus)
# print(vector.get_feature_names())
# print(tfidf)
# print(tfidf.shape)
#
# wordlist = vector.get_feature_names()  # all words in the bag-of-words model
# # tf-idf matrix: element a[i][j] is the tf-idf weight of word j in text i
# weightlist = tfidf.toarray()
# print(weightlist)
# for i in range(len(weightlist)):
#     print("------- tf-idf weights of the words in text", i, "------")
#     for j in range(len(wordlist)):
#         print(wordlist[j],weightlist[i][j])
# Count word frequencies
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases
import pickle
corpus1 = [res1]
cntVector = CountVectorizer(stop_words=stopwrd_list)
cntTf = cntVector.fit_transform(corpus1)
joblib.dump(cntVector, "cntVector")  # save cntVector to avoid recomputing it
print(cntTf)
print(cntTf.toarray())  # document-term count matrix
# Train the model
lda = LatentDirichletAllocation(n_components=2, max_iter=5,learning_method='online',learning_offset=50.,random_state=0)
docres = lda.fit(cntTf)
with open("lda.pickle", "wb") as fw:
    pickle.dump(docres,fw)
print(docres)

#topic top words
def print_top_words(lda, features_names, n_top_words):
    for topic_idx, topic in enumerate(lda.components_):
        print("Topic#%d"%topic_idx)
        print(" ".join([features_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]))
n_top_words = 50
tf_fearures_names = cntVector.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
print_top_words(lda,tf_fearures_names,n_top_words)


# Predict
cut_content("nlp_test2.txt")
res2 = get_corpus("cut_nlp_test2.txt")
corpus2 = [res2]
cntVectorload = joblib.load("cntVector")
print(cntVectorload)
cntTf2 = cntVectorload.transform(corpus2)
print(cntTf2)
cntTf2.toarray()
with open("lda.pickle", "rb") as fr:
    lda = pickle.load(fr)
topic_dist = lda.transform(cntTf2)
print(topic_dist)

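3. Analysis of the results

The output is the document-topic matrix: the probability that the predicted text belongs to each topic.
By looking at each topic's word distribution, you can obtain the topic words for the predicted text.
[Figure]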

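LDA Principles

1. The Dirichlet function

Dirichlet function
Definition: a function defined on the real numbers whose range is discrete. It is symmetric about the y-axis (an even function) and discontinuous everywhere.
Its range is {0, 1}: D(x) = 1 when x is rational, and D(x) = 0 when x is irrational.

2. The Dirichlet distribution

[Figure]

3. Conjugate distributions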

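The LDA Bayesian model

LDA is a Bayesian model made up of three parts:
prior distribution + likelihood (data) = posterior distribution
Example: good people and bad people.
Prior: 100 good, 100 bad
Data: 2 good, 1 bad
Posterior: 102 good, 101 bad
This posterior then becomes the prior for the next update, and the process repeats.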
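
The counting example can be written out as a few lines of Python. This is only an illustration of the pseudo-count bookkeeping; the function name `update` and the second batch of data are made up.

```python
def update(prior, observed):
    """Add the observed counts to the prior pseudo-counts to get the posterior."""
    return {k: prior[k] + observed.get(k, 0) for k in prior}

prior = {"good": 100, "bad": 100}
posterior = update(prior, {"good": 2, "bad": 1})       # the data from the example above
p_good = posterior["good"] / sum(posterior.values())   # posterior mean of P(good)

# The posterior becomes the prior for the next batch (made-up data).
posterior2 = update(posterior, {"good": 0, "bad": 5})

print(posterior)    # {'good': 102, 'bad': 101}
print(posterior2)   # {'good': 102, 'bad': 106}
```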

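Binomial and Beta distributions

Binomial distribution (likelihood):
[Figure]
Prior distribution

We want the posterior obtained from prior + likelihood (binomial) to be usable as the next prior. To achieve this we look for a distribution that is conjugate to the binomial, namely the Beta distribution.
[Figure]
Posterior probability

So the posterior is indeed a Beta distribution.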
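
For reference, the conjugacy argument can be written out as follows. The notation is an assumption of this sketch (θ for the success probability, k successes in n trials, α and β for the Beta parameters) and may differ from the symbols in the original figures.

```latex
% Likelihood: k successes in n Bernoulli trials with success probability \theta
P(k \mid n, \theta) = \binom{n}{k}\, \theta^{k} (1-\theta)^{n-k}

% Prior: Beta distribution with parameters \alpha, \beta
\mathrm{Beta}(\theta \mid \alpha, \beta)
  = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,
    \theta^{\alpha-1} (1-\theta)^{\beta-1}

% Posterior: multiply likelihood and prior, keep only the factors that depend on \theta
P(\theta \mid k, n) \;\propto\; \theta^{k} (1-\theta)^{n-k} \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1}
  = \theta^{\alpha+k-1} (1-\theta)^{\beta+n-k-1}

% which is again a Beta distribution
P(\theta \mid k, n) = \mathrm{Beta}(\theta \mid \alpha + k,\; \beta + n - k)

% Expectation of the prior and of the posterior
E[\theta] = \frac{\alpha}{\alpha+\beta},
\qquad
E[\theta \mid k, n] = \frac{\alpha + k}{\alpha + \beta + n}
```

With α = β = 100 and data of 2 successes and 1 failure, the posterior is Beta(102, 101), matching the good/bad counting example.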

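Expectation

Multinomial and Dirichlet distributions

Extending the binomial and Beta distributions above gives the multinomial and Dirichlet distributions.

Likelihood (multinomial distribution):
[Figure]

Prior probability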
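
The same conjugacy argument carries over from two outcomes to K outcomes. As before, the notation is an assumption of this sketch (p for the K outcome probabilities, n_i for the observed counts with N = Σ n_i, α for the Dirichlet parameters) rather than a transcription of the original figures.

```latex
% Likelihood: multinomial counts n = (n_1, \dots, n_K), with N = \sum_i n_i
\mathrm{Mult}(n \mid p, N) = \frac{N!}{\prod_{i=1}^{K} n_i!}\, \prod_{i=1}^{K} p_i^{\,n_i}

% Prior: Dirichlet distribution with parameters \alpha = (\alpha_1, \dots, \alpha_K)
\mathrm{Dir}(p \mid \alpha)
  = \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}\,
    \prod_{i=1}^{K} p_i^{\,\alpha_i - 1}

% Posterior: the exponents add, so the posterior is again a Dirichlet distribution
P(p \mid n) \;\propto\; \prod_{i=1}^{K} p_i^{\,n_i} \cdot \prod_{i=1}^{K} p_i^{\,\alpha_i - 1}
  = \prod_{i=1}^{K} p_i^{\,\alpha_i + n_i - 1}
\quad\Longrightarrow\quad
P(p \mid n) = \mathrm{Dir}(p \mid \alpha + n)

% Expectation of each component under the prior
E[p_i] = \frac{\alpha_i}{\sum_{j=1}^{K} \alpha_j}
```

Setting K = 2 recovers the binomial and Beta case.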

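4. The LDA topic model

[Figure]

The topic-word distributions are defined over the whole corpus, whereas the document-topic distribution is defined per document.
For a given text, the topic distribution is first determined through the Dirichlet-multinomial pair; then, for each topic, the word distribution is determined from the trained Dirichlet-multinomial pair. For example: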
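
A minimal sketch of this generative process is shown below. The vocabulary, the number of topics, the prior values `alpha` and `beta`, and the function name `generate_document` are all made up for illustration; in a trained model the topic-word distributions would come from the fitted model rather than from a fresh Dirichlet draw.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["university", "teacher", "market", "finance", "highway", "subway"]
n_topics = 3
alpha = np.full(n_topics, 0.5)     # Dirichlet prior over topics per document (assumed value)
beta = np.full(len(vocab), 0.5)    # Dirichlet prior over words per topic (assumed value)

# Corpus level: one word distribution per topic, each drawn from Dirichlet(beta).
topic_word = rng.dirichlet(beta, size=n_topics)        # shape (n_topics, len(vocab))

def generate_document(n_words):
    """Generate one document: draw its topic mix, then a topic and a word per position."""
    # Document level: this document's topic distribution, drawn from Dirichlet(alpha).
    theta = rng.dirichlet(alpha)
    tokens = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)              # topic for this word position
        w = rng.choice(len(vocab), p=topic_word[z])    # word drawn from that topic
        tokens.append(vocab[w])
    return theta, tokens

theta, tokens = generate_document(8)
print("topic distribution:", theta.round(3))
print("tokens:", tokens)
```

The split between the corpus-level draw (`topic_word`) and the per-document draw (`theta`) mirrors the statement above that topic-word distributions belong to the corpus while the topic distribution belongs to a single document.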
