NLP03 - Gensim Transformations and Similarity Queries

Abstract: A hands-on walk-through of the Gensim official documentation, recording the process as a reference for later study and for fellow learners.

Sources

Topics and Transformations https://radimrehurek.com/gensim/tut2.html
Similarity Queries https://radimrehurek.com/gensim/tut3.html

Notes and Getting-Started Code

All of the data below comes from the dataset generated in the previous post; here we load it and work with it. For the code that generated the dataset, see: http://blog.csdn.net/ld326/article/details/78353338
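For readers without the previous post at hand, here is a minimal sketch of how files like tmp/deerwester.dict and tmp/deerwester.mm are typically produced (assuming the standard nine-document Deerwester toy corpus from the Gensim tutorials, abbreviated to two documents here):

from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time"]  # ...plus seven more short documents
stoplist = set('for a of the and to in'.split())
texts = [[w for w in doc.lower().split() if w not in stoplist] for doc in documents]
dictionary = corpora.Dictionary(texts)  # maps each token to an integer id
dictionary.save('tmp/deerwester.dict')  # persist the dictionary
corpora.MmCorpus.serialize('tmp/deerwester.mm',
                           [dictionary.doc2bow(text) for text in texts])  # persist the bag-of-words corpus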

Transformations
import os

from gensim.models import ldamodel, hdpmodel

"""
向量转换
In this tutorial, I will show how to transform documents from one vector representation into another. 
This process serves two goals:
  1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
  2. To make the document representation more compact. This both improves efficiency (new representation consumes less resources) and efficacy (marginal data trends are ignored, noise-reduction).
在这个教程中,将合演示文档从一个向量向另一个向的转换,这个处理的目的:
   1. 去发现语料库中隐藏的结构,发现词之间的关系,并使用它们采用新的方法与更语义的方法去描述文档;
   2. 使文档表示更紧凑。提高效率[花更少的资源]与功效。
"""
from gensim import corpora, models, similarities
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the dataset generated in the first tutorial
if os.path.exists("tmp/deerwester.dict"):
    dictionary = corpora.Dictionary.load('tmp/deerwester.dict')
    corpus = corpora.MmCorpus('tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run the first tutorial to generate the dataset")

# First transformation: TF-IDF maps vectors from the integer-valued bag-of-words space to a real-valued
# space of the same dimensionality; rare features receive larger weights
doc_bow = [(0, 1), (1, 1)]
# step 1 -- initialize the model (queries must use the same dictionary, i.e. the same feature ids, as the training corpus)
tfidf = models.TfidfModel(corpus)
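# (Initialization doubles as training here: TfidfModel just makes one pass
# over the corpus to collect document frequencies.)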
# step 2 -- use the model to transform vectors
print('Transform: [(0, 1), (1, 1)] -> %s' % str(tfidf[doc_bow]))
# Transform the whole corpus: tfidf[corpus] merely creates a wrapper around it; the actual
# conversions are computed on the fly while iterating over the documents
corpus_tfidf = tfidf[corpus]
print('Transform the whole corpus:')
for doc in corpus_tfidf:
    print(doc)
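# The wrapper above recomputes the transformation on every pass. If the TF-IDF
# corpus will be reused often, it can be serialized to disk once instead
# (a sketch; 'tmp/corpus_tfidf.mm' is an arbitrary path chosen for this example):
corpora.MmCorpus.serialize('tmp/corpus_tfidf.mm', corpus_tfidf)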
# Second transformation: Latent Semantic Indexing (LSI): bow->tfidf->fold-in-lsi (a double wrapper over the original corpus)
# LSI maps documents from the bag-of-words or (better) the TF-IDF-weighted space into a latent space of lower dimensionality
# LSI supports incremental updates (see the sketch after the loop below)
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi.print_topics(2)  # LSI folds the TF-IDF corpus into a latent 2-D space (num_topics=2)
# Persist the model (same for tfidf, lda, ...)
lsi.save('tmp/model.lsi')
# Load the model back
lsi = models.LsiModel.load('tmp/model.lsi')
print('LSI topics:')
for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)
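# LSI models can be trained incrementally: add_documents() folds new documents
# into the existing model without retraining from scratch. A minimal sketch
# (new_docs is a made-up one-document batch over the same dictionary):
new_docs = [dictionary.doc2bow("human system trees".lower().split())]
lsi.add_documents(tfidf[new_docs])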

# Third transformation: Latent Dirichlet Allocation (LDA)
# LDA maps bag-of-words counts into a lower-dimensional topic space; it is a probabilistic extension of LSA.
# (The Gensim tutorial trains LDA on plain bag-of-words counts; this post feeds it the TF-IDF corpus, which Gensim also accepts.)
lda_m = ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda_m[corpus_tfidf]
print('LDA topics:')
for doc in corpus_lda:  # both bow->tfidf and tfidf->lda transformations are actually executed here
    print(doc)
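# LDA also supports online training: update() folds additional batches of
# documents into an existing model. A minimal sketch (more_docs is a made-up
# batch, run through the same tfidf transformation the model was trained on):
more_docs = [dictionary.doc2bow("graph minors survey".lower().split())]
lda_m.update(tfidf[more_docs])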

# Fourth transformation: Hierarchical Dirichlet Process (HDP)
# a non-parametric Bayesian method (note the missing num_topics parameter); still rough around the edges, use with care
hdp_m = hdpmodel.HdpModel(corpus_tfidf, id2word=dictionary)
corpus_hdp = hdp_m[corpus_tfidf]
print('HDP topics:')
for doc in corpus_hdp:  # both bow->tfidf and tfidf->hdp transformations are actually executed here
    print(doc)
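Beyond the per-document topic mixtures printed above, the learned topics themselves can be inspected as weighted word lists. A short sketch (output not reproduced here; the exact words vary between runs because both models are randomly initialized):

print('LDA topic words:')
print(lda_m.print_topics(2))
print('HDP topic words (HDP infers the number of topics itself):')
print(hdp_m.print_topics())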

Output:

Used files generated from first tutorial
Transform: [(0, 1), (1, 1)] -> [(0, 0.7071067811865476), (1, 0.7071067811865476)]
Transform the whole corpus:
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(2, 0.44424552527467476), (3, 0.3244870206138555), (4, 0.44424552527467476), (5, 0.44424552527467476), (6, 0.3244870206138555), (7, 0.44424552527467476)]
[(0, 0.5710059809418182), (3, 0.4170757362022777), (6, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (3, 0.7184811607083769), (8, 0.49182558987264147)]
[(4, 0.6282580468670046), (6, 0.45889394536615247), (7, 0.6282580468670046)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(5, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
LSI topics:
[(0, 0.066007833960902457), (1, -0.52007033063618502)]
[(0, 0.19667592859142424), (1, -0.76095631677000475)]
[(0, 0.08992639972446359), (1, -0.72418606267525076)]
[(0, 0.075858476521781348), (1, -0.63205515860034289)]
[(0, 0.10150299184980076), (1, -0.57373084830029542)]
[(0, 0.70321089393783143), (1, 0.1611518021402569)]
[(0, 0.87747876731198338), (1, 0.16758906864659284)]
[(0, 0.90986246868185816), (1, 0.14086553628718884)]
[(0, 0.61658253505692828), (1, -0.053929075663894474)]
LDA topics:
[(0, 0.43458819379220637), (1, 0.56541180620779363)]
[(0, 0.81496850961257628), (1, 0.18503149038742375)]
[(0, 0.42398633716223832), (1, 0.57601366283776179)]
[(0, 0.27284273175270496), (1, 0.72715726824729499)]
[(0, 0.78961713183549342), (1, 0.21038286816450658)]
[(0, 0.27046663485539763), (1, 0.72953336514460243)]
[(0, 0.23738848068832419), (1, 0.76261151931167581)]
[(0, 0.27880748145925172), (1, 0.72119251854074828)]
[(0, 0.64582453064154577), (1, 0.35417546935845429)]
HDP topics:
[(0, 0.52809018857755152), (1, 0.26553584360391191), (2, 0.051983022299278911), (3, 0.03886440092379026), (4, 0.029208773283318439), (5, 0.021975161934421561), (6, 0.016341636429796386), (7, 0.012334101686562229)]
[(0, 0.074099556860078936), (1, 0.76127241803674017), (2, 0.041537586886014721), (3, 0.030963727406811727), (4, 0.023292083182027672), (5, 0.017524199077066691), (6, 0.013031743285505505)]
[(0, 0.48437437363602542), (1, 0.32579435434464177), (2, 0.048112070195248331), (3, 0.035668320063564463), (4, 0.02681313352299294), (5, 0.020172686197132431), (6, 0.015001253112536254), (7, 0.011322426747553908)]
[(0, 0.34034073129107029), (1, 0.45101804381381155), (2, 0.052557228855974372), (3, 0.039278480525089025), (4, 0.029532032771654788), (5, 0.022218461058883556), (6, 0.016522573829929454), (7, 0.012470666950317679)]
[(0, 0.094173500744889241), (1, 0.070742471007394905), (2, 0.67977887835637896), (3, 0.039071041145108294), (4, 0.029387407680202364), (5, 0.022109782761934496), (6, 0.016441782387953123), (7, 0.012409688403626294)]
[(0, 0.6243532030336858), (1, 0.093931816536177784), (2, 0.070751477981715785), (3, 0.053146621419023092), (4, 0.039904675525158521), (5, 0.030018566595092239), (6, 0.022323090455904304), (7, 0.016848696236704659), (8, 0.012506198130003471)]
[(0, 0.68886300433857051), (1, 0.077826159523606567), (2, 0.058598719917462287), (3, 0.043975113679573102), (4, 0.033055336187415012), (5, 0.024868191804956103), (6, 0.018493053625294565), (7, 0.013957916979065739), (8, 0.010360473758344944)]
[(0, 0.49182518940144543), (1, 0.30015320078544222), (2, 0.052469639197019941), (3, 0.039153986517730457), (4, 0.029430280423350139), (5, 0.022140594553867854), (6, 0.016464688924120365), (7, 0.01242697745241703)]
[(0, 0.27845680026889291), (1, 0.51398051031892455), (2, 0.052268236069297966), (3, 0.039060866109315549), (4, 0.029386884419279093), (5, 0.022109784517034499), (6, 0.016441782381119849), (7, 0.012409688403623432)]
Similarity Queries
import os

from gensim import corpora, models, similarities

# Load the dataset generated in the first tutorial
if os.path.exists("tmp/deerwester.dict"):
    dictionary = corpora.Dictionary.load('tmp/deerwester.dict')
    corpus = corpora.MmCorpus('tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run the first tutorial to generate the dataset")

# Rebuild the TF-IDF and LSI transformations from the first part, so that the
# query below can be mapped into the same latent space
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf]  # double wrapper over the original corpus: bow->tfidf->fold-in-lsi
lsi.print_topics(2)  # LSI folds the TF-IDF corpus into a latent 2-D space (num_topics=2)

"""
相似度查询(Similarity Queries)
"""
if (os.path.exists("tmp/deerwester.dict")):
    dictionary = corpora.Dictionary.load('tmp/deerwester.dict')
    corpus = corpora.MmCorpus('tmp/deerwester.mm')
    print("Used files generated from first tutorial")
    print("使用第一个教程生成的数据集")
else:
    print("请运行第一个教程去生成数据集")
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
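# doc2bow maps each known token to its dictionary id and ignores unseen words
# ("interaction" is not in the Deerwester vocabulary), so vec_bow only contains
# the ids and counts for "human" and "computer"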
# convert the query to LSI space
vec_lsi = lsi[vec_bow]
print(vec_lsi)
# Build the index
# similarities.MatrixSimilarity is only appropriate when the whole set of vectors fits in RAM;
# for larger corpora, use the similarities.Similarity class, which shards the index across
# files on disk (see the sketch below)
index = similarities.MatrixSimilarity(lsi[corpus])
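# For corpora too large to fit in RAM, the Similarity class shards the index
# across files on disk instead. A minimal sketch (same 2-D LSI space as above;
# 'tmp/shard' is an arbitrary file-name prefix chosen for this example):
big_index = similarities.Similarity('tmp/shard', lsi[corpus], num_features=2)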
# Persist the index
index.save('tmp/deerwester.index')
# Load the index back
index = similarities.MatrixSimilarity.load('tmp/deerwester.index')
# Query
sims = index[vec_lsi]  # perform a similarity query against the corpus
print('Document: %s' % doc)
print(list(enumerate(sims)))
# Sort the query results in decreasing order of similarity
print('Results sorted by similarity:')
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)  # print sorted (document number, similarity score) 2-tuples
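The scores are cosine similarities in the LSI space, so they range from -1 (opposite direction) to 1 (same direction); the small negative values below mark documents essentially unrelated to the query. In practice one often keeps only hits above a cutoff instead of the full ranking; a small sketch (the 0.9 threshold is an arbitrary choice for illustration):

# keep only documents scoring above an arbitrary similarity threshold
hits = [(doc_id, score) for doc_id, score in enumerate(index[vec_lsi]) if score > 0.9]
print(hits)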

Output:

Used files generated from first tutorial
[(0, 0.079104751174449178), (1, 0.57328352430794038)]
Document: Human computer interaction
[(0, 0.99994081), (1, 0.99467081), (2, 0.99994278), (3, 0.999879), (4, 0.99935204), (5, -0.08804217), (6, -0.0515742), (7, -0.023664713), (8, 0.1938726)]
Results sorted by similarity:
[(2, 0.99994278), (0, 0.99994081), (3, 0.999879), (4, 0.99935204), (1, 0.99467081), (8, 0.1938726), (7, -0.023664713), (6, -0.0515742), (5, -0.08804217)]

Author: happyprince, http://blog.csdn.net/ld326/article/details/78357172
