gensim (3) -- Similarity Queries

This post covers querying for documents similar to a given query document.

As in the previous posts, we first convert the documents into vector representations:

from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
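As a quick sanity check (an optional step, not part of the original workflow), the learned vocabulary and the bag-of-words vector of the first document can be inspected:

# each token maps to an integer id; each document becomes a list of (id, count) pairs
print(dictionary.token2id)
print(corpus[0])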

Next, build an LSI (Latent Semantic Indexing) model, which uncovers relationships between words and topics; here each document is mapped into a 2-dimensional topic space:

from gensim import models
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)
import os

from gensim import similarities

index = similarities.MatrixSimilarity(lsi[corpus])  # index the whole corpus in LSI space
os.makedirs('tmp', exist_ok=True)  # the target directory must exist before saving
index.save('tmp/deerwester.index')
# if the index does not fit in RAM, use similarities.Similarity instead
index = similarities.MatrixSimilarity.load('tmp/deerwester.index')

sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))

# sort by descending similarity; each entry is (document_id, similarity)
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_id, score in sims:
    print((doc_id, score), documents[doc_id])

Results:
Documents no. 2 and no. 4 receive high similarity scores even though they share none of the query's keywords. This is one of the strengths of the LSI model: it analyzes latent semantics rather than literal keyword overlap.

The similarity scores range over [-1, 1]; the larger the value, the more similar the two documents.

[(0, 0.4618210045327159), (1, 0.07002766527900073)]
[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.09879464), (8, 0.050041765)]
(2, 0.9984453) The EPS user interface management system
(0, 0.998093) Human machine interface for lab abc computer applications
(3, 0.9865886) System and human system engineering testing of EPS
(1, 0.93748635) A survey of user opinion of computer system response time
(4, 0.90755945) Relation of user perceived response time to error measurement
(8, 0.050041765) Graph minors A survey
(7, -0.09879464) Graph minors IV Widths of trees and well quasi ordering
(6, -0.10639259) The intersection graph of paths in trees
(5, -0.12416792) The generation of random binary unordered trees
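The comment in the code above mentions similarities.Similarity for corpora whose index does not fit in RAM: it splits the index into shards on disk and queries them transparently. A minimal sketch using the same lsi model and vec_lsi query as above ('tmp/shard' is just an illustrative shard-file prefix):

from gensim import similarities

# Similarity keeps the index in disk shards, so it scales past RAM;
# the first argument is a filename prefix for the shard files
index = similarities.Similarity('tmp/shard', lsi[corpus], num_features=lsi.num_topics)
sims = index[vec_lsi]  # queried exactly like MatrixSimilarity
print(list(enumerate(sims)))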

Different models require different parameters. The tf-idf model only needs a single pass over the corpus to compute document frequencies for all features, while LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) are more involved and therefore take correspondingly more time to train.

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)  # step 1: fit the model on the corpus (one pass)
doc_bow = [(0, 1), (1, 1)]  # a toy bag-of-words document: tokens 0 and 1, once each
print(tfidf[doc_bow])  # step 2: apply the transformation to a vector
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

Output:

[(0, 0.7071067811865476), (1, 0.7071067811865476)]
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]

Combining tf-idf with LSI:

lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi_model[corpus_tfidf]

As the training log below shows, the first topic's top words are "trees", "graph", and "minors", while the second topic is dominated by words like "system", "user", and "eps". The first five documents are more related to the second topic, and the remaining four to the first.

2019-12-26 22:48:20,747 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-12-26 22:48:20,747 : INFO : built Dictionary(12 unique tokens: ['computer', 'time', 'minors', 'eps', 'user']...) from 9 documents (total 29 corpus positions)
2019-12-26 22:48:20,747 : INFO : collecting document frequencies
2019-12-26 22:48:20,747 : INFO : PROGRESS: processing document #0
2019-12-26 22:48:20,747 : INFO : calculating IDF weights for 9 documents and 12 features (28 matrix non-zeros)
2019-12-26 22:48:20,747 : INFO : using serial LSI version on this node
2019-12-26 22:48:20,747 : INFO : updating model with new documents
2019-12-26 22:48:20,747 : INFO : preparing a new chunk of documents
2019-12-26 22:48:20,748 : INFO : using 100 extra samples and 2 power iterations
2019-12-26 22:48:20,748 : INFO : 1st phase: constructing (12, 102) action matrix
2019-12-26 22:48:20,748 : INFO : orthonormalizing (12, 102) action matrix
2019-12-26 22:48:20,748 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2019-12-26 22:48:20,749 : INFO : computing the final decomposition
2019-12-26 22:48:20,749 : INFO : keeping 2 factors (discarding 47.565% of energy spectrum)
2019-12-26 22:48:20,749 : INFO : processed documents up to #9
2019-12-26 22:48:20,749 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"response" + 0.060*"time" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2019-12-26 22:48:20,749 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"
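If INFO logging is not enabled, the same topic descriptions can be obtained directly from the model; a minimal sketch using LsiModel.print_topics:

# print the two fitted topics as weighted combinations of words
for topic_no, topic in lsi_model.print_topics(num_topics=2):
    print(topic_no, topic)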

Print each document's relation to the two topics:

for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)
[(0, 0.06600783396089988), (1, -0.5200703306361856)] Human machine interface for lab abc computer applications
[(0, 0.19667592859142152), (1, -0.7609563167700063)] A survey of user opinion of computer system response time
[(0, 0.08992639972446036), (1, -0.7241860626752514)] The EPS user interface management system
[(0, 0.07585847652177807), (1, -0.632055158600343)] System and human system engineering testing of EPS
[(0, 0.10150299184979901), (1, -0.5737308483002965)] Relation of user perceived response time to error measurement
[(0, 0.7032108939378314), (1, 0.1611518021402546)] The generation of random binary unordered trees
[(0, 0.8774787673119839), (1, 0.16758906864658976)] The intersection graph of paths in trees
[(0, 0.9098624686818586), (1, 0.1408655362871855)] Graph minors IV Widths of trees and well quasi ordering
[(0, 0.6165825350569283), (1, -0.053929075663897125)] Graph minors A survey

Saving and loading models:

import os
import tempfile

with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
    lsi_model.save(tmp.name)  # same for tfidf, lda, ...

loaded_lsi_model = models.LsiModel.load(tmp.name)

os.unlink(tmp.name)

LDA (Latent Dirichlet Allocation) is a probabilistic extension of LSA; its topics can be interpreted as probability distributions over words.
Random Projections (RP) aim to reduce the dimensionality of the vector space, efficiently approximating tf-idf distances between documents.

model = models.RpModel(corpus_tfidf, num_topics=500)  # Random Projections over the tf-idf corpus
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
# HDP is a recent addition and still rough around the edges; use with care
model = models.HdpModel(corpus, id2word=dictionary)
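Whichever model is chosen, it is applied the same way as tfidf and lsi above: wrap a bag-of-words vector (or a whole corpus) with the trained model. A small sketch, assuming a toy LdaModel with 2 topics (the 100-topic setting above is meant for real corpora, not this 9-document example):

# train a tiny LDA model and inspect the topic mixture of a new document
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
new_bow = dictionary.doc2bow("human computer interaction".lower().split())
print(lda[new_bow])  # a list of (topic_id, probability) pairs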