在python中,如何使用word2vec来计算句子的相似度呢?
第一种解决方法
如果使用word2vec,需要计算每个句子/文档中所有单词的平均向量,并使用向量之间的余弦相似度来计算句子相似度,代码示例如下
import numpy as np
from scipy import spatial
index2word_set = set(model.index2word)
def avg_feature_vector(sentence, model, num_features, index2word_set):
words = sentence.split()
feature_vec = np.zeros((num_features, ), dtype='float32')
n_words = 0
for word in words:
if word in index2word_set:
n_words += 1
feature_vec = np.add(feature_vec, model[word])
if (n_words > 0):
feature_vec = np.divide(feature_vec, n_words)
return feature_vec
计算相似度:
s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)
> 0.915479828613
第二种解决思路
Word2Vec有一些扩展用于比较较长的文本,可以解决短语或句子比较的问题。其中之一是paragraph2vec或doc2vec。
详见“分布式句子和文档表示”http://cs.stanford.edu/~quocle/paragraph_vector.pdf
http://rare-technologies.com/doc2vec-tutorial/
其他解决方法
要计算句子相似度,也可以使用Word Mover距离算法。这里是一个easy description about WMD。
#load word2vec model, here GoogleNews is used
model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
#two sample sentences
s1 = 'the first sentence'
s2 = 'the second text'
#calculate distance between two sentences using WMD algorithm
distance = model.wmdistance(s1, s2)
print ('distance = %.3f' % distance)