1. Environment
Single machine, Windows, Python 3.6, the gensim module
References:
https://pypi.org/project/gensim/
https://radimrehurek.com/gensim/
https://www.jianshu.com/p/6e07729c6c5b
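2. Installing gensim https://pypi.org/project/gensim/
In most cases gensim can be installed directly with pip install -U gensim
Without network access, download the corresponding installation package on another machine and install it offline (gensim's dependencies are not bundled, so each one has to be downloaded and installed as well).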
3. Computing document similarity with gensim https://radimrehurek.com/gensim/similarities/docsim.html
3.1 Cosine similarity
a) gensim.similarities.docsim.MatrixSimilarity (dense matrix, computed in memory)
b) gensim.similarities.docsim.Similarity (dynamic index split into shards on disk; use it when the corpus is too large for MatrixSimilarity or SparseMatrixSimilarity to handle in memory)
c) gensim.similarities.docsim.SparseMatrixSimilarity (sparse vector input, computed in memory)
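To make the quantity concrete, here is a minimal pure-Python sketch of cosine similarity over raw term counts. It only illustrates the measure that the three classes above are built around; it is not gensim's implementation, and the function name cosine_similarity and the third sample document are made up for this example.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two tokenized documents, using raw term counts."""
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    # Only terms shared by both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_a * norm_b)


docs = [["love", "China"], ["weather", "sunny"], ["love", "sunny", "weather"]]
# Pairwise similarity matrix: sims[i][j] compares document i with document j.
sims = [[cosine_similarity(a, b) for b in docs] for a in docs]
```

Documents with no words in common score 0, and a document compared with itself scores 1 (up to floating-point rounding).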
3.2 WMD (Word Mover's Distance) similarity
gensim.similarities.docsim.WmdSimilarity
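The idea behind WMD is that two documents are close when the words of one can be "moved" cheaply onto the words of the other in embedding space. The exact distance requires solving an optimal-transport problem, so the sketch below computes the relaxed variant instead, which is only a lower bound on WMD and not what WmdSimilarity itself computes. The 2-D vectors and all function names are hypothetical. The final conversion 1 / (1 + distance) mirrors how, to my understanding, gensim turns the distance into a similarity.

```python
import math

# Hypothetical 2-D word embeddings, invented purely for illustration.
toy_vectors = {
    "love": (1.0, 0.0),
    "like": (0.9, 0.1),
    "China": (0.0, 1.0),
    "weather": (-1.0, 0.0),
    "sunny": (-0.9, -0.1),
}


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def one_sided_relaxed_wmd(doc_a, doc_b, vectors):
    """Each word in doc_a sends all of its weight to its nearest word in doc_b."""
    weight = 1.0 / len(doc_a)
    total = 0.0
    for word_a in doc_a:
        nearest = min(euclidean(vectors[word_a], vectors[word_b]) for word_b in doc_b)
        total += weight * nearest
    return total


def relaxed_wmd(doc_a, doc_b, vectors):
    """Larger of the two one-sided relaxations; a lower bound on the exact WMD."""
    return max(
        one_sided_relaxed_wmd(doc_a, doc_b, vectors),
        one_sided_relaxed_wmd(doc_b, doc_a, vectors),
    )


def distance_to_similarity(distance):
    # Maps a distance in [0, inf) to a similarity in (0, 1].
    return 1.0 / (1.0 + distance)


query = ["love", "China"]
near = relaxed_wmd(query, ["like", "China"], toy_vectors)
far = relaxed_wmd(query, ["weather", "sunny"], toy_vectors)
```

Unlike cosine similarity on word counts, this still rates ["like", "China"] as close to ["love", "China"] even though "like" and "love" are different tokens.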
4. Simple code
Input data (text): the documents after word segmentation, each one a list of tokens, e.g. [["love","China"], ["weather", "sunny"]]
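For English text, a token list of this shape can be produced with a few lines of standard-library Python. This is a minimal sketch: the tokenize helper and the sample sentences are made up for illustration, and Chinese text would need a dedicated word segmenter instead of a regular expression.

```python
import re


def tokenize(raw_text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", raw_text.lower())


raw_documents = ["I love China", "The weather is sunny"]
# One list of tokens per document, the shape expected by the code that follows.
tokenized_docs = [tokenize(doc) for doc in raw_documents]
```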
from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity, Similarity, MatrixSimilarity, SparseMatrixSimilarity
from gensim import corpora, models
# Input documents
text = [["love","China"], ["weather", "sunny"]]
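# Convert a similarity index into a list
# (iterating over an index yields, for each indexed document, its similarities to all documents)
def index2list(index):
    doc_sim_list = []
    for s in index:
        try:
            doc_sim_list.append(s)
        except Exception:
            print("there is something wrong at index : {0}".format(s))
    return doc_sim_list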
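## WmdSimilarity
# Train a word-vector model
model = Word2Vec(text, min_count=1)
# Compute WmdSimilarity. Pass the KeyedVectors (model.wv) rather than the model itself;
# recent gensim versions expect this. WMD also needs an extra optimal-transport
# package (pyemd or POT, depending on the gensim version).
index = WmdSimilarity(text, model.wv)
doc_sim_list = index2list(index)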
## Cosine similarity
# Build the token dictionary
dictionary = corpora.Dictionary(text)
# Convert each document into a bag-of-words vector
corpus = [dictionary.doc2bow(t) for t in text]
#SparseMatrixSimilarity
index = SparseMatrixSimilarity(corpus, num_features=len(dictionary))
doc_sim_list = index2list(index)
#MatrixSimilarity
index = MatrixSimilarity(corpus, num_features=len(dictionary))
doc_sim_list = index2list(index)
#Similarity
#idf computation
tfidf_model = models.TfidfModel(corpus)
tfidf = tfidf_model[corpus]
index = Similarity("Similarity-index", tfidf, num_features=len(dictionary))
doc_sim_list = index2list(index)
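To show what the doc2bow and TfidfModel steps above produce, here is a pure-Python sketch of both transformations. It assumes what I believe is TfidfModel's default weighting (raw term count times log base 2 of N/df, followed by L2 normalization); treat that as an assumption, not a statement of gensim's exact behavior. The helper names and the third sample document are hypothetical, and gensim's Dictionary may assign different ids.

```python
import math
from collections import Counter

text = [["love", "China"], ["weather", "sunny"], ["love", "sunny", "sunny"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in text:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Sorted (token_id, count) pairs, the shape that doc2bow returns."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())


corpus = [to_bow(doc) for doc in text]

# Document frequency: the number of documents each token id appears in.
doc_freq = Counter(token_id for bow in corpus for token_id, _ in bow)
num_docs = len(corpus)
idf = {tid: math.log2(num_docs / df) for tid, df in doc_freq.items()}


def to_tfidf(bow):
    """Weight counts by idf, drop zero weights, then scale to unit length."""
    weighted = [(tid, count * idf[tid]) for tid, count in bow if idf[tid] > 0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weighted]


tfidf_corpus = [to_tfidf(bow) for bow in corpus]
```

In the first document, "China" appears in only one document while "love" appears in two, so "China" receives the larger weight.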