Computing text similarity with the gensim TF-IDF model

姚贤贤

Category: Machine Learning. Tags: text similarity, gensim, tf-idf, machine learning, artificial intelligence

Article link: https://blog.csdn.net/u011311291/article/details/79158755


# coding=utf-8
'''
Created on 2018-1-24

Pros: the resulting similarity scores are good.
Cons: computing TF-IDF values requires a background corpus of several documents.
'''
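import jieba
from gensim import corpora, models, similarities

# gensim's models module can further transform a corpus, e.g. with TF-IDF, LSI, or LDA models
wordstest_model = ["我去玉龙雪山并且喜欢玉龙雪山玉龙雪山", "我在玉龙雪山并且喜欢玉龙雪山", "我在九寨沟"]
# Tokenize each sentence with jieba
test_model = [list(jieba.cut(words)) for words in wordstest_model]
# Map every distinct token to an integer id
dictionary = corpora.Dictionary(test_model, prune_at=2000000)
# Uncomment to inspect each token id, its token, and its document frequency:
# for key in dictionary.keys():
#     print(key, dictionary[key], dictionary.dfs[key])
# Convert each tokenized document into a bag of words: a list of (token id, count) pairs
corpus_model = [dictionary.doc2bow(test) for test in test_model]
print(corpus_model)
# Output from the original run (token ids may be assigned differently in other environments):
# [[(0, 1), (1, 3), (2, 1), (3, 1), (4, 1)], [(0, 1), (1, 2), (3, 1), (4, 1), (5, 1)], [(0, 1), (5, 1), (6, 1)]]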

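# This only fits the model; it is not the transformed corpus yet.
# The model stores per-term statistics such as document frequencies, from which idf is computed.
tfidf_model = models.TfidfModel(corpus_model)
# Apply the model to the corpus to obtain its TF-IDF vectors
corpus_tfidf = tfidf_model[corpus_model]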

# Test the model on a new text: test_bow supplies the term frequencies of this text,
# and tfidf_model supplies the idf weights. Tokens missing from the dictionary are ignored.
testword = "我在九寨沟,很喜欢"
test_bow = dictionary.doc2bow(list(jieba.cut(testword)))
test_tfidf = tfidf_model[test_bow]
print(test_tfidf)
# (token id, tf-idf weight) pairs
# [(4, 0.32718457421365993), (5, 0.32718457421365993), (6, 0.8865102981879297)]

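# Compute similarity
# Build an index over the TF-IDF vectors of all corpus documents
index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
# Cosine similarity between the test text and each corpus document
sims = index[test_tfidf]
print(sims)
# [ 0.07639694  0.2473283   0.94496047]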
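
To make the numbers above less of a black box, here is a minimal pure-Python sketch of the same calculation. It assumes gensim's default weighting (raw term frequency times log2(N / df), followed by L2 normalization) and it hard-codes the token lists that jieba is expected to produce for these sentences. The names `corpus_tokens`, `query_tokens`, `tfidf_vector` and `cosine` are made up for this illustration and are not part of gensim.

```python
import math
from collections import Counter

# Assumption: token lists as jieba is expected to split the article's sentences.
corpus_tokens = [
    ["我", "去", "玉龙雪山", "并且", "喜欢", "玉龙雪山", "玉龙雪山"],
    ["我", "在", "玉龙雪山", "并且", "喜欢", "玉龙雪山"],
    ["我", "在", "九寨沟"],
]
query_tokens = ["我", "在", "九寨沟", ",", "很", "喜欢"]

# Document frequency: number of corpus documents that contain each token.
doc_freq = Counter(token for doc in corpus_tokens for token in set(doc))
num_docs = len(corpus_tokens)


def tfidf_vector(tokens):
    """Return {token: weight} with weight = tf * log2(N / df), L2-normalized."""
    # Tokens never seen in the corpus have no idf and are dropped.
    counts = Counter(t for t in tokens if t in doc_freq)
    weights = {t: tf * math.log2(num_docs / doc_freq[t]) for t, tf in counts.items()}
    # A token present in every document has idf 0 and carries no information.
    weights = {t: w for t, w in weights.items() if w > 0.0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


query_vec = tfidf_vector(query_tokens)
sims = [cosine(query_vec, tfidf_vector(doc)) for doc in corpus_tokens]
print({t: round(w, 4) for t, w in query_vec.items()})
print([round(s, 4) for s in sims])
```

Under these assumptions the weights for 在, 喜欢 and 九寨沟 and the three similarity scores agree with the values printed by the gensim script to about four decimal places. 我 appears in all three documents, so its idf is log2(3/3) = 0 and it drops out of every vector.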