Computing Text Similarity with gensim

Latest recommended article published on 2024-08-07 17:50:00

hejp_123

Category: Natural Language. Tags: gensim, text similarity

1. gensim workflow

[Figure: gensim workflow diagram (image not available)]

2. Implementation

from gensim import corpora, models, similarities
import jieba

# Text collection and search query
texts = ['吃鸡这里所谓的吃鸡并不是真的吃鸡，也不是我们常用的谐音词刺激的意思',
         '而是出自策略射击游戏《绝地求生：大逃杀》里的台词',
         '我吃鸡翅，你吃鸡腿']
keyword = '玩过吃鸡？今晚一起吃鸡'
# 1. Tokenize each text in the collection into a token list
texts = [jieba.lcut(text) for text in texts]
# 2. Build a dictionary from the tokenized texts and get its feature count
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
# 3. Use the dictionary to turn the token lists into sparse vectors (the corpus)
corpus = [dictionary.doc2bow(text) for text in texts]
# 4. Fit a TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# 5. Likewise, use the dictionary to turn the search query into a sparse vector
kw_vector = dictionary.doc2bow(jieba.lcut(keyword))
# 6. Build an index over the TF-IDF-weighted sparse vectors
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
# 7. Compute the similarity between the query and every text
sim = index[tfidf[kw_vector]]
for i, score in enumerate(sim):
    print('Similarity between keyword and text%d: %.2f' % (i + 1, score))

Output

Similarity between keyword and text1: 0.62
Similarity between keyword and text2: 0.00
Similarity between keyword and text3: 0.12

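3. Step-by-step breakdown

3.1 Generating the token list

Run Chinese word segmentation on each text in the collection. The result is a token list of the following form:

['word1', 'word2', 'word3', …]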

import jieba

text = '七月七日长生殿，夜半无人私语时。'
# lcut segments the text and returns the tokens as a list
words = jieba.lcut(text)
print(words)

Output

['七月', '七日', '长生殿', '，', '夜半', '无人', '私语', '时', '。']

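3.2 Building a `dictionary` from the text collection and getting the feature count

corpora.Dictionary: builds the dictionary
len(dictionary.token2id): the number of distinct tokens in the dictionary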

from gensim import corpora
import jieba

# Text collection
text1 = '坚果果实'
text2 = '坚果实在好吃'
texts = [text1, text2]
# Tokenize each text into a token list
texts = [jieba.lcut(text) for text in texts]
print('Texts:', texts)
# Build the dictionary from the tokenized texts
dictionary = corpora.Dictionary(texts)
print('Dictionary:', dictionary)
# Get the feature count (number of distinct tokens)
feature_cnt = len(dictionary.token2id)
print('Feature count: %d' % feature_cnt)

Output

Texts: [['坚果', '果实'], ['坚果', '实在', '好吃']]
Dictionary: Dictionary(4 unique tokens: ['坚果', '果实', '好吃', '实在'])
Feature count: 4

3.3 Building the `corpus` from the dictionary

The corpus is simply a list of sparse vectors, one per document.

from gensim import corpora
import jieba

text1 = '来东京吃东京菜'
text2 = '东京啊东京啊东京'
texts = [text1, text2]
texts = [jieba.lcut(text) for text in texts]
dictionary = corpora.Dictionary(texts)
print('Dictionary (token -> id):', dictionary.token2id)
# Build the corpus: one bag-of-words sparse vector per text
corpus = [dictionary.doc2bow(text) for text in texts]
print('Corpus:', corpus)

Output

Dictionary (token -> id): {'东京': 0, '吃': 1, '来': 2, '菜': 3, '啊': 4}
Corpus: [[(0, 2), (1, 1), (2, 1), (3, 1)], [(0, 3), (4, 2)]]
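To see what these sparse vectors stand for, the sketch below expands each one into an ordinary dense count vector with one slot per dictionary entry. The helper name `to_dense` is hypothetical and not part of gensim; the corpus values are copied from the output above.

```python
def to_dense(sparse_vec, num_features):
    """Expand [(token_id, count), ...] into a dense list of counts."""
    dense = [0] * num_features
    for token_id, count in sparse_vec:
        dense[token_id] = count
    return dense

# Corpus from the example above; the dictionary has 5 tokens (ids 0..4)
corpus = [[(0, 2), (1, 1), (2, 1), (3, 1)], [(0, 3), (4, 2)]]
dense_rows = [to_dense(vec, 5) for vec in corpus]
for row in dense_rows:
    print(row)
```

The sparse form stores only the non-zero entries, which matters once the dictionary holds thousands of tokens and each document uses only a few of them.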

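How doc2bow produces a sparse vector

1. Take the set of distinct words and map each word to its ID (the IDs come from the dictionary)

Take ['东京', '啊', '东京', '啊', '东京'] as an example
Word IDs: 东京 → 0; 啊 → 4
The list becomes: [0, 4, 0, 4, 0]

2. Convert to a sparse vector

ID 0 occurs 3 times, giving (0, 3)
ID 4 occurs 2 times, giving (4, 2)
Final result: [(0, 3), (4, 2)]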
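The two steps above can be sketched in plain Python. This only illustrates the counting logic and is not gensim's actual implementation; the function name `doc2bow_sketch` is hypothetical, and the `token2id` mapping is copied from the example output in this section. Tokens missing from the mapping are skipped here, which is also how `doc2bow` treats unknown words by default.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count how often each known token id occurs in the token list."""
    # Step 1: replace each known word by its id; unknown words are skipped
    ids = [token2id[t] for t in tokens if t in token2id]
    # Step 2: count the ids and return (id, count) pairs sorted by id
    return sorted(Counter(ids).items())

token2id = {'东京': 0, '吃': 1, '来': 2, '菜': 3, '啊': 4}
bow = doc2bow_sketch(['东京', '啊', '东京', '啊', '东京'], token2id)
print(bow)
```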

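3.4 Processing the corpus with the `TF-IDF` model and building an `index`

TF-IDF is a statistical measure of how important a word is to one document within a document collection or corpus.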
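To make the definition concrete, here is a minimal sketch of one common TF-IDF weighting: raw term count times log2(N / df), followed by L2 normalization of each document vector. To my knowledge this matches the default settings of gensim's `TfidfModel`, but treat that as an assumption rather than a guarantee. The function name `tfidf_weights` is hypothetical, and the input is the bag-of-words corpus from section 3.3.

```python
import math
from collections import Counter

def tfidf_weights(corpus):
    """corpus: list of bag-of-words vectors [(token_id, count), ...]."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each token id appears
    df = Counter(token_id for doc in corpus for token_id, _ in doc)
    weighted = []
    for doc in corpus:
        # tf * idf, with idf = log2(N / df)
        vec = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in doc]
        # Tokens that occur in every document get weight 0 and are dropped
        vec = [(tid, w) for tid, w in vec if w > 0.0]
        # Scale the vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

corpus = [[(0, 2), (1, 1), (2, 1), (3, 1)], [(0, 3), (4, 2)]]
weighted = tfidf_weights(corpus)
for vec in weighted:
    print(vec)
```

Token 0 (东京) appears in both documents, so its weight is zero and it contributes nothing to the similarity, however often it occurs.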

from gensim import corpora, models, similarities
import jieba

text1 = '南方医院无痛人流'
text2 = '北方人流浪到南方'
texts = [text1, text2]
texts = [jieba.lcut(text) for text in texts]
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
# Fit a TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
# Build an index over the TF-IDF-weighted corpus
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
print(tfidf)

Output

TfidfModel(num_docs=2, num_nnz=9)

3.5 Converting the search query into a sparse vector with the dictionary

from gensim import corpora
import jieba

text1 = '南方医院无痛人流'
text2 = '北方人流落南方'
texts = [text1, text2]
texts = [jieba.lcut(text) for text in texts]
dictionary = corpora.Dictionary(texts)
# Use the dictionary to turn the search query into a sparse vector as well
keyword = '无痛人流'
kw_vector = dictionary.doc2bow(jieba.lcut(keyword))
print(kw_vector)

Output

[(0, 1), (3, 1)]

3.6 Computing similarity

from gensim import corpora, models, similarities
import jieba

text1 = '无痛人流并非无痛'
text2 = '北方人流浪到南方'
texts = [text1, text2]
keyword = '无痛人流'
texts = [jieba.lcut(text) for text in texts]
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
new_vec = dictionary.doc2bow(jieba.lcut(keyword))
# Build the index used for the similarity computation
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
print('\nSparse vectors of the corpus under the TF-IDF model:')
for vec in tfidf[corpus]:
    print(vec)
print('\nSparse vector of the keyword under the TF-IDF model:')
print(tfidf[new_vec])
print('\nSimilarity:')
sim = index[tfidf[new_vec]]
for i, score in enumerate(sim):
    print('Similarity with sentence', i + 1, ':', score)
[Figure: output of the code above (image not available)]
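`SparseMatrixSimilarity` scores the query against each indexed document by cosine similarity. The sketch below shows that computation on sparse vectors in plain Python. It is an illustration, not gensim's implementation; the function name `cosine_similarity` and the sample vectors are hypothetical and are not taken from the example above.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors [(token_id, weight), ...]."""
    a, b = dict(vec_a), dict(vec_b)
    # Only token ids present in both vectors contribute to the dot product
    dot = sum(w * b[tid] for tid, w in a.items() if tid in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF-weighted documents and query
docs = [[(0, 0.6), (1, 0.8)], [(2, 1.0)]]
query = [(0, 1.0)]
sims = [cosine_similarity(query, doc) for doc in docs]
for i, score in enumerate(sims):
    print('Similarity with document %d: %.2f' % (i + 1, score))
```

A document that shares no weighted tokens with the query scores 0, which is why text2 in section 2 comes out at 0.00.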

4. Appendix

Further reading
jieba Chinese word segmentation
 Chinese LDA model
 Text similarity analysis (matrix version)
Glossary

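Term	Meaning
corpus	n. a collection of texts (语料库); plural: `corpora`
sparse	adj. having mostly zero or empty entries (稀疏)
vector	n. an ordered list of numbers (矢量)
Sparse Matrix Similarity	similarity computed over a sparse matrix (稀疏矩阵相似性)
word2vec	word to vector
doc2bow	document to `bag of words` (词袋)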