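Introduction
An inverted index works much like full-text search: you enter a word and get back the articles that contain it.
The basic idea is to collect the words of every article, build a dictionary keyed by the de-duplicated words, and then, for each word, record the articles it appears in. At query time you are really just looking the word up in that dictionary!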
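To make the "a query is just a dictionary lookup" point concrete, here is a minimal sketch. The two-article corpus, the hand-written `index` dictionary and the `search` helper are all hypothetical names invented for illustration; they are not part of the code later in this post.

```python
# Hypothetical toy corpus: article id -> text.
articles = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
}

# The inverted index for that corpus, written out by hand:
# each distinct word maps to the ids of the articles containing it.
index = {
    "the": ["doc1", "doc2"],
    "cat": ["doc1", "doc2"],
    "sat": ["doc1"],
    "on": ["doc1"],
    "mat": ["doc1"],
    "dog": ["doc2"],
    "chased": ["doc2"],
}


def search(word):
    """Return the ids of articles containing `word`, or [] if it is unknown."""
    return index.get(word, [])


print(search("cat"))   # ['doc1', 'doc2']
print(search("dog"))   # ['doc2']
print(search("fish"))  # []
```

Note that the query never touches the article text: all the work was done when the index was built.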
Code
Word set (vocabulary)
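def Split_words(Article_all):
    """Collect the set of distinct words across all articles."""
    all_words = []
    for i in Article_all.values():
        # cut = jieba.cut(i)  # alternative: jieba word segmentation, e.g. for Chinese text
        cut = i.split()  # split on whitespace
        all_words.extend(cut)
    set_all_words = set(all_words)  # de-duplicate
    return set_all_words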
Building the inverted index
def Invert_index(Article_all, set_all_words):
    """Map each word to the list of ids of the articles that contain it."""
    invert_index = dict()
    for b in set_all_words:
        temp = []  # ids of the articles that contain word b
        for j in Article_all.keys():
            field = Article_all[j]
            split_field = field.split()
            if b in split_field:
                temp.append(j)
        invert_index[b] = temp
    # print(invert_index)
    return invert_index
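Invert_index re-splits every article once per word, which gets slow as the vocabulary grows. As a possible alternative, the sketch below builds the same word-to-article-ids mapping in a single pass over the articles. The function name `build_index_single_pass` and the sample data are hypothetical, and it assumes the same whitespace tokenisation as above.

```python
from collections import defaultdict


def build_index_single_pass(article_all):
    """Build word -> list of article ids, splitting each article exactly once."""
    index = defaultdict(list)
    for article_id, text in article_all.items():
        # set() so an article is listed once per word even if the word repeats
        for word in set(text.split()):
            index[word].append(article_id)
    return dict(index)


# Hypothetical sample data: article id -> text.
sample = {1: "apple banana apple", 2: "banana cherry"}
index = build_index_single_pass(sample)
print(index["banana"])  # [1, 2]
print(index["apple"])   # [1]
```

Because the articles are visited in dictionary order, each list of ids comes out in the same order that Invert_index would produce, and no separate vocabulary pass is needed.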
Main
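# Article_all: all articles, a dict mapping article id -> text (the sample data here is illustrative only)
Article_all = {1: "hello world", 2: "hello python"}
# set_all_words: the de-duplicated set of words
# First call Split_words
set_all_words = Split_words(Article_all)
# invert_index: the index dictionary (word -> list of article ids)
# Then call Invert_index
invert_index = Invert_index(Article_all, set_all_words)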