This article implements an inverted index in Python.
Suppose a data table docu_set holds three documents, d1, d2, and d3, as follows:
docu_set = {
    'd1': 'i love shanghai',
    'd2': 'i am from shanghai now i study in tongji university',
    'd3': 'i am from lanzhou now i study in lanzhou university of science and technolgy',
}
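Next, we use this table to build a simple search engine based on an inverted index.
First, tokenize every document to obtain the set of distinct words (the vocabulary) across all documents: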
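all_words = []
for i in docu_set.values():
    # For Chinese text, a tokenizer such as jieba could be used instead:
    # cut = jieba.cut(i)
    cut = i.split()          # the sample documents are English, so split on whitespace
    all_words.extend(cut)    # collect the tokens of every document
set_all_words = set(all_words)   # deduplicate to get the vocabulary
print(set_all_words)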
This prints the set of distinct words (the order of elements in a set may vary between runs):
{'now', 'study', 'shanghai', 'am', 'in', 'university', 'and', 'from', 'tongji', 'i', 'of', 'lanzhou', 'love', 'technolgy', 'science'}
Build the inverted index
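As a minimal sketch of the idea (not necessarily the exact construction this article uses), an inverted index maps each word to the documents that contain it, so a query only has to look up its words instead of scanning every document. The function names build_inverted_index and search below are hypothetical, chosen for illustration, and the sketch assumes the same whitespace tokenization as above and AND semantics for multi-word queries.

```python
from collections import defaultdict

# Same sample data as in the article (the spelling 'technolgy' is kept as-is).
docu_set = {
    'd1': 'i love shanghai',
    'd2': 'i am from shanghai now i study in tongji university',
    'd3': 'i am from lanzhou now i study in lanzhou university of science and technolgy',
}

def build_inverted_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():      # whitespace tokenization (assumption)
            index[word].add(doc_id)    # a set ignores repeated words in one document
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))   # intersect posting lists
    return sorted(result)

inverted_index = build_inverted_index(docu_set)
print(inverted_index['shanghai'])               # ['d1', 'd2']
print(search(inverted_index, 'study lanzhou'))  # ['d3']
```

Storing document ids in a set keeps each posting list free of duplicates, and intersecting posting lists is one simple way to answer multi-word queries; a ranking step could be layered on top.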
invert