Implementing an Inverted Index in Python to Build a Simple Search Engine

luoganttcc

Article link: https://blog.csdn.net/luoganttcc/article/details/89843699

This article implements an inverted index in Python.

Suppose a table docu_set holds three documents, d1, d2, and d3:

docu_set={'d1':'i love shanghai',
          'd2':'i am from shanghai now i study in tongji university',
          'd3':'i am from lanzhou now i study in lanzhou university of science  and  technolgy',}

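Next we use this table to build a simple search engine based on an inverted index.
First, tokenize every document and collect the set of all distinct words (the vocabulary).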

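all_words = []
for text in docu_set.values():
    # cut = jieba.cut(text)   # for Chinese text, a segmenter such as jieba would be needed
    cut = text.split()        # whitespace splitting is enough for this English sample
    all_words.extend(cut)

# Deduplicate the tokens to get the vocabulary
set_all_words = set(all_words)
print(set_all_words)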

The resulting word set (the print order may differ, since sets are unordered):

{'now', 'study', 'shanghai', 'am', 'in', 'university', 'and', 'from', 'tongji', 'i', 'of', 'lanzhou', 'love', 'technolgy', 'science'}
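The whitespace split above works because the sample text is already lowercase and has no punctuation. For less tidy input, a slightly more defensive tokenizer could be sketched as follows. The names tokenize and docs, and the sample sentences, are illustrative assumptions and not part of the original article.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (hypothetical helper)."""
    # \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
    return re.findall(r"\w+", text.lower())

# Illustrative documents with mixed case and punctuation (assumed data)
docs = {
    "d1": "I love Shanghai!",
    "d2": "I am from Shanghai.",
}

vocabulary = set()
for text in docs.values():
    vocabulary.update(tokenize(text))

print(sorted(vocabulary))
```

With this normalization, "Shanghai!" and "Shanghai." both map to the single token shanghai, so they would share one entry in the index.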

Build the inverted index

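invert_index = dict()
for b in set_all_words:
    temp = []  # IDs of the documents that contain word b
    for j in docu_set.keys():
        field = docu_set[j]
        split_field = field.split()
        # Compare against whole tokens, not substrings of the raw text
        if b in split_field:
            temp.append(j)
    invert_index[b] = temp
print(invert_index)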

The inverted index is as follows:

{'now': ['d2', 'd3'],
 'study': ['d2', 'd3'],
 'shanghai': ['d1', 'd2'],
 'am': ['d2', 'd3'],
 'in': ['d2', 'd3'],
 'university': ['d2', 'd3'],
 'and': ['d3'],
 'from': ['d2', 'd3'],
 'tongji': ['d2'],
 'i': ['d1', 'd2', 'd3'],
 'of': ['d3'],
 'lanzhou': ['d3'],
 'love': ['d1'],
 'technolgy': ['d3'],
 'science': ['d3']}
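The nested loop above splits every document again for each vocabulary word. As an alternative, the same index can be built in a single pass over the documents. This is only a minimal sketch: build_inverted_index is a hypothetical helper name, and the approach relies on Python 3.7+ dictionaries preserving insertion order so that document IDs come out as d1, d2, d3.

```python
from collections import defaultdict

docu_set = {
    'd1': 'i love shanghai',
    'd2': 'i am from shanghai now i study in tongji university',
    'd3': 'i am from lanzhou now i study in lanzhou university of science  and  technolgy',
}

def build_inverted_index(documents):
    """Map each word to the list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # set() drops repeated words, so a document ID is appended once per word
        for word in set(text.split()):
            index[word].append(doc_id)
    return dict(index)

invert_index = build_inverted_index(docu_set)
print(invert_index['shanghai'])  # ['d1', 'd2']
```

Each document is split only once here, so the work grows with the total number of tokens rather than with vocabulary size times document count.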

Full-text search for 'university'