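Intelligent Information Retrieval: Merging Postings Lists with Skip Pointers


For a roundup of Python implementations of selected experiments from Introduction to Information Retrieval, see this blog.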

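1. Objective

Master the skip-pointer-based postings list merge algorithm used in search systems.

2. Tasks and Requirements

Gain a thorough understanding of the skip-pointer-based merge algorithm that intersects postings lists in a search system, and implement it in Python. When the user enters a query at the prompt, the program merges the relevant postings lists using skip pointers and prints the result.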

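3. Experiment Specification

⑴ Functional description

The system reads a preset document and lists all queryable terms. The user enters query terms at the prompt. The system computes the postings list of every term and derives the skip length from each list's length: for a postings list of length L, a skip pointer is placed every √L entries. It then runs the skip-pointer-based merge algorithm and prints the merged result.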
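
The √L rule can be made concrete with a short sketch. The helper name skip_positions is an illustrative assumption and not part of the program below; it also assumes that a pointer is placed only where its target index still lies inside the list.

```python
from math import sqrt

def skip_positions(length):
    """Return (stride, indices that would carry a skip pointer) for a postings list."""
    if length == 0:
        return 0, []
    stride = round(sqrt(length))
    # A pointer at index i targets i + stride, so only indices with a valid target qualify.
    return stride, [i for i in range(0, length, stride) if i + stride < length]

stride, positions = skip_positions(16)
print(stride, positions)  # 4 [0, 4, 8]
```

For a list of 16 postings the stride is 4, so indices 0, 4 and 8 would carry pointers to indices 4, 8 and 12.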

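⑵ High-level design

The program consists of two functional modules: a prompt-and-input module and a skip-pointer-based postings list computation module.

⑶ Detailed design

Overall flowchart:
Figure 1: Overall flowchart

Flowcharts of the functional modules:

  • Prompt-and-input module

Figure 2: Prompt-and-input module

  • Skip-pointer-based postings list computation module

Figure 3: Skip-pointer-based postings list computation module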

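⑷ Code implementation

  • Return the postings list of a term

The find function returns the postings lists for the terms the user entered, one list per AND-ed clause of the query, for use in the merge step.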

'''
Return the postings list of each AND-ed clause of the query
'''
def find(test, dict1, dict0):
    # Parentheses only separate clauses; every clause is combined with AND.
    ft0 = re.split('[()]', test)
    ft = []
    for i in range(len(ft0)):
        ft = ft + ((ft0[i].replace(' ', '')).split("AND"))
    ft = [i for i in ft if i != '']
    p = []
    for j in range(len(ft)):
        p0 = []
        if('OR' in ft[j]):
            # Union of the postings lists of all OR-ed terms.
            ft1 = ft[j].split('OR')
            for k in range(len(ft1)):
                if(ft1[k] in dict1):
                    p0 = p0 + dict1[ft1[k]]
            # Keep the union sorted: the merge step relies on ascending docIDs.
            p0 = sorted(set(p0))
        elif('NOT' in ft[j]):
            # Complement against the list of all docIDs.
            ft[j] = ft[j].replace('NOT', '')
            if(ft[j] in dict1):
                p0 = [y for y in dict0 if y not in dict1[ft[j]]]
            else:
                p0 = dict0
        elif(ft[j] in dict1):
            p0 = dict1[ft[j]]
        p.append(p0)
    return p
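
To see what the first few statements of find do to a query, the splitting step can be isolated in a small sketch. The function name split_clauses is an illustrative assumption; the logic mirrors the re.split, replace and split("AND") calls used in find.

```python
import re

def split_clauses(query):
    """Flatten a query into AND-ed clauses, mirroring the splitting steps in find."""
    clauses = []
    for part in re.split('[()]', query):
        # Drop spaces, then cut the fragment at every AND.
        clauses += part.replace(' ', '').split('AND')
    return [c for c in clauses if c != '']

print(split_clauses('(things OR who) AND (mean AND NOT always)'))
# ['thingsORwho', 'mean', 'NOTalways']
```

Each resulting clause is then turned into one postings list: a clause containing OR becomes a union, a clause containing NOT becomes a complement, and a plain term is looked up directly. Because parentheses only cut the string and all clauses are AND-ed, a query such as a OR (b AND c) would not be evaluated with its usual precedence.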

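  • Skip-pointer postings list merge algorithm

The Intersect function is the skip-pointer-based computation module. It uses len to obtain the number of postings lists and iterates over them by index. For each list it walks through the elements and intersects the list with the previous merge result using skip pointers, then returns the final result list. The skip lengths are computed by the statement sk1, sk2 = round(sqrt(lp)), round(sqrt(lr)).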

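'''
Skip-pointer postings list merge algorithm
'''
def Intersect(p):
    # An empty query yields no postings lists, so there is nothing to merge.
    if not p:
        return []
    r = p[0]
    for i in range(1, len(p)):
        j, k = 0, 0
        r0 = []
        lp, lr = len(p[i]), len(r)
        # Skip length is the rounded square root of each list's length.
        sk1, sk2 = round(sqrt(lp)), round(sqrt(lr))
        while(j < lp and k < lr):
            if(p[i][j] == r[k]):
                r0.append(r[k])
                j, k = j + 1, k + 1
            elif(p[i][j] < r[k]):
                # Follow skips in p[i] while the skip target does not overshoot r[k].
                if(((j + sk1) < lp) and (p[i][j + sk1] <= r[k])):
                    while(((j + sk1) < lp) and (p[i][j + sk1] <= r[k])):
                        j = j + sk1
                else:
                    j = j + 1
            else:
                # Follow skips in r while the skip target does not overshoot p[i][j].
                if(((k + sk2) < lr) and (r[k + sk2] <= p[i][j])):
                    while(((k + sk2) < lr) and (r[k + sk2] <= p[i][j])):
                        k = k + sk2
                else:
                    k = k + 1
        r = r0
    return r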
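
Intersect allows a skip from any index. For comparison, here is a minimal sketch of the stricter variant in which skip pointers exist only at every stride-th index, as the √L placement rule suggests. The function name intersect_with_skips and the two sample lists are illustrative assumptions, not part of the experiment code.

```python
from math import sqrt

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, skipping only from pointer positions."""
    s1 = round(sqrt(len(p1))) or 1
    s2 = round(sqrt(len(p2))) or 1
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # A skip pointer exists only at indices that are multiples of the stride.
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer

a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))  # [2, 8]
```

Both variants return the same intersection on sorted input; they differ only in how often a skip can be attempted.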
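  • Build the document dictionary

The createdict function uses Python's re library for string processing. It processes the preset document, collects all terms so the user can be shown which terms are available, and computes the postings list of every term.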

'''
Build the document dictionary
'''
def createdict(f0):
    # Simulate a document collection: each line (paragraph) is one document.
    d = f0.split('\n')
    dict1 = {}
    dict0 = []
    for i in range(len(d)):
        dict0.append(i + 1)
        # Tokenize the paragraph and keep its distinct non-empty strings.
        # Matching whole tokens avoids substring hits such as "be" in "because".
        for word in set(re.split('[ \n?!,.;]', d[i])):
            if word == '':
                continue
            if word not in dict1:
                dict1[word] = [i + 1]
            else:
                dict1[word].append(i + 1)
    return dict1, dict0
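
One detail that affects the postings lists is whether a term is matched as a whole token or as a substring of the paragraph. The sketch below contrasts the two; the toy documents and both helper names are illustrative assumptions.

```python
import re

docs = ["you will be happy", "because you want to"]  # toy documents (assumed)

def postings_substring(term):
    # Substring test: "be" also matches inside "because".
    return [i + 1 for i, text in enumerate(docs) if term in text]

def postings_token(term):
    # Whole-token test: split on the same delimiters first, then compare tokens.
    return [i + 1 for i, text in enumerate(docs)
            if term in re.split('[ \n?!,.;]', text)]

print(postings_substring("be"))  # [1, 2]
print(postings_token("be"))      # [1]
```

With substring matching, short terms pick up documents that only contain them inside longer words, which inflates their postings lists.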
  • Remaining code
"""
d is the document, ft is the find text, r is the result, dict1 is the original dictionary, dict0 is the list of all document IDs
"""

import re
from math import sqrt
'''
Open the document
'''
with open("document.txt", "r") as f:
    f0 = f.read()
dict1, dict0 = createdict(f0)
k = [key for key in dict1]
print("Queryable terms:", k, "\n")
print("Enter a standard query in the textbook's form: ", end = '')
ft0 = input()
# Sample query kept from debugging; it is not used below.
test = '(things OR who) AND (mean AND NOT always)'
p = find(ft0, dict1, dict0)
print("\nPostings lists:", p)
print("Merge result:", Intersect(p))

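The print calls display the prompts, input reads the user's query string, and the statement p = find(ft0, dict1, dict0) then computes the postings lists. The test line is left over from debugging. The sample document.txt is shown below; any English text should work in its place.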

There are moments in life when you miss someone so much that you just want to pick them from your dreams and hug them for real! Dream what you want to dream;go where you want to go;be what you want to be,because you have only one life and one chance to do all the things you want to do.
May you have enough happiness to make you sweet,enough trials to make you strong,enough sorrow to keep you human,enough hope to make you happy? Always put yourself in others’shoes.If you feel that it hurts you,it probably hurts the other person, too.
The happiest of people don’t necessarily have the best of everything;they just make the most of everything that comes along their way.Happiness lies for those who cry,those who hurt, those who have searched,and those who have tried,for only they can appreciate the importance of people
Who have touched their lives.Love begins with a smile,grows with a kiss and ends with a tear.The brightest future will always be based on a forgotten past, you can’t go on well in life until you let go of your past failures and heartaches.
When you were born,you were crying and everyone around you was smiling.Live your life so that when you die,you’re the one who is smiling and everyone around you is crying.
Please send this message to those people who mean something to you,to those who have touched your life in one way or another,to those who make you smile when you really need it,to those that make you see the brighter side of things when you are really down,to those who you want to let them know that you appreciate their friendship.And if you don’t, don’t worry,nothing bad will happen to you,you will just miss out on the opportunity to brighten someone’s day with this message.

4. Results

Entering the query miss AND side produces the merge result shown in the figure below.

Figure 4: Merge result
