Information Retrieval and Data Mining | [Lab] Inverted Index and Boolean Queries

啦啦右一

Last modified: 2023-10-08 16:43:51

Category column: 右一的实操记录合集. Tags: data mining, artificial intelligence, information retrieval

First published: 2023-09-26 14:54:12

Article link: https://blog.csdn.net/m0_63398413/article/details/133284758

Contents

📚 Lab Task
📚 Lab Steps
📚 Issues Encountered

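📚 Lab Task

Build an inverted index on the tweets dataset.
Boolean Retrieval Model:
- Input: a query (like Ron and Weasley)
- Output: print the qualified tweets.
- Implement one function for each of the four query types: and, or, not, and not. Query optimization is not required.
Apply the same preprocessing to both the tweets and the queries.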

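📚 Lab Steps

🐇 Tokenization and preprocessing

Convert the input tweet document to lowercase. Doing this uniformly makes later queries case-insensitive.
Search the tweet document for specific field markers to find the position indices of the key fields, then extract the tweetid and text content.
Tokenize the extracted text and convert each word to its singular form: TextBlob(document).words.singularize()
Lemmatize the resulting word list, mainly to reduce verbs to their base form.
Filter out ["text", "tweetid"], add the remaining valid terms to the final result list, and return it.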

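# Tokenization and preprocessing
from textblob import TextBlob
from textblob import Word

# Field-name tokens that are not real terms
uselessTerm = ["text", "tweetid"]

def tokenize_tweet(document):
    # Lowercase everything so that queries are case-insensitive
    document = document.lower()
    # Find the position indices of the key fields from their markers
    # The -1 and -3 offsets adjust whether the surrounding quotes and commas fall inside the slice
    a = document.index("tweetid") - 1
    b = document.index("errorcode") - 1
    c = document.index("text") - 1
    d = document.index("timestr") - 3
    # Keep only the tweetid and text parts of the tweet document
    document = document[a:b] + document[c:d]
    # print(document)
    # Tokenize and convert each word to its singular form
    terms = TextBlob(document).words.singularize()
    # print(terms)
    # Lemmatize each token and keep only the valid terms
    result = []
    for word in terms:
        # Wrap the current token in a Word object
        expected_str = Word(word)
        # Reduce verbs to their base form
        expected_str = expected_str.lemmatize("v")
        if expected_str not in uselessTerm:
            # Skip ["text", "tweetid"]; append everything else to result
            result.append(expected_str)
    # print(result)
    return result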
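
The slicing above relies on the fields appearing in a fixed order within each line. If each line of Tweets.txt happens to be a JSON object (an assumption; the raw format is not shown in this post), a less position-sensitive way to pull out the same two fields might look like the sketch below. The field names `tweetId` and `text` are inferred from the markers used above, and `extract_fields` and the sample line are hypothetical.

```python
import json

def extract_fields(line):
    """Return (tweetid, lowercased text) from one JSON-encoded tweet line."""
    record = json.loads(line)
    # Field names are assumed from the "tweetid"/"text" markers used in the post.
    return str(record["tweetId"]), record["text"].lower()

# Hypothetical sample line, for illustration only.
sample = ('{"userName": "someone", "text": "Ron and Hermione", '
          '"timeStr": "t", "tweetId": "100", "errorCode": "200"}')
tweetid, text = extract_fields(sample)
print(tweetid, text)
```

The extracted text could then be passed through the same TextBlob singularization and lemmatization steps as before.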

[Screenshot: extraction result]
[Screenshot: tokenization result]
[Screenshot: final preprocessing result]

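🐇 Building the inverted index: each key is a valid term, and its value is the list of tweetids that contain it

Open "Tweets.txt" and read its content.
Split the content by line, so that each line is one tweet, and store the lines in the list lines.
Iterate over every line in lines.
- Preprocess the line by calling tokenize_tweet, which tokenizes and normalizes the tweet document.
- Take the first element of the processed term list line, which is the tweetid of the current tweet, and remove it from the list.
- Convert the processed term list line into the set unique_terms to remove duplicates.
- Iterate over every term in unique_terms.
  - Check whether the term is already a key of the postings dictionary.
  - If it is, append the tweetid of the current tweet to the list stored under that key.
  - If it is not, create a new key-value pair with the term as the key and a list containing the current tweetid as the value.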

# Build the inverted index: each key is a valid term, and its value is
# the list of tweetids of the tweet documents that contain the term
def get_postings():
    global postings
    # Read the file; each tweet becomes one element of lines
    with open(r"Tweets.txt") as content:
        lines = content.readlines()
    for line in lines:
        # Preprocess
        line = tokenize_tweet(line)
        # The first element of the processed term list is the tweetid
        tweetid = line[0]
        # Remove it so it is not indexed as a term
        line.pop(0)
        # Convert the term list to a set to keep each term once
        unique_terms = set(line)
        # Visit every term
        for key_word in unique_terms:
            if key_word in postings.keys():
                # The term is already a key of postings:
                # append the tweetid of the current tweet
                postings[key_word].append(tweetid)
            else:
                # The term is new:
                # create a new key-value pair
                postings[key_word] = [tweetid]
    # data = open("postings.txt", 'w', encoding="utf-8")
    # print(postings, file=data)
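
To make the resulting structure concrete, here is a minimal sketch of the same indexing loop on a toy collection. It uses `defaultdict(list)` instead of the explicit key check, which is a substitution for brevity rather than the code used in this lab; `build_index` and the toy documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the list of ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id, terms in docs:
        # set() ensures each document is listed at most once per term
        for term in set(terms):
            index[term].append(doc_id)
    return index

# Toy documents as (id, already-tokenized terms); ids are visited in ascending order.
docs = [(1, ["ron", "weasley", "ron"]), (2, ["harry", "ron"]), (3, ["harry"])]
toy = build_index(docs)
print(dict(toy))
```

Because the toy ids are visited in ascending order, plain appending already yields sorted posting lists here.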

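The AND merge used later requires each id list to be in ascending order, so a sort was always part of the plan; at first glance, though, the original dataset looked as if it were already sorted.
The postings output shows that the sort is in fact necessary: the list for the very first term, rand, is not in strictly ascending order.
So the code was improved to keep every tweetid list in ascending order, using the insort function from the bisect module to insert each new tweetid at its sorted position. This avoids re-sorting the whole list with sorted() after every insertion.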
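
As a small illustration of what `bisect.insort` does, the sketch below inserts values into an already sorted list. The sample ids are made up. Note that the tweetids in this lab are strings, so their order is lexicographic rather than numeric; that is fine for merging as long as every posting list uses the same order.

```python
import bisect

ids = [10, 30, 50]
bisect.insort(ids, 40)   # inserted between 30 and 50
bisect.insort(ids, 5)    # inserted at the front
print(ids)

# Strings compare character by character, so "10" sorts before "9".
str_ids = sorted(["9", "10"])
print(str_ids)
```

`insort` finds the position with a binary search, but the insertion itself still shifts the elements after it, so each call is linear in the worst case.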

import bisect

# Build the inverted index: each key is a valid term, and its value is
# the list of tweetids of the tweet documents that contain the term
def get_postings():
    global postings
    # Read the file; each tweet becomes one element of lines
    with open(r"Tweets.txt") as content:
        lines = content.readlines()
    for line in lines:
        # Preprocess
        line = tokenize_tweet(line)
        # The first element of the processed term list is the tweetid
        tweetid = line[0]
        # Remove it so it is not indexed as a term
        line.pop(0)
        # Convert the term list to a set to keep each term once
        unique_terms = set(line)
        # Visit every term
        for key_word in unique_terms:
            if key_word in postings.keys():
                # The term is already a key of postings:
                # insert the tweetid at its sorted position (ascending order)
                bisect.insort(postings[key_word], tweetid)
            else:
                # The term is new:
                # create a new key-value pair
                postings[key_word] = [tweetid]
    # data = open("postings.txt", 'w', encoding="utf-8")
    # print(postings, file=data)

The posting lists generated now are in strictly ascending order.
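
Rather than inspecting the output by eye, the ordering can also be checked programmatically. The sketch below is one way to do it; `is_sorted`, `unsorted_terms`, and the toy index are hypothetical names and data.

```python
def is_sorted(ids):
    """True if every element is <= the one after it."""
    return all(a <= b for a, b in zip(ids, ids[1:]))

def unsorted_terms(index):
    """Return the terms whose posting list is not in ascending order."""
    return [term for term, ids in index.items() if not is_sorted(ids)]

toy = {"rand": ["3", "1", "2"], "ron": ["1", "2"]}
print(unsorted_terms(toy))
```

An empty result means every posting list is ready for the merge-based AND.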


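🐇 The Boolean operations

🔥 AND: A and B

First check that term1 and term2 are both keys of the global variable postings, which is the precondition for and. If either one is missing, return an empty list.
If term1 and term2 are both keys of postings, do the following:
- Get the lengths length_1 and length_2 of the posting lists of term1 and term2.
- Initialize two index variables x and y, the current positions in the posting lists of term1 and term2, both starting at 0.
- Loop while x and y are both within the bounds of their posting lists:
  - If the current elements are equal (postings[term1][x] == postings[term2][y]), the tweet with that id contains both term1 and term2. Append its tweetid to the result list answer and increment both x and y to compare the next pair.
  - If the current element of term1's list is smaller than that of term2's list, increment x to move to the next element of term1's list.
  - Otherwise, increment y to move to the next element of term2's list.
- When the loop ends, return the result list answer.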

# AND
# Find the ids of tweets that contain both the first and the second term.
def bool_and(term1, term2):
    global postings
    answer = []
    # If either term is not a key, return an empty list
    if (term1 not in postings) or (term2 not in postings):
        return answer
    # Both terms are keys of postings
    else:
        # Lengths of the two posting lists
        length_1 = len(postings[term1])
        length_2 = len(postings[term2])
        # x and y are the current positions in the two posting lists
        x = 0
        y = 0
        while x < length_1 and y < length_2:
            # Equal elements: this tweet contains both term1 and term2
            if postings[term1][x] == postings[term2][y]:
                # Add the tweetid to the result list
                answer.append(postings[term1][x])
                # Advance both x and y to compare the next pair
                x += 1
                y += 1
            # The element in term1's list is smaller than the one in term2's list
            elif postings[term1][x] < postings[term2][y]:
                # Advance x to compare the next element
                x += 1
            else:
                # Otherwise advance y
                y += 1
        return answer
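
A worked example on toy lists may help show why the two-pointer merge needs sorted input. The sketch below is a standalone version of the same merge that takes the two lists directly; `intersect_sorted` and the sample lists are hypothetical.

```python
def intersect_sorted(p1, p2):
    """Two-pointer intersection; assumes both lists are in ascending order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

good = intersect_sorted([1, 2, 4, 8], [2, 3, 8, 9])
# Same elements, but the first list is no longer sorted: the shared id 2 is missed.
bad = intersect_sorted([1, 4, 2, 8], [2, 3, 8, 9])
print(good, bad)
```

With sorted input each pointer only moves forward, so the merge takes at most len(p1) + len(p2) steps.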

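🔥 OR: A or B

First check whether term1 and term2 are keys of the global variable postings.
- If neither is present, the result is an empty list.
- If only term2 is missing, the result is a copy of term1's posting list: list(postings[term1])
- If only term1 is missing, the result is a copy of term2's posting list: list(postings[term2])
If both term1 and term2 are keys of postings, do the following:
- Start the result list answer as a copy of term1's posting list, so that every tweet id of term1 is stored first. Copying matters here: the result is extended afterwards, and the lists stored in postings must not be modified.
- Go through each item in term2's posting list and check whether it is already in the result. If it is not, append it to answer; in other words, use term2 to fill in whatever term1 missed.
- After the loop, sort answer in ascending order and return it.

# OR
# Find the ids of tweets that contain at least one of the two terms.
def bool_or(term1, term2):
    if (term1 not in postings) and (term2 not in postings):
        return []
    elif term2 not in postings:
        return list(postings[term1])
    elif term1 not in postings:
        return list(postings[term2])
    else:
        # Both terms are keys of postings: merge the two posting lists
        # Start from a copy of term1's list so postings itself is not modified
        answer = list(postings[term1])
        # Set of ids already in the result, for fast membership tests
        seen = set(answer)
        for item in postings[term2]:
            # Then fill in the ids that only term2 has
            if item not in seen:
                answer.append(item)
        # Restore ascending order
        answer.sort()
        return answer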
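
Since the posting lists are kept sorted, the union can also be computed with the same two-pointer pattern as AND, which produces a sorted result without a separate sort. This is a sketch of that alternative, not the code used in this lab; `union_sorted` and the sample lists are hypothetical.

```python
def union_sorted(p1, p2):
    """Two-pointer union of two ascending lists; returns a new ascending list."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # At most one of these tails is non-empty
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer

a = [1, 3, 5]
b = [2, 3, 6]
merged = union_sorted(a, b)
print(merged)
```

The inputs are only read, never modified, so the index stays intact across queries.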

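🔥 NOT: not A

Collect the ids of all indexed tweets by taking the union of every posting list in postings.
Remove the ids found in postings[term1] from that set. If term1 is not a key of postings, nothing is removed.
Return the remaining ids as a list in ascending order: sorted(all_ids - excluded). Testing individual tweet ids, rather than whole posting lists, ensures that a tweet without term1 is still returned when all of its terms also occur in tweets that do contain term1.

# NOT
# Return the ids of all tweets that do not contain term1
def bool_not(term1):
    # Collect the id of every indexed tweet
    all_ids = set()
    for ids in postings.values():
        all_ids.update(ids)
    # Ids of the tweets that contain term1 (none if term1 is not indexed)
    excluded = set(postings.get(term1, []))
    # Everything else, in ascending order
    return sorted(all_ids - excluded)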
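
One way to express NOT is as the complement of a posting list relative to the set of all indexed tweet ids. The sketch below shows this on a toy index; `complement` and the toy data are hypothetical, and the function takes the index as a parameter instead of using a global.

```python
def complement(index, term):
    """Ids of all indexed documents that do not contain term, ascending."""
    all_ids = set()
    for ids in index.values():
        all_ids.update(ids)
    return sorted(all_ids - set(index.get(term, [])))

toy = {"ron": [1, 2], "harry": [2, 3], "the": [1, 2, 3, 4]}
without_ron = complement(toy, "ron")
print(without_ron)
```

In this toy index, documents 3 and 4 only contain terms that also occur in documents with "ron", so a check made per posting list rather than per id would miss both of them.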

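🔥 AND NOT: A and not B

If term1 is not a key of postings, it has no posting list, so return an empty list.
If term2 is not a key of postings, it has no posting list, so return a copy of term1's posting list.
If both terms are keys of postings, compare term1's posting list against term2's:
- Store the ids of term2's posting list in a set, excluded, for fast membership tests.
- Iterate over each tweetid in term1's posting list.
- If tweetid is not in excluded, the tweet contains term1 but not term2, so append tweetid to the result list answer.
```
excluded = set(postings[term2])
answer = []
for tweetid in postings[term1]:
    # Keep the ids that term1 has and term2 does not
    if tweetid not in excluded:
        answer.append(tweetid)
return answer
```
- Return the result list answer, which holds the ids of the tweets that contain term1 but not term2.

# AND NOT
# Find the tweets that contain the first term but not the second
def bool_and_not(term1, term2):
    # If term1 is not a key of postings, return an empty list
    if term1 not in postings:
        return []
    # If term2 is not a key, return a copy of term1's posting list
    elif term2 not in postings:
        return list(postings[term1])
    # Both terms are keys of postings
    else:
        # Ids to exclude, as a set for fast membership tests
        excluded = set(postings[term2])
        answer = []
        for tweetid in postings[term1]:
            # Keep the ids that term1 has and term2 does not
            if tweetid not in excluded:
                answer.append(tweetid)
        return answer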
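
Because both posting lists are sorted, the difference can also be computed with the two-pointer pattern used for AND, without building a set. This is a sketch of that alternative, not the code used in this lab; `difference_sorted` and the sample lists are hypothetical.

```python
def difference_sorted(p1, p2):
    """Ids in p1 but not in p2; assumes both lists are in ascending order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Present in both lists: skip it
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # p1[i] cannot appear later in p2, so keep it
            answer.append(p1[i])
            i += 1
        else:
            j += 1
    # Whatever remains in p1 has no counterpart in p2
    answer.extend(p1[i:])
    return answer

only_first = difference_sorted([1, 2, 4, 8], [2, 3, 8, 9])
print(only_first)
```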

🐇 Query preprocessing and the search function

🔥 Tokenizing and normalizing the input text

# Tokenize and normalize the input text
def token(doc):
    # Lowercase the input so it is handled uniformly
    doc = doc.lower()
    # Split the text into terms and convert each term to its singular form
    terms = TextBlob(doc).words.singularize()
    # Lemmatize each token and return the result list
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result

🔥 The search function

# Search function
def do_search():
    terms = token(input("Search query >> "))
    # An empty query ends the program
    if terms == []:
        sys.exit()
    # Dispatch on the number of tokens in the query
    if len(terms) == 2:
        # not A
        if terms[0] == "not":
            answer = bool_not(terms[1])
            print(answer)
        else:
            print("input wrong!")
    elif len(terms) == 3:
        # A and B
        if terms[1] == "and":
            answer = bool_and(terms[0], terms[2])
            print(answer)
        # A or B
        elif terms[1] == "or":
            answer = bool_or(terms[0], terms[2])
            print(answer)
        # Three tokens, but not in a supported form
        else:
            print("input wrong!")
    elif len(terms) == 4:
        # A and not B
        if terms[1] == "and" and terms[2] == "not":
            answer = bool_and_not(terms[0],terms[3])
            print(answer)
        else:
            print("input wrong!")
    else:
        print("input wrong!")
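🐇 Debugging results

Incorrect results were produced before the posting lists were sorted; the errors were mainly in the and query.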


📚 Issues Encountered

[nltk_data] Error loading punkt: [WinError 10060]
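[nltk_data] A connection attempt failed because the connected party did not properly respond after a period of time, or the connected host has failed to respond.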
False


Reference blog post:

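Final solution: place the downloaded nltk_data files in the location shown below, which resolved the problem.
[Screenshot: location of the nltk_data folder]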

Link to the related files

Final code

import sys
import bisect
from collections import defaultdict
from textblob import TextBlob
from textblob import Word

# Field-name tokens that are not real terms
uselessTerm = ["text", "tweetid"]
# Inverted index: term -> ascending list of tweetids
postings = defaultdict(list)

# Tokenization and preprocessing
def tokenize_tweet(document):
    # Lowercase everything so that queries are case-insensitive
    document = document.lower()
    # Find the position indices of the key fields from their markers
    # The -1 and -3 offsets adjust whether the surrounding quotes and commas fall inside the slice
    a = document.index("tweetid") - 1
    b = document.index("errorcode") - 1
    c = document.index("text") - 1
    d = document.index("timestr") - 3
    # Keep only the tweetid and text parts of the tweet document
    document = document[a:b] + document[c:d]
    # print(document)
    # Tokenize and convert each word to its singular form
    terms = TextBlob(document).words.singularize()
    # print(terms)
    # Lemmatize each token and keep only the valid terms
    result = []
    for word in terms:
        # Wrap the current token in a Word object
        expected_str = Word(word)
        # Reduce verbs to their base form
        expected_str = expected_str.lemmatize("v")
        if expected_str not in uselessTerm:
            # Skip ["text", "tweetid"]; append everything else to result
            result.append(expected_str)
    # print(result)
    return result

# Build the inverted index: each key is a valid term, and its value is
# the list of tweetids of the tweet documents that contain the term
def get_postings():
    global postings
    # Read the file; each tweet becomes one element of lines
    with open(r"Tweets.txt") as content:
        lines = content.readlines()
    for line in lines:
        # Preprocess
        line = tokenize_tweet(line)
        # The first element of the processed term list is the tweetid
        tweetid = line[0]
        # Remove it so it is not indexed as a term
        line.pop(0)
        # Convert the term list to a set to keep each term once
        unique_terms = set(line)
        # Visit every term
        for key_word in unique_terms:
            if key_word in postings.keys():
                # The term is already a key of postings:
                # insert the tweetid at its sorted position
                bisect.insort(postings[key_word], tweetid)
            else:
                # The term is new:
                # create a new key-value pair
                postings[key_word] = [tweetid]
    # data = open("postings.txt", 'w', encoding="utf-8")
    # print(postings, file=data)

# Boolean operations start here
# AND
# Find the tweets that contain both the first and the second term.
def bool_and(term1, term2):
    global postings
    answer = []
    # If either term is not a key, return an empty list
    if (term1 not in postings) or (term2 not in postings):
        return answer
    # Both terms are keys of postings
    else:
        # Lengths of the two posting lists
        length_1 = len(postings[term1])
        length_2 = len(postings[term2])
        # x and y are the current positions in the two posting lists
        x = 0
        y = 0
        while x < length_1 and y < length_2:
            # Equal elements: this tweet contains both term1 and term2
            if postings[term1][x] == postings[term2][y]:
                # Add the tweetid to the result list
                answer.append(postings[term1][x])
                # Advance both x and y to compare the next pair
                x += 1
                y += 1
            # The element in term1's list is smaller than the one in term2's list
            elif postings[term1][x] < postings[term2][y]:
                # Advance x to compare the next element
                x += 1
            else:
                # Otherwise advance y
                y += 1
        return answer

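# OR
# Find the tweets that contain at least one of the two terms
def bool_or(term1, term2):
    if (term1 not in postings) and (term2 not in postings):
        return []
    elif term2 not in postings:
        return list(postings[term1])
    elif term1 not in postings:
        return list(postings[term2])
    else:
        # Both terms are keys of postings: merge the two posting lists
        # Start from a copy of term1's list so postings itself is not modified
        answer = list(postings[term1])
        # Set of ids already in the result, for fast membership tests
        seen = set(answer)
        for item in postings[term2]:
            # Then fill in the ids that only term2 has
            if item not in seen:
                answer.append(item)
        # Restore ascending order
        answer.sort()
        return answer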

# NOT
# Return all tweets that do not contain term1
def bool_not(term1):
    # Collect the id of every indexed tweet
    all_ids = set()
    for ids in postings.values():
        all_ids.update(ids)
    # Ids of the tweets that contain term1 (none if term1 is not indexed)
    excluded = set(postings.get(term1, []))
    # Everything else, in ascending order
    return sorted(all_ids - excluded)

# AND NOT
# Find the tweets that contain the first term but not the second
def bool_and_not(term1, term2):
    # If term1 is not a key of postings, return an empty list
    if term1 not in postings:
        return []
    # If term2 is not a key, return a copy of term1's posting list
    elif term2 not in postings:
        return list(postings[term1])
    # Both terms are keys of postings
    else:
        # Ids to exclude, as a set for fast membership tests
        excluded = set(postings[term2])
        answer = []
        for tweetid in postings[term1]:
            # Keep the ids that term1 has and term2 does not
            if tweetid not in excluded:
                answer.append(tweetid)
        return answer

# Tokenize and normalize the input text
def token(doc):
    # Lowercase the input so it is handled uniformly
    doc = doc.lower()
    # Split the text into terms and convert each term to its singular form
    terms = TextBlob(doc).words.singularize()
    # Lemmatize each token and return the result list
    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)
    return result

# Search function
def do_search():
    terms = token(input("Search query >> "))
    # An empty query ends the program
    if terms == []:
        sys.exit()
    # Dispatch on the number of tokens in the query
    if len(terms) == 2:
        # not A
        if terms[0] == "not":
            answer = bool_not(terms[1])
            print(answer)
        else:
            print("input wrong!")
    elif len(terms) == 3:
        # A and B
        if terms[1] == "and":
            answer = bool_and(terms[0], terms[2])
            print(answer)
        # A or B
        elif terms[1] == "or":
            answer = bool_or(terms[0], terms[2])
            print(answer)
        # Three tokens, but not in a supported form
        else:
            print("input wrong!")
    elif len(terms) == 4:
        # A and not B
        if terms[1] == "and" and terms[2] == "not":
            answer = bool_and_not(terms[0],terms[3])
            print(answer)
        else:
            print("input wrong!")
    else:
        print("input wrong!")

def main():
    get_postings()
    while True:
        do_search()

if __name__ == "__main__":
    main()