Information Retrieval and Data Mining: Inverted Index

Soul fragments

Category: Information Retrieval and Data Mining. Tags: information retrieval, inverted index, Boolean query

Article link: https://blog.csdn.net/weixin_43943977/article/details/102297275

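Information Retrieval Lab Report

[Computer Science][Experiment 1]

Experiment Title

Inverted Index and Boolean Query

Experiment Content

Build an inverted index for the given Tweets dataset;
Implement the Boolean Retrieval Model and test it with the TREC 2014 test topics;
The Boolean Retrieval Model must support and, or, and not; query optimization is optional;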

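Experiment Procedure

Data Preprocessing

First, a look at the raw data format:

[Figure: raw data format]

The dataset is organized one tweet per record. Each tweet has the attributes userName, clusterNo, text, timeStr, tweetId, errorCode, textCleaned, and relevance.
Our goal is to build an inverted index, and the information needed for that is mainly userName, text, and tweetId. During preprocessing I therefore read the dataset one tweet at a time in Python and slice each string to separate out these attributes.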

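Core code

  lines = f.readlines()
  for line in lines:
      # tweetid, errorcode, username, clusterno, text, timestr are the character offsets of the field names in this line
      line = line[tweetid:errorcode] + line[username:clusterno] + line[text:timestr]  # preprocessing: slice out the needed fields
      terms = TextBlob(line).words.singularize()  # tokenize and singularize
      terms = [word.lemmatize("v") for word in terms]  # reduce each word to its verb lemma (WordList.lemmatize takes no POS argument)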

The preprocessed text looks as follows; only the key information is kept:

[Figure: preprocessed text]
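
The slicing step can be tried out in isolation. What follows is a minimal sketch, not the author's exact code: it assumes each record is a single line in which the field names appear literally, and the sample record is invented for illustration because the real layout is only visible in the screenshot. A plain regular expression stands in for TextBlob so that the sketch needs only the standard library, which means it does no singularization or lemmatization.

```python
import re

# Hypothetical record, invented for illustration only.
SAMPLE = (
    '{"userName": "Alice Example", "clusterNo": 82, '
    '"text": "House may kill immigration bill", "timeStr": "Sun Feb 20 2011", '
    '"tweetId": "1001", "errorCode": "200", '
    '"textCleaned": "house may kill immigration bill", "relevance": 2}'
)

# Field names that survive the slicing and must not be indexed.
USELESS_TERMS = {"username", "text", "tweetid"}


def extract_fields(line):
    """Keep only the tweetId, userName and text regions of one record."""
    line = line.lower()
    a = line.index("username")
    b = line.index("clusterno")
    c = line.rindex("tweetid")
    d = line.rindex("errorcode")
    e = line.index("text")
    f = line.index("timestr")
    return line[c:d] + line[a:b] + line[e:f]


def tokenize(line):
    """Split the extracted regions into lowercase alphanumeric tokens."""
    words = re.findall(r"[a-z0-9]+", extract_fields(line))
    return [w for w in words if w not in USELESS_TERMS]


print(tokenize(SAMPLE))
```

Because the first token of every record is the tweet id, the caller can take `tokens[0]` as the document id and index the remaining tokens. One side effect of filtering by field name is that a tweet containing the literal word "text" loses that word as well.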

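Building the Index

Create a dictionary named postings to hold the entire inverted index. For every word of every preprocessed tweet, append the corresponding tweetid to that word's posting list.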

      # build the index
      for word in terms:
          if word in postings:
              postings[word].append(tweetid)
          else:
              postings[word] = [tweetid]

Part of the finished index is shown below:

[Figure: excerpt of the inverted index]
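
The index construction can also be written as a small self-contained function. This is a sketch under a few assumptions rather than the code used in the experiment: `build_index` and the sample documents are hypothetical names and data, and ids are collected in sets and then sorted. The append-based version above records an id twice when a word occurs twice in one tweet, and its lists are only sorted if the file happens to be in id order; sorted, duplicate-free lists are what a pointer-based merge expects.

```python
from collections import defaultdict


def build_index(docs):
    """Build term -> sorted list of unique tweet ids.

    docs is an iterable of (tweet_id, terms) pairs.
    """
    postings = defaultdict(set)
    for tweet_id, terms in docs:
        for term in terms:
            postings[term].add(tweet_id)  # a set ignores repeated words in one tweet
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical documents, deliberately out of id order.
docs = [
    (3, ["house", "bill"]),
    (1, ["house", "house", "vote"]),
    (2, ["bill"]),
]
index = build_index(docs)
print(index["house"])
```

If the tweet ids are kept as strings, the lists are sorted lexicographically rather than numerically; that is fine for merging as long as every list uses the same ordering.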

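Boolean Query

Single-term Boolean query
First check whether the given term is in postings. If it is, answer = postings[term]; otherwise answer = [].
Multi-term Boolean query
For a query joined by and/or, look up each word separately, then walk the resulting id sequences simultaneously with pointers so that the lists are merged in linear time.
When three or more terms are joined, each word can still be looked up first, but when merging pairwise it is better to merge the two shortest lists first.
For a query involving not, the approach used here is to iterate once more over every id in the list already retrieved and drop the ids that also appear in the other word's list.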
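
The pointer-based merge described above can be sketched as follows. The function names `intersect`, `union`, and `intersect_many` are chosen here for illustration and do not come from the original code, and the sketch assumes every posting list is sorted and free of duplicates.

```python
def intersect(p1, p2):
    """AND: two-pointer merge, O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1, p2):
    """OR: two-pointer merge that keeps every id exactly once."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    out.extend(p1[i:])  # at most one of these two tails is non-empty
    out.extend(p2[j:])
    return out


def intersect_many(lists):
    """AND over several lists, starting with the shortest ones."""
    ordered = sorted(lists, key=len)
    if not ordered:
        return []
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:  # nothing left to match, stop early
            break
        result = intersect(result, plist)
    return result


print(intersect([1, 3, 5, 7], [3, 4, 5, 8]))
print(union([1, 3, 5, 7], [3, 4, 5, 8]))
```

Starting with the shortest lists keeps the intermediate result small, because an intersection can never be longer than its shortest input.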

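    answer = []
    for tweet_id in postings[term1]:  # every id that matches term1
        if tweet_id not in postings[term2]:  # keep it only if term2's list does not contain it
            answer.append(tweet_id)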
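
Testing membership with `not in` on a Python list scans that list every time, so the loop above costs roughly the product of the two list lengths. If both posting lists are sorted and duplicate-free (an assumption, not something the original code guarantees), the same result can be obtained with one linear pass. The function name `difference` is hypothetical.

```python
def difference(p1, p2):
    """p1 AND NOT p2 on sorted lists, in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1):
        if j >= len(p2) or p1[i] < p2[j]:
            out.append(p1[i])  # p1[i] cannot occur in the rest of p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1             # excluded id, skip it in both lists
            j += 1
        else:
            j += 1             # advance p2 until it catches up
    return out


print(difference([1, 3, 5, 7], [3, 4, 7]))
```

A simpler alternative is to convert the second list to a `set` once and filter the first list against it, which is also linear on average but needs extra memory.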

Taking part of the TREC 2014 test data as an example, the query results look like this:

[Figures: query results]
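
To show how the pieces fit together, here is a rough sketch of a query evaluator. The query syntax is an assumption, since the original post does not show how queries are parsed: the sketch takes a flat, whitespace-separated string, treats `and`, `or`, and `not` as operators, and evaluates strictly left to right without precedence or parentheses. `evaluate` and the sample index are hypothetical, and Python sets are used instead of pointer merges to keep the example short.

```python
def evaluate(query, postings):
    """Evaluate a flat Boolean query such as 'a and b not c' left to right."""
    result = None  # running answer; None until the first term is seen
    op = "and"     # operator to apply to the next term
    for token in query.lower().split():
        if token in ("and", "or", "not"):
            op = token  # in 'a and not b' the later 'not' overrides 'and'
            continue
        ids = set(postings.get(token, []))
        if result is None:
            result = ids  # a leading 'not' is not supported by this sketch
        elif op == "and":
            result &= ids
        elif op == "or":
            result |= ids
        else:
            result -= ids
        op = "and"  # adjacent terms without an operator are ANDed
    return sorted(result or [])


# Hypothetical index for demonstration.
sample_postings = {"ron": [1, 2, 3], "weasley": [2, 3, 4], "birthday": [3, 5]}
print(evaluate("ron and weasley not birthday", sample_postings))
```

Because the last operator before a term wins, `a or not b` is treated as `a not b`; a full implementation would need a proper parser to handle such cases.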
Full code:

import sys
from collections import defaultdict
from textblob import TextBlob
from textblob import Word

uselessTerm = ["username", "text", "tweetid"]  # field names that leak into the token stream
postings = defaultdict(dict)  # inverted index

def tokenize_tweet(document):
    document = document.lower()
    a = document.index("username")
    b = document.index("clusterno")
    c = document.rindex("tweetid") - 1
    d = document.rindex("errorcode")
    e = document.index("text")
    f = document.index("timestr") - 3
    # extract the three main pieces of information: tweetid, username and tweet text
    document = document[c:d] + document[a:b] + document[e:f]  # document is rebound to the extracted text
    # print(document)
    terms = TextBlob(document).words.singularize()

    result = []  # terms kept for this tweet
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")  # reduce verb forms to their lemma
        if expected_str not in uselessTerm:  # drop the leftover field names
            # The source listing is cut off right after "result"; the append and
            # the return below are an assumed completion, and the rest is unavailable.
            result.append(expected_str)
    return result
