Information Retrieval and Data Mining: Inverted Index 2

Information Retrieval Lab Report

Experiment Topic

Ranked retrieval model

Experiment Requirements

  • Implement a basic ranked retrieval model on top of Homework 1.1;
  • Use SMART notation: lnc.ltc (the weighting scheme is written out after this list);
  • Store each term's DF in the Dictionary and posting list;
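
For reference, in SMART notation lnc.ltc the document vector uses logarithmic tf, no idf, and cosine normalization (lnc), while the query vector uses logarithmic tf, idf, and cosine normalization (ltc):

$$
w_{t,d}=\frac{1+\ln \mathrm{tf}_{t,d}}{\sqrt{\sum_{t'\in d}\left(1+\ln \mathrm{tf}_{t',d}\right)^{2}}},\qquad
w_{t,q}=\frac{\left(1+\ln \mathrm{tf}_{t,q}\right)\ln\frac{N}{\mathrm{df}_{t}}}{\sqrt{\sum_{t'\in q}\left(\left(1+\ln \mathrm{tf}_{t',q}\right)\ln\frac{N}{\mathrm{df}_{t'}}\right)^{2}}},\qquad
\mathrm{score}(q,d)=\sum_{t\in q}w_{t,q}\,w_{t,d}
$$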

Experiment Process

  • Storing tf and df

Add the data structures needed to record TF and DF:

postings = defaultdict(dict)
document_frequency = defaultdict(int)

Record tf while iterating over the tweets:

unique_terms = set(line)
for term in unique_terms:
    postings[term][tweetid] = line.count(term)
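
As a side note, calling line.count(term) once per unique term rescans the token list repeatedly; a minimal sketch of an equivalent single pass with collections.Counter (same postings dict and the same line/tweetid variables as above):

from collections import Counter

# Count every token of this tweet in one pass, then copy the counts into postings.
term_counts = Counter(line)            # term -> tf within this tweet
for term, tf in term_counts.items():
    postings[term][tweetid] = tf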

After recording tf, the output file (Invertedtf.txt) looks as follows:
[screenshot: contents of Invertedtf.txt]
Record df:

for term in postings:
    document_frequency[term] = len(postings[term])

After recording df, the output file (Inverteddf.txt) looks as follows:

[screenshot: contents of Inverteddf.txt]
Note: len here counts the number of documents (tweets) in which the term appears.
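
To make the two structures concrete, here is a tiny, purely hypothetical illustration (the tweetids and terms are made up and not taken from the dataset):

from collections import defaultdict

# Toy index over two made-up tweets, showing how postings and document_frequency relate.
postings = defaultdict(dict)
document_frequency = defaultdict(int)

toy_tweets = {"111": ["happy", "happy", "day"], "222": ["happy", "monday"]}
for tweetid, tokens in toy_tweets.items():
    for term in set(tokens):
        postings[term][tweetid] = tokens.count(term)   # tf of term in this tweet
for term in postings:
    document_frequency[term] = len(postings[term])     # number of tweets containing term

print(postings["happy"])             # {'111': 2, '222': 1}
print(document_frequency["happy"])   # 2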

  • Query implementation

  • Finding relevant_tweetids

The query is implemented on top of the existing data structures. The input query string goes through the same tokenization as the tweets. To speed up retrieval and avoid iterating over every tweet to compute each F(q, d), we first extract the tweetids that are relevant to the query, giving the relevant_tweetids list; we then compute a score only for those tweetids and output the top-k relevant tweets in descending order of score.

  unique_query = set(query)
  relevant_tweetids = Union([set(postings[term].keys()) for term in unique_query])
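
Union here is not a Python built-in; it is the small helper defined at the end of the full listing (a functools.reduce over set.union). An equivalent built-in form, shown only as a sketch and assuming the same postings and unique_query as above, would be:

  # Union of the posting-list keys of every query term.
  relevant_tweetids = set().union(*(postings[term].keys() for term in unique_query))

Unlike the reduce-based helper, this form also returns an empty set for an empty query instead of raising, although do_search already exits on an empty query.
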
  • Computing the score

Then, for each relevant tweetid, compute a score from each term in the query:

    for term in unique_query:
        wtq = query.count(term) / len(query)

        if (term in postings) and (id in postings[term].keys()):
            wtd = (1 + math.log(postings[term][id]) * math.log((document_numbers + 1) / document_frequency[term]))
            similarity = wtq*wtd
  • Sorting

Call the sorted function to rank the scores:

        scores = sorted([(id, similarity_PLN(query, id))
                         for id in relevant_tweetids],
                        key=lambda x: x[1],
                        reverse=True)
  • Output

Output the top k results:

        i = 1
        for (id, score) in scores:
            if i <= 10:
                result.append(id)
                print(str(score) + ": " + id)
                i = i + 1
            else:
                break
        print("finished")

Running the program gives the experiment results:

Search query >> you and me

<<<<<Score(PLN)--Tweetid>>>>>

A total of 5955 relevant tweets! The top 10 are:

-2.7199255236867512: 31906716899610625

-2.7199255236867512: 32102808354295808

-2.7199255236867512: 311720996262514688

-2.7199255236867512: 307623685546708992

-2.7199255236867512: 310524290028158976

-2.7199255236867512: 305453745569943553

-2.94681270696307: 302167869108654080

-2.94681270696307: 297854975344791553

-3.51945038316577: 31880216670380032

-3.51945038316577: 624975258089619456

finished

Code

import sys
import math
from functools import reduce  # py3
from textblob import TextBlob
from textblob import Word
from collections import defaultdict

uselessTerm = ["username", "text", "tweetid"]
postings = defaultdict(dict)
document_frequency = defaultdict(int)
document_lengths = defaultdict(int)
document_numbers = len(document_lengths)
avdl = 0


def main():
    get_postings_dl()
    initialize_document_frequencies()
    initialize_avdl()
    # print(postings)
    # print(document_lengths)
    # print("平均tweet 长度为:" + str(avdl))
    # 可以修改do_search方法中的返回数据,得到使用不同模型的result
    # my_result_PLN.txt/my_result_BM25.txt/
    #result_name = "my_result_PLN_BM25.txt"

    #get_result(result_name)
    while True:
        do_search()

def tokenize_tweet(document):
    global uselessTerm
    document = document.lower()
    a = document.index("username")
    b = document.index("clusterno")
    c = document.rindex("tweetid") - 1
    d = document.rindex("errorcode")
    e = document.index("text")
    f = document.index("timestr") - 3
    # Extract the three main fields: tweetid, username, and tweet text
    document = document[c:d] + document[a:b] + document[e:f]
    terms = TextBlob(document).words.singularize()

    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        if expected_str not in uselessTerm:
            result.append(expected_str)

    return result


# Same tokenization as in the previous assignment (Homework 1.1)


def get_postings_dl():
    global postings, document_lengths
    f = open(
        r"C:\Users\ASUS\Desktop\tweets.txt")
    lines = f.readlines()  # read the whole file

    for line in lines:
        line = tokenize_tweet(line)
        tweetid = line[0]  # record the tweetid, then pop it from the token list
        line.pop(0)
        document_lengths[tweetid] = len(line)  # record the number of tokens as the document length
        unique_terms = set(line)
        for te in unique_terms:
            postings[te][tweetid] = line.count(te)
    # Sorting postings by key would return a list and lose the key-value structure, so it stays commented out
    # postings = sorted(postings.items(),key = lambda asd:asd[0],reverse=False)
    mylog = open(r"C:\Users\ASUS\Desktop\Invertedtf.txt", mode='a', encoding='utf-8')

    print(postings,file=mylog)


def initialize_document_frequencies():
    global document_frequency, postings
    for term in postings:
        document_frequency[term] = len(postings[term])
    mylog2 = open(r"C:\Users\ASUS\Desktop\Inverteddf.txt", mode='a', encoding='utf-8')
    print(document_frequency,file=mylog2)

# Compute the average document (tweet) length
def initialize_avdl():
    global document_lengths, avdl
    count = 0
    for twid in document_lengths:
        count += document_lengths[twid]
    avdl = count / len(document_lengths)


# Tokenize and normalize the input query
def token(doc):
    doc = doc.lower()
    terms = TextBlob(doc).words.singularize()

    result = []
    for word in terms:
        expected_str = Word(word)
        expected_str = expected_str.lemmatize("v")
        result.append(expected_str)

    return result


# Run the search for every query and write the results to a file
def get_result(file_name):
    with open(file_name, 'w', encoding='utf-8') as f_out:

        Quaries = get_queries()
        qkeys = Quaries.keys()
        # the keys are the query ids
        for key in qkeys:

            q_result = do_search(Quaries[key])
            # Quaries[key] is the query string for this id
            for tweetid in q_result:
                f_out.write(str(key) + ' ' + tweetid + '\n')


# queries maps each query id to its query string
def get_queries():
    # Input: qrels2014.txt
    # Output: query id + query string for each query
    # This part can be ignored for interactive search
    queries = {}
    keyid = 171
    fq = open(
        r"C:\Users\ASUS\Desktop\qrels2014.txt")
    lines = fq.readlines()
    for line in lines:
        index1 = line.find("<query>")
        if index1 >= 0:
            # extract the query text between <query> and </query>
            index1 += 8
            index2 = line.find("</query>")
            # print(line[index1:index2])
            # add it to the queries dict
            queries[keyid] = line[index1:index2]
            keyid += 1
    return queries


# The two functions above run the queries from the given qrels file; delete them if they are not needed
def do_search():
    query = token(input("Search query >> "))
    result = []  # the ranked list of tweetids returned for the query
    #query = token(query)

    if query == []:
        sys.exit()

    unique_query = set(query)
    # To avoid scanning every tweet, first collect the relevant tweetids: a tweet is relevant if it contains at least one query term
    relevant_tweetids = Union([set(postings[term].keys()) for term in unique_query])

    # print(relevant_tweetids)
    print ("<<<<<Score(PLN)--Tweeetid>>>>>")
    print("PLN一共有"+str(len(relevant_tweetids))+"条相关tweet!")
    if not relevant_tweetids:
        print("No tweets matched any query terms for")
        print(query)
    # The core scoring and ranking logic follows
    else:
        # PLN
        scores3 = sorted([(id, similarity_PLN(query, id)) for id in relevant_tweetids], key=lambda x: x[1], reverse=True)
        # BM25
        #        scores2 = sorted([(id,similarity_BM25(query,id))
        #                         for id in relevant_tweetids],
        #                        key=lambda x: x[1],
        #                        reverse=False)

        # PLN+BM25
        '''scores3 = sorted([(id, similarity_BM25(query, id) + similarity_PLN(query, id))
                          for id in relevant_tweetids],
                         key=lambda x: x[1],
                         reverse=False)'''
        i = 1
        for (id, score) in scores3:
            if i<=10:
                result.append(id)
                print(str(score) + ": " + id)
                i = i + 1
            else:
                break
        print("finished")
   # return result

#    for (id,score) in scores1:
#        print (str(score)+": "+id)
#
#    print ("<<<<<Score(BM25)--Tweeetid>>>>>")
#    print("BM25一共有"+str(len(scores2))+"条相关tweet!")
#    for (id,score) in scores2:
#        print (str(score)+": "+id)

def similarity_PLN(query, id):
    global postings, avdl
    fenmu = 1 - 0.1 + 0.1 * (document_lengths[id] / avdl)
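    # fenmu ("denominator") is the pivoted length-normalization factor
    # 1 - b + b * (|d| / avdl) with b = 0.1; it is only used by the
    # commented-out variant below, not by the weight that is returned.
    # The weight actually computed per matching term is
    #   wtq = tf(t, q) / |q|   and   wtd = 1 + ln(tf(t, d)) * ln((N + 1) / df(t)).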
    similarity = 0.0
    unique_query = set(query)
    for term in unique_query:
        wtq = query.count(term) / len(query)

        if (term in postings) and (id in postings[term].keys()):
            wtd = (1 + math.log(postings[term][id]) * math.log((document_numbers + 1) / document_frequency[term]))
            similarity = wtq*wtd
            # With ln(1 + ln(C(w,d) + 1)) the relevance scores all came out negative and very small
            #similarity += ((query.count(term)) * (math.log(math.log(postings[term][id] + 1) + 1)) * math.log(
               #(document_numbers + 1) / document_frequency[term])) / fenmu

    return similarity


#

def similarity_BM25(query, id):
    global postings, avdl
    fenmu = 1 - 0.2 + 0.2 * (document_lengths[id] / avdl)
    k = 1
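    # BM25-style weight as implemented here: for each matching query term,
    #   tf(t, q) * (k + 1) * C(t, d) * ln((N + 1) / df(t)) / (k * fenmu + C(t, d))
    # with k = 1 and fenmu = 1 - b + b * (|d| / avdl), b = 0.2.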
    similarity = 0.00

    unique_query = set(query)
    for term in unique_query:
        if (term in postings) and (id in postings[term].keys()):
            C_wd = postings[term][id]
            # With ln(1 + ln(C(w,d) + 1)) the relevance scores all came out negative and very small
            similarity += (query.count(term) * (k + 1) * C_wd * math.log(
                (document_numbers + 1) / document_frequency[term])) / (k * fenmu + C_wd)

    return similarity


def Union(sets):
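    # Folds set.union over the given sets, e.g. Union([{"a"}, {"a", "b"}]) -> {"a", "b"}.
    # Note: reduce raises TypeError on an empty list, but do_search exits earlier on an empty query.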
    return reduce(set.union, [s for s in sets])


if __name__ == "__main__":
    main()
