Probabilistic information retrieval
Abstract
This chapter mainly introduces the probabilistic approach to information retrieval, which provides a different formal basis for a retrieval model and results in different techniques for setting term weights.
Basic probability theory
- Chain rule
- Partition rule
- Bayes' rule
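For reference, the three rules in their standard two-event forms (these are textbook probability identities, stated here for completeness):

$$ P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \qquad \text{(chain rule)} $$

$$ P(B) = \sum_i P(B \mid A_i)\,P(A_i), \quad \{A_i\} \text{ a partition of the sample space} \qquad \text{(partition rule)} $$

$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad \text{(Bayes' rule)} $$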
PRP
A probabilistic model is used to estimate the probability of relevance P(R = 1|d, q) for each document–query pair, and the results are then ranked by this probability.
In other words, if documents are returned ranked by their probability of relevance to the query, and these probabilities are estimated as accurately as possible from the available data, then the returned results are the best obtainable.
0/1 loss case
If a nonrelevant document is returned, or a relevant document fails to be returned, one point is lost (such a binary-valued accuracy setting is often called 1/0 loss).
The goal of retrieval is then, for any given k, to return the k documents with the highest probability of relevance. That is, the PRP ranks all documents in decreasing order of P(R = 1|d, q).
When an unordered document set is returned instead of a ranking, the decision can be made with the Bayes optimal decision rule, which minimizes the expected loss: return exactly those documents that are more likely to be relevant than nonrelevant.
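In symbols, the Bayes optimal decision rule for this 1/0 loss case is:

$$ d \text{ is returned} \iff P(R = 1 \mid d, q) > P(R = 0 \mid d, q) $$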
Retrieval costs
Let C1 be the cost of retrieving a relevant document and C0 the cost of retrieving a nonrelevant document. The PRP then says that if, for a specific document d and for all documents d′ not yet retrieved,

$$ C_1 \cdot P(R = 1 \mid d) + C_0 \cdot P(R = 0 \mid d) \le C_1 \cdot P(R = 1 \mid d') + C_0 \cdot P(R = 0 \mid d') $$

then d is the next document to be retrieved.
This gives a formal framework in which we can model differential costs of false positives and false negatives, and even system performance issues, at the modeling stage.
BIM
Binary Independence Model.
The BIM computes the conditional probability P(R = 1|x, q), where the document d is represented by its binary term incidence vector x, by expanding it with Bayes' rule.
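Concretely, the expansion (standard in the BIM derivation) is:

$$ P(R=1 \mid \vec{x}, q) = \frac{P(\vec{x} \mid R=1, q)\,P(R=1 \mid q)}{P(\vec{x} \mid q)}, \qquad P(R=0 \mid \vec{x}, q) = \frac{P(\vec{x} \mid R=0, q)\,P(R=0 \mid q)}{P(\vec{x} \mid q)} $$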
Here, P(x|R = 1, q) and P(x|R = 0, q) are the probability that if a relevant or nonrelevant, respectively, document is retrieved, then that document’s representation is x.
Note:
- Define Ranking Function RSV(Q, D)
The resulting quantity used for ranking is called the retrieval status value (RSV) in this model:
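Following the standard BIM derivation (with p_t and u_t as defined just below), the RSV is:

$$ RSV_d = \sum_{t:\,x_t = q_t = 1} \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)} $$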
Let pt = P(xt = 1|R = 1, q) be the probability of a term appearing in a document relevant to the query, and ut = P(xt = 1|R = 0, q) be the probability of a term appearing in a nonrelevant document.
These quantities can be visualized in the following contingency table where the columns add to 1:
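Reconstructed from the definitions of p_t and u_t above:

|                      | R = 1 (relevant) | R = 0 (nonrelevant) |
| -------------------- | ---------------- | ------------------- |
| term present, xt = 1 | pt               | ut                  |
| term absent, xt = 0  | 1 − pt           | 1 − ut              |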
- Define ct
The ct terms are log odds ratios for the terms in the query.
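That is:

$$ c_t = \log\frac{p_t}{1 - p_t} + \log\frac{1 - u_t}{u_t} = \log\frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}, \qquad RSV_d = \sum_{t:\,x_t = q_t = 1} c_t $$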
- BIM Formula Derivation
Think of the probability of observing D as this formula:
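In this document's notation, a natural form of this factorization (assuming the standard Bernoulli model over term presence and absence) is:

$$ P(D \mid R=1) = \prod_{t_i \in D} P(t_i \mid R=1) \prod_{t_i \notin D} \big(1 - P(t_i \mid R=1)\big) $$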
P(ti|R=1) is the probability of ti appearing in the document in the relevant case.
Note: the complement 1 − P(ti|R=1) is the probability of ti not appearing in a relevant document; each P(ti|R=1) is a per-term (Bernoulli) probability, so the values P(ti|R=1) do not sum to 1 across terms.
Probability estimates in theory
This is a contingency table of counts of documents in the collection, where df_t is the number of documents that contain term t:
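With S the number of relevant documents and s the number of relevant documents containing t, the standard table is:

|                      | relevant | nonrelevant         | total    |
| -------------------- | -------- | ------------------- | -------- |
| term present, xt = 1 | s        | dft − s             | dft      |
| term absent, xt = 0  | S − s    | (N − dft) − (S − s) | N − dft  |
| total                | S        | N − S               | N        |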
pt = s/S and ut = (dft − s)/(N − S)
To avoid the possibility of zeroes, it is fairly standard to add 1/2 to each of the quantities.
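With this smoothing (adding 1/2 to the four central cells and adjusting the marginals), the estimates become:

$$ p_t = \frac{s + \frac{1}{2}}{S + 1}, \qquad u_t = \frac{df_t - s + \frac{1}{2}}{N - S + 1} $$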
Probability estimates in practice
Under the assumption that relevant documents are a very small percentage of the collection, it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection.
so the probability of term occurrence in nonrelevant documents for a query, ut, is df_t/N, and then log[(1 − ut)/ut] = log[(N − df_t)/df_t] ≈ log(N/df_t), which recovers the familiar idf weight.
The quantity pt can be estimated in several ways:
- use the frequency of term occurrence in known relevant documents
- use a constant, as in the combination match model (described below)
For instance, we might assume that pt is constant over all terms xt in the query and that pt = 0.5. This means that each term has even odds of appearing in a relevant document, and so the pt and (1 − pt) factors cancel out in the expression for RSV.
- a much better estimate is found by simply estimating pt from collection-level statistics about the occurrence of t, as pt = df_t/N.
Probabilistic approaches to relevance feedback
Probabilistic relevance feedback (RF) works as an iterative process with five steps.
- Steps 1-4:
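Following the standard BIM relevance feedback procedure, the first four steps are:

1. Guess initial estimates of pt and ut (for example, pt = 1/2 and ut = df_t/N).
2. Use the current estimates of pt and ut to rank documents, and present the current best guess V at the relevant set to the user.
3. The user judges some subset of V, partitioning it into a relevant set VR and a nonrelevant set VNR.
4. Re-estimate pt and ut from the judged documents, e.g. pt = |VRt| / |VR|, where VRt is the set of documents in VR that contain term t.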
In practice, these estimates need smoothing; a common choice is pt = (|VRt| + 1/2) / (|VR| + 1).
However, the set of documents judged by the user (V) is usually very small, and so the resulting statistical estimate is quite unreliable (noisy), even if the estimate is smoothed. So it is often better to combine the new information with the original guess in a process of Bayesian updating.
In this case we have:
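Following the standard Bayesian updating formula, with p_t^{(k)} the k-th estimate of pt and κ the weight given to the prior:

$$ p_t^{(k+1)} = \frac{|VR_t| + \kappa\,p_t^{(k)}}{|VR| + \kappa} $$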
- Step 5: repeat the whole process of estimating and ranking, iterating until the user is satisfied or the estimates converge.
Major Assumptions
- BIM
- a Boolean representation of documents/queries/relevance
- term independence
- terms not in the query don’t affect the outcome
- document relevance values are independent
Tree-structured dependencies between terms
Some of the assumptions of the BIM can be removed. For example, the term independence assumption can be relaxed into tree-structured dependencies between terms: a term xi is directly dependent on a term xk if there is an arrow xk → xi.
BM25
The simplest score for document d is just the idf weighting of the query terms present:
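In its standard form:

$$ RSV_d = \sum_{t \in q} \log \frac{N}{df_t} $$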
An alternative idf formulation is as follows:
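This is the smoothed (RSJ-style) idf:

$$ RSV_d = \sum_{t \in q} \log \frac{N - df_t + \frac{1}{2}}{df_t + \frac{1}{2}} $$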
If a term occurs in over half the documents in the collection, then this model gives a negative term weight, which is presumably undesirable.
Factoring in the frequency of each term and the document length gives the core BM25 weighting:
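With tf_td the term's frequency in d, L_d and L_ave the document and average document lengths, and tuning parameters k1 (term-frequency scaling) and b (length normalization), the standard form is:

$$ RSV_d = \sum_{t \in q} \log\left(\frac{N}{df_t}\right) \cdot \frac{(k_1 + 1)\,tf_{td}}{k_1\left((1 - b) + b \cdot \frac{L_d}{L_{ave}}\right) + tf_{td}} $$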
For long queries, an analogous factor (k3 + 1)·tf_tq / (k3 + tf_tq) can also be applied to the query term frequency tf_tq. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.
If we have relevance judgments available, we can use the full form in place of the approximation log(N/dft) introduced above:
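Following the standard full form (with VR and VNR the judged relevant and nonrelevant document sets, and VRt the documents in VR containing t):

$$ RSV_d = \sum_{t \in q} \log \frac{\big(|VR_t| + \frac{1}{2}\big) \,/\, \big(|VNR_t| + \frac{1}{2}\big)}{\big(df_t - |VR_t| + \frac{1}{2}\big) \,/\, \big(N - df_t - |VR| + |VR_t| + \frac{1}{2}\big)} \cdot \frac{(k_1 + 1)\,tf_{td}}{k_1\left((1 - b) + b \cdot \frac{L_d}{L_{ave}}\right) + tf_{td}} $$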
- Implementation in Python
```python
import math


class BM25(object):
    def __init__(self, docs):
        self.D = len(docs)  # number of documents in the collection
        self.avgdl = sum(len(doc) for doc in docs) / self.D  # average document length
        self.docs = docs
        self.f = []    # one dict per document: term -> frequency in that document
        self.df = {}   # term -> number of documents containing the term
        self.idf = {}  # term -> idf value
        self.k1 = 1.5
        self.b = 0.75
        self.init()

    def init(self):
        for doc in self.docs:
            tmp = {}
            for word in doc:
                tmp[word] = tmp.get(word, 0) + 1  # term frequency within this document
            self.f.append(tmp)
            for k in tmp.keys():
                self.df[k] = self.df.get(k, 0) + 1
        for k, v in self.df.items():
            # RSJ-style smoothed idf: log((N - df + 0.5) / (df + 0.5))
            self.idf[k] = math.log(self.D - v + 0.5) - math.log(v + 0.5)

    def sim(self, doc, index):
        # BM25 score of document `index` for the tokenized query `doc`
        score = 0
        for word in doc:
            if word not in self.f[index]:
                continue
            d = len(self.docs[index])
            score += (self.idf[word] * self.f[index][word] * (self.k1 + 1)
                      / (self.f[index][word]
                         + self.k1 * (1 - self.b + self.b * d / self.avgdl)))
        return score

    def simall(self, doc):
        # score the query against every document in the collection
        scores = []
        for index in range(self.D):
            scores.append(self.sim(doc, index))
        return scores
```
Code adapted from: https://www.jianshu.com/p/1e498888f505
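A quick usage sketch (the toy corpus and query below are made up for illustration; documents must be pre-tokenized into lists of words):

```python
# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["probabilistic", "information", "retrieval"],
    ["bm25", "is", "a", "probabilistic", "ranking", "function"],
    ["vector", "space", "models", "use", "tf", "idf", "weights"],
]

bm25 = BM25(corpus)
query = ["probabilistic", "ranking"]
print(bm25.simall(query))  # one BM25 score per document; highest = best match
```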
Bayesian network approaches to information retrieval
The model decomposes into two parts: a document collection network and a query network.
The concepts are a thesaurus-based expansion of the terms appearing in the document.
The query network maps from query terms to query sub-expressions to the user's information need.