Notes on Chapter 11 of Introduction to Information Retrieval (in English)

Probabilistic information retrieval

Abstract

This chapter mainly introduces the probabilistic approach to information retrieval, which provides a different formal basis for a retrieval model and results in different techniques for setting term weights.

Basic probability theory

  • CHAIN RULE

P(A, B) = P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)

  • PARTITION RULE

P(B) = P(A, B) + P(Ā, B), where Ā is the complement of the event A

  • BAYES RULE

P(A | B) = P(B | A) P(A) / P(B) = P(B | A) P(A) / [ P(B | A) P(A) + P(B | Ā) P(Ā) ]
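As a quick numerical check of these three rules, here is a minimal Python sketch; all the probabilities below are invented purely for illustration.

# Toy illustration of the chain, partition and Bayes rules (numbers are made up).
p_A = 0.3                      # P(A)
p_B_given_A = 0.8              # P(B|A)
p_B_given_notA = 0.2           # P(B|not A)

p_AB = p_B_given_A * p_A                     # chain rule: P(A,B) = P(B|A)P(A)
p_B = p_AB + p_B_given_notA * (1 - p_A)      # partition rule: P(B) = P(A,B) + P(not A,B)
p_A_given_B = p_B_given_A * p_A / p_B        # Bayes rule
print(p_A_given_B)                           # 0.24 / 0.38 ≈ 0.632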

PRP (Probability Ranking Principle)

The probabilistic model estimates the probability of relevance P(R = 1 | d, q) for each document and query, and then ranks documents by this probability.

In short, if documents are returned ranked by their probability of relevance to the query, and these probabilities are estimated as accurately as possible from the available data, then the returned result is the best among all possible results.

0/1 loss case

If a nonrelevant document is returned, or a relevant document is not returned, 1 point is lost (this binary-valued setting is often called 1/0 loss or 1/0 risk).

The goal of retrieval is to return the top k documents with the highest probability of relevance, for any given value of k. That is, the PRP ranks all documents in decreasing order of P(R = 1 | d, q).
When an unordered document set is returned instead of a ranking, the decision that minimizes expected loss is given by the Bayes optimal decision rule: return exactly those documents that are more likely to be relevant than nonrelevant, i.e. those with P(R = 1 | d, q) > P(R = 0 | d, q).
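A minimal sketch of both ideas, assuming we already have estimates of P(R = 1 | d, q) for each document; the probabilities below are made up.

# Hypothetical relevance probabilities P(R=1|d,q) for four documents.
probs = {"d1": 0.9, "d2": 0.15, "d3": 0.6, "d4": 0.4}

# PRP: rank all documents in decreasing order of P(R=1|d,q).
ranking = sorted(probs, key=probs.get, reverse=True)
print(ranking)                       # ['d1', 'd3', 'd4', 'd2']

# Bayes optimal decision rule for an unranked result set:
# return d iff P(R=1|d,q) > P(R=0|d,q), i.e. P(R=1|d,q) > 0.5.
retrieved = [d for d, p in probs.items() if p > 0.5]
print(retrieved)                     # ['d1', 'd3']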

retrieval costs

Let C1 be the cost of not retrieving a relevant document (a false negative) and C0 the cost of retrieving a nonrelevant document (a false positive). The PRP then says that if, for a specific document d and for all documents d′ not yet retrieved,

C0 · P(R = 0 | d) − C1 · P(R = 1 | d) ≤ C0 · P(R = 0 | d′) − C1 · P(R = 1 | d′)

then d is the next document to be retrieved.

This gives a formal framework in which we can model the differential costs of false positives and false negatives, and even system performance issues, at the modeling stage.
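Under the cost definitions assumed above, choosing the next document amounts to picking the one that minimizes the expected cost expression. A minimal sketch with invented numbers:

# C1: cost of not retrieving a relevant document; C0: cost of retrieving a
# nonrelevant one (both values are illustrative assumptions).
C1, C0 = 3.0, 1.0
probs = {"d1": 0.7, "d2": 0.4, "d3": 0.55}   # hypothetical P(R=1|d,q)

def expected_cost(p_rel):
    # C0*P(R=0|d) - C1*P(R=1|d): lower is better.
    return C0 * (1 - p_rel) - C1 * p_rel

next_doc = min(probs, key=lambda d: expected_cost(probs[d]))
print(next_doc)   # 'd1' -- the document with the lowest expected cost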

BIM

Binary Independence Model.

The BIM represents each document d as a binary term-incidence vector x and computes the conditional probability P(R = 1 | x, q) by expanding it with Bayes' rule:

P(R = 1 | x, q) = P(x | R = 1, q) P(R = 1 | q) / P(x | q)
P(R = 0 | x, q) = P(x | R = 0, q) P(R = 0 | q) / P(x | q)

Here, P(x|R = 1, q) and P(x|R = 0, q) are the probability that if a relevant or nonrelevant, respectively, document is retrieved, then that document’s representation is x.

Note:

Here P(R = 1 | q) and P(R = 0 | q) are the prior probabilities of retrieving a relevant or nonrelevant document for the query q, and since a document is either relevant or nonrelevant to the query, P(R = 1 | x, q) + P(R = 0 | x, q) = 1.

  • Define Ranking Function RSV(Q, D)

The resulting quantity used for ranking is called the retrieval status value (RSV) in this model:

RSV_d = log ∏_{t : x_t = q_t = 1} [ p_t (1 − u_t) / (u_t (1 − p_t)) ] = Σ_{t : x_t = q_t = 1} log [ p_t (1 − u_t) / (u_t (1 − p_t)) ]

Let p_t = P(x_t = 1 | R = 1, q) be the probability of a term appearing in a document relevant to the query, and u_t = P(x_t = 1 | R = 0, q) be the probability of a term appearing in a nonrelevant document.
These quantities can be visualized in the following contingency table, where the columns add to 1:

             relevant (R = 1)    nonrelevant (R = 0)
x_t = 1      p_t                 u_t
x_t = 0      1 − p_t             1 − u_t

  • Define c_t

c_t = log [ p_t / (1 − p_t) ] + log [ (1 − u_t) / u_t ] = log [ p_t (1 − u_t) / (u_t (1 − p_t)) ]

The c_t terms are log odds ratios for the terms in the query.
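A small sketch computing c_t from assumed values of p_t and u_t, confirming that the two forms of the log odds ratio above agree:

import math

p_t, u_t = 0.6, 0.1   # assumed probabilities for one query term

# c_t written as a sum of two log odds, and as a single log odds ratio.
c_t_sum = math.log(p_t / (1 - p_t)) + math.log((1 - u_t) / u_t)
c_t_ratio = math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))
print(round(c_t_sum, 6) == round(c_t_ratio, 6))   # True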

  • BIM Formula Derivation

Under the term-independence assumption, think of the probability of the document D (given relevance) as decomposing into a product over terms:

P(D | R = 1) = ∏_{t_i ∈ D} P(t_i | R = 1) · ∏_{t_i ∉ D} (1 − P(t_i | R = 1))

P(t_i | R = 1) is the probability that term t_i appears in a document, given that the document is relevant.

Note: 1 − P(t_i | R = 1) is the probability that t_i does not appear in a relevant document; the values P(t_i | R = 1) for different terms are probabilities of separate events, so they do not sum to 1.
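A minimal sketch of this independence decomposition, assuming a toy vocabulary and invented per-term probabilities P(t_i | R = 1):

# P(t_i|R=1) for each vocabulary term (illustrative values; they need not sum to 1).
p_rel = {"fox": 0.7, "quick": 0.4, "zebra": 0.05}
doc_terms = {"fox", "quick"}          # terms present in the document

prob = 1.0
for term, p in p_rel.items():
    # term present -> contribute P(t_i|R=1); absent -> contribute 1 - P(t_i|R=1)
    prob *= p if term in doc_terms else (1 - p)
print(prob)   # 0.7 * 0.4 * 0.95 = 0.266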

Probability estimates in theory

This is a contingency table of counts of documents in the collection, where df_t is the number of documents that contain term t:

             relevant     nonrelevant               total
x_t = 1      s            df_t − s                  df_t
x_t = 0      S − s        (N − df_t) − (S − s)      N − df_t
total        S            N − S                     N

p_t = s/S and u_t = (df_t − s)/(N − S), so

c_t = K(N, df_t, S, s) = log [ (s / (S − s)) / ((df_t − s) / ((N − df_t) − (S − s))) ]

To avoid the possibility of zeroes, it is fairly standard to add 1/2 to each of the quantities.

ĉ_t = log [ ((s + 1/2) / (S − s + 1/2)) / ((df_t − s + 1/2) / (N − df_t − S + s + 1/2)) ]
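A sketch of this smoothed estimate, computing ĉ_t directly from the contingency-table counts; the counts are invented for illustration.

import math

N, df_t = 1000, 100    # collection size and document frequency of t (assumed)
S, s = 20, 15          # relevant documents overall / containing t (assumed)

# Add 1/2 to every cell to avoid zero counts.
c_t = math.log(((s + 0.5) / (S - s + 0.5)) /
               ((df_t - s + 0.5) / (N - df_t - S + s + 0.5)))
print(c_t)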

Probability estimates in practice

Under the assumption that relevant documents are a very small percentage of the collection, it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection.

So the probability of term occurrence in nonrelevant documents for a query is u_t = df_t / N, and

log [ (1 − u_t) / u_t ] = log [ (N − df_t) / df_t ] ≈ log (N / df_t)

The probability p_t can be estimated in several ways:

  1. use the frequency of term occurrence in known relevant documents;
  2. use a constant, as in the combination match model;

For instance, we might assume that p_t is constant over all terms x_t in the query and that p_t = 0.5. This means that each term has even odds of appearing in a relevant document, and so the p_t and (1 − p_t) factors cancel out in the expression for RSV (see the sketch after this list).

  3. a much better estimate is found by simply estimating p_t from collection-level statistics about the occurrence of t, as p_t = df_t / N.
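A sketch of the simplification mentioned above: with p_t = 0.5 and u_t = df_t / N, the c_t weights reduce to (approximately) idf, so the RSV becomes a sum of idf-like weights over the query terms that occur in the document. The collection statistics below are invented.

import math

N = 10000                                   # assumed collection size
df = {"fox": 100, "quick": 2500}            # assumed document frequencies
query_terms_in_doc = ["fox", "quick"]       # query terms that occur in the document

rsv = 0.0
for t in query_terms_in_doc:
    p_t = 0.5                # constant assumption: p_t/(1-p_t) = 1, so it cancels
    u_t = df[t] / N          # nonrelevant documents approximated by the whole collection
    rsv += math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))   # = log((N - df_t)/df_t)
print(rsv)   # close to the sum of log(N/df_t) for rare terms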

Probabilistic approaches to relevance feedback

Probabilistic relevance feedback (RF) works as follows:

  1. Guess initial estimates of p_t and u_t (for example, using the estimates of the previous section).
  2. Use the current estimates of p_t and u_t to determine a best guess at the set of relevant documents, and present these candidate documents to the user.
  3. Interact with the user to refine the model: the user judges some subset V of the documents, which is partitioned into a judged-relevant set VR and a judged-nonrelevant set VNR.
  4. Reestimate p_t and u_t on the basis of the known relevant and nonrelevant documents.
  5. Repeat from step 2, generating a succession of approximations to the relevant document set and to p_t.

So, there are 5 steps.

  • Steps 1–4:

p_t = |VR_t| / |VR|,  u_t = (df_t − |VR_t|) / (N − |VR|)
where VR_t = {d ∈ VR : x_t = 1} is the set of judged-relevant documents containing term t.

In practice, these estimates need to be smoothed:

p_t = (|VR_t| + 1/2) / (|VR| + 1),  u_t = (df_t − |VR_t| + 1/2) / (N − |VR| + 1)

However, the set of documents judged by the user (V) is usually very small, and so the resulting statistical estimate is quite unreliable (noisy), even if the estimate is smoothed. So it is often better to combine the new information with the original guess in a process of Bayesian updating.

In this case we have:

p_t^(k+1) = (|VR_t| + κ · p_t^(k)) / (|VR| + κ)
where p_t^(k) is the k-th estimate of p_t and κ is the weight (number of pseudocounts) given to the prior estimate.
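A sketch of this Bayesian update, iterating the formula above on invented relevance-feedback counts:

# Bayesian updating of p_t from relevance feedback (illustrative numbers).
kappa = 5.0          # pseudocount weight given to the previous estimate
p_t = 0.5            # initial estimate p_t^(0)

# Each round: (judged-relevant docs containing t, total judged-relevant docs).
feedback_rounds = [(3, 10), (6, 12)]
for vr_t, vr in feedback_rounds:
    p_t = (vr_t + kappa * p_t) / (vr + kappa)   # p_t^(k+1)
    print(round(p_t, 3))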

  • Step 5: repeat the process from step 2, generating a succession of approximations to the relevant document set and hence to p_t, until the user is satisfied.

Major Assumptions

  • BIM
  1. a Boolean representation of documents/queries/relevance
  2. term independence
  3. terms not in the query don’t affect the outcome
  4. document relevance values are independent

Tree-structured dependencies between terms

Some of the assumptions of the BIM can be removed. For example, instead of assuming that terms are fully independent, we can allow a restricted form of term dependency arranged in a tree structure:

(Figure: a tree of dependencies between terms.)

A term x_i is directly dependent on a term x_k if there is an arrow x_k → x_i.

BM25

The simplest score for document d is just the idf weighting of the query terms present in the document:

RSV_d = Σ_{t ∈ q} log (N / df_t)

An alternative idf formulation is as follows:

idf_t = log [ (N − df_t + 1/2) / (df_t + 1/2) ]

If a term occurs in over half the documents in the collection, then this model gives a negative term weight, which is presumably undesirable.

BM25 improves on this by also taking into account the frequency of each term and the document length:

RSV_d = Σ_{t ∈ q} log (N / df_t) · [ (k_1 + 1) tf_td ] / [ k_1 ((1 − b) + b (L_d / L_ave)) + tf_td ]
Here tf_td is the frequency of term t in document d, L_d and L_ave are the length of document d and the average document length, k_1 is a positive tuning parameter that calibrates the document term-frequency scaling, and b (0 ≤ b ≤ 1) controls document length normalization.

For long queries, we can also apply similar frequency weighting to query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries:

RSV_d = Σ_{t ∈ q} log (N / df_t) · [ (k_1 + 1) tf_td ] / [ k_1 ((1 − b) + b (L_d / L_ave)) + tf_td ] · [ (k_3 + 1) tf_tq ] / [ k_3 + tf_tq ]
where tf_tq is the frequency of term t in the query q and k_3 calibrates the term-frequency scaling of the query.

If we have relevance judgments available, then we can use the full form of c_t in place of the approximation log(N/df_t) introduced above:

RSV_d = Σ_{t ∈ q} log [ ((|VR_t| + 1/2) / (|VNR_t| + 1/2)) / ((df_t − |VR_t| + 1/2) / (N − df_t − |VR| + |VR_t| + 1/2)) ] · [ (k_1 + 1) tf_td ] / [ k_1 ((1 − b) + b (L_d / L_ave)) + tf_td ] · [ (k_3 + 1) tf_tq ] / [ k_3 + tf_tq ]

  • Implement in Python
import math


class BM25(object):

    def __init__(self, docs):
        self.D = len(docs)                                            # number of documents
        self.avgdl = sum([len(doc)+0.0 for doc in docs]) / self.D     # average document length
        self.docs = docs
        self.f = []   # one dict per document, mapping each term to its frequency in that document
        self.df = {}  # document frequency: number of documents containing each term
        self.idf = {} # idf value of each term
        self.k1 = 1.5
        self.b = 0.75
        self.init()

    def init(self):
        for doc in self.docs:
            tmp = {}
            for word in doc:
                tmp[word] = tmp.get(word, 0) + 1  # count term frequencies within the document
            self.f.append(tmp)
            for k in tmp.keys():
                self.df[k] = self.df.get(k, 0) + 1
        for k, v in self.df.items():
            self.idf[k] = math.log(self.D-v+0.5)-math.log(v+0.5)      # BM25-style idf

    def sim(self, doc, index):
        # BM25 score of the document at position `index` for the query `doc` (a token list).
        score = 0
        for word in doc:
            if word not in self.f[index]:
                continue
            d = len(self.docs[index])
            score += (self.idf[word]*self.f[index][word]*(self.k1+1)
                      / (self.f[index][word]+self.k1*(1-self.b+self.b*d
                                                      / self.avgdl)))
        return score

    def simall(self, doc):
        # Score the query against every document in the collection.
        scores = []
        for index in range(self.D):
            score = self.sim(doc, index)
            scores.append(score)
        return scores

Code adapted from:
https://www.jianshu.com/p/1e498888f505
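A quick usage sketch of the class above on a toy tokenized corpus; the documents are made up for illustration.

# Example usage of the BM25 class defined above.
docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["a", "quick", "brown", "dog", "jumps"],
]
bm25 = BM25(docs)
print(bm25.simall(["fox", "jumps"]))   # one score per document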

Bayesian network approaches to information retrieval

The model decomposes into two parts: a document collection network and a query network.
The concepts are a thesaurus-based expansion of the terms appearing in the document.
The query network maps from query terms to query subexpressions and finally to the user's information need.
