Probabilistic information retrieval
Abstract
This chapter mainly introduces the probabilistic approach to information retrieval, which provides a different formal basis for a retrieval model and results in different techniques for setting term weights.
Basic probability theory
- Chain rule
- Partition rule
- Bayes' rule
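For reference, the three rules in their standard two-event forms (these are textbook probability identities, stated here for completeness):

$$ P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \qquad \text{(chain rule)} $$

$$ P(B) = \sum_i P(B \mid A_i)\,P(A_i), \quad \{A_i\} \text{ a partition of the sample space} \qquad \text{(partition rule)} $$

$$ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad \text{(Bayes' rule)} $$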
PRP
A probabilistic model is used to estimate the probability of relevance P(R = 1|d, q) for each document–query pair, and the results are then ranked by this probability.
In other words, if documents are returned ranked by their probability of relevance to the query, and these probabilities are estimated as accurately as possible from the available data, then the returned results are the best obtainable.
0/1 loss case
If a nonrelevant document is returned, or a relevant document fails to be returned, one point is lost (such a binary-valued accuracy setting is often called 1/0 loss).
The goal of retrieval is then, for any given k, to return the k documents with the highest probability of relevance. That is, the PRP ranks all documents in decreasing order of P(R = 1|d, q).
When an unordered document set is returned instead of a ranking, the decision can be made with the Bayes optimal decision rule, which minimizes the expected loss: return exactly those documents that are more likely to be relevant than nonrelevant.
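In symbols, the Bayes optimal decision rule for this 1/0 loss case is:

$$ d \text{ is returned} \iff P(R = 1 \mid d, q) > P(R = 0 \mid d, q) $$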
Retrieval costs
Let C1 be the cost of retrieving a relevant document and C0 the cost of retrieving a nonrelevant document. The PRP then says that if, for a specific document d and for all documents d′ not yet retrieved,

$$ C_1 \cdot P(R = 1 \mid d) + C_0 \cdot P(R = 0 \mid d) \le C_1 \cdot P(R = 1 \mid d') + C_0 \cdot P(R = 0 \mid d') $$

then d is the next document to be retrieved.
This gives a formal framework in which we can model differential costs of false positives and false negatives, and even system performance issues, at the modeling stage.
BIM
Binary Independence Model.
The BIM computes the conditional probability P(R = 1|x, q), where the document d is represented by its binary term incidence vector x, by expanding it with Bayes' rule.
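Concretely, the expansion (standard in the BIM derivation) is:

$$ P(R=1 \mid \vec{x}, q) = \frac{P(\vec{x} \mid R=1, q)\,P(R=1 \mid q)}{P(\vec{x} \mid q)}, \qquad P(R=0 \mid \vec{x}, q) = \frac{P(\vec{x} \mid R=0, q)\,P(R=0 \mid q)}{P(\vec{x} \mid q)} $$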
Here, P(x|R = 1, q) and P(x|R = 0, q) are the probability that if a relevant or nonrelevant, respectively, document is retrieved, then that document’s representation is x.
Note:
- Define Ranking Function RSV(Q, D)
The resulting quantity used for ranking is called the retrieval status value (RSV) in this model:
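Following the standard BIM derivation (with p_t and u_t as defined just below), the RSV is:

$$ RSV_d = \sum_{t:\,x_t = q_t = 1} \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)} $$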
Let pt = P(xt = 1|R = 1, q) be the probability of a term appearing in a document relevant to the query, and ut = P(xt = 1|R = 0, q) be the probability of a term appearing in a nonrelevant document.
These quantities can be visualized in the following contingency table where the columns add to 1:
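Reconstructed from the definitions of p_t and u_t above:

|                      | R = 1 (relevant) | R = 0 (nonrelevant) |
| -------------------- | ---------------- | ------------------- |
| term present, xt = 1 | pt               | ut                  |
| term absent, xt = 0  | 1 − pt           | 1 − ut              |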
- Define ct
The ct terms are log odds ratios for the terms in the query.
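That is:

$$ c_t = \log\frac{p_t}{1 - p_t} + \log\frac{1 - u_t}{u_t} = \log\frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}, \qquad RSV_d = \sum_{t:\,x_t = q_t = 1} c_t $$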
- BIM Formula Derivation
Think of the probability of observing D as this formula:
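In this document's notation, a natural form of this factorization (assuming the standard Bernoulli model over term presence and absence) is:

$$ P(D \mid R=1) = \prod_{t_i \in D} P(t_i \mid R=1) \prod_{t_i \notin D} \big(1 - P(t_i \mid R=1)\big) $$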
P(ti|R=1) is the probability of ti appearing in the document in the relevant case.
Note: the complement 1 − P(ti|R=1) is the probability of ti not appearing in a relevant document; each P(ti|R=1) is a per-term (Bernoulli) probability, so the values P(ti|R=1) do not sum to 1 across terms.
Probability estimates in theory
This is a contingency table of counts of documents in the collection, where df_t is the number of documents that contain term t:
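With S the number of relevant documents and s the number of relevant documents containing t, the standard table is:

|                      | relevant | nonrelevant         | total    |
| -------------------- | -------- | ------------------- | -------- |
| term present, xt = 1 | s        | dft − s             | dft      |
| term absent, xt = 0  | S − s    | (N − dft) − (S − s) | N − dft  |
| total                | S        | N − S               | N        |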
pt = s/S and ut = (dft − s)/(N − S)
To avoid the possibility of zeroes, it is fairly standard to add 1/2 to each of the quantities.
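With this smoothing (adding 1/2 to the four central cells and adjusting the marginals), the estimates become:

$$ p_t = \frac{s + \frac{1}{2}}{S + 1}, \qquad u_t = \frac{df_t - s + \frac{1}{2}}{N - S + 1} $$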
Probability estimates in practice
Under the assumption that relevant documents are a very small percentage of the collection, it is plausible to approximate statistics for nonrelevant documents by statistics from the whole collection.
so the probability of term occurrence in nonrelevant documents for a query, ut, is df_t/N, and then log[(1 − ut)/ut] = log[(N − df_t)/df_t] ≈ log(N/df_t), which recovers the familiar idf weight.
The quantity pt can be estimated in several ways:
- use the frequency of term occurrence in known relevant documents
- use a constant, as in the combination match model (described below)
For instance, we might assume that pt is constant over all terms xt in the query and that pt = 0.5. This means that each term has even odds of appearing in a relevant document, and so the pt and (1 − pt) factors cancel out in the expression for RSV.
- a much better estimate is found by simply estimating pt from collection-level statistics about the occurrence of t, as pt = df_t/N.
Probabilistic approaches to relevance feedback
Probabilistic relevance feedback (RF) works as an iterative process with five steps.
- Steps 1-4:
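Following the standard BIM relevance feedback procedure, the first four steps are:

1. Guess initial estimates of pt and ut (for example, pt = 1/2 and ut = df_t/N).
2. Use the current estimates of pt and ut to rank documents, and present the current best guess V at the relevant set to the user.
3. The user judges some subset of V, partitioning it into a relevant set VR and a nonrelevant set VNR.
4. Re-estimate pt and ut from the judged documents, e.g. pt = |VRt| / |VR|, where VRt is the set of documents in VR that contain term t.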
In practice, these estimates need smoothing; a common choice is pt = (|VRt| + 1/2) / (|VR| + 1).
However, the set of documents judged by the user (V) is usually very small, and so the resulting statistical estimate is quite unreliable (noisy), even if the estimate is smoothed. So it is often better to combine the new information with the original guess in a process of Bayesian updating.
In this case we have:
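Following the standard Bayesian updating formula, with p_t^{(k)} the k-th estimate of pt and κ the weight given to the prior:

$$ p_t^{(k+1)} = \frac{|VR_t| + \kappa\,p_t^{(k)}}{|VR| + \kappa} $$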
- Step 5: repeat the whole process of estimating and ranking, iterating until the user is satisfied or the estimates converge.
Major Assumptions
- BIM
- a Boolean representation of documents/queries/relevance
- term independence
- terms not in the query don’t affect the outcome
- document relevance values are independent
Tree-structured dependencies between terms
Some of the assumptions of the BIM can be removed. For example, the term independence assumption can be relaxed into tree-structured dependencies between terms: a term xi is directly dependent on a term xk if there is an arrow xk → xi.
BM25
The simplest score for document d is just the idf weighting of the query terms present:
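In its standard form:

$$ RSV_d = \sum_{t \in q} \log \frac{N}{df_t} $$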
An alternative idf formulation is as follows:
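This is the smoothed (RSJ-style) idf:

$$ RSV_d = \sum_{t \in q} \log \frac{N - df_t + \frac{1}{2}}{df_t + \frac{1}{2}} $$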
If a term occurs in over half the documents in the collection, then this model gives a negative term weight, which is presumably undesirable.
Factoring in the frequency of each term and the document length gives the core BM25 weighting:
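With tf_td the term's frequency in d, L_d and L_ave the document and average document lengths, and tuning parameters k1 (term-frequency scaling) and b (length normalization), the standard form is:

$$ RSV_d = \sum_{t \in q} \log\left(\frac{N}{df_t}\right) \cdot \frac{(k_1 + 1)\,tf_{td}}{k_1\left((1 - b) + b \cdot \frac{L_d}{L_{ave}}\right) + tf_{td}} $$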
For long queries, an analogous factor (k3 + 1)·tf_tq / (k3 + tf_tq) can also be applied to the query term frequency tf_tq. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.
If we have relevance judgments available, we can use the full form in place of the approximation log(N/dft) introduced above:
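Following the standard full form (with VR and VNR the judged relevant and nonrelevant document sets, and VRt the documents in VR containing t):

$$ RSV_d = \sum_{t \in q} \log \frac{\big(|VR_t| + \frac{1}{2}\big) \,/\, \big(|VNR_t| + \frac{1}{2}\big)}{\big(df_t - |VR_t| + \frac{1}{2}\big) \,/\, \big(N - df_t - |VR| + |VR_t| + \frac{1}{2}\big)} \cdot \frac{(k_1 + 1)\,tf_{td}}{k_1\left((1 - b) + b \cdot \frac{L_d}{L_{ave}}\right) + tf_{td}} $$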
- Implementation in Python
```python
import math


class BM25(object):
    def __init__(self, docs):
        self.D = len(docs)  # number of documents in the collection
        self.avgdl = sum(len(doc) for doc in docs) / self.D  # average document length
        self.docs = docs
        self.f = []    # one dict per document: term -> frequency in that document
        self.df = {}   # term -> number of documents containing the term
        self.idf = {}  # term -> idf value
        self.k1 = 1.5
        self.b = 0.75
        self.init()

    def init(self):
        for doc in self.docs:
            tmp = {}
            for word in doc:
                tmp[word] = tmp.get(word, 0) + 1  # term frequency within this document
            self.f.append(tmp)
            for k in tmp.keys():
                self.df[k] = self.df.get(k, 0) + 1
        for k, v in self.df.items():
            # RSJ-style smoothed idf: log((N - df + 0.5) / (df + 0.5))
            self.idf[k] = math.log(self.D - v + 0.5) - math.log(v + 0.5)

    def sim(self, doc, index):
        # BM25 score of document `index` for the tokenized query `doc`
        score = 0
        for word in doc:
            if word not in self.f[index]:
                continue
            d = len(self.docs[index])
            score += (self.idf[word] * self.f[index][word] * (self.k1 + 1)
                      / (self.f[index][word]
                         + self.k1 * (1 - self.b + self.b * d / self.avgdl)))
        return score

    def simall(self, doc):
        # score the query against every document in the collection
        scores = []
        for index in range(self.D):
            scores.append(self.sim(doc, index))
        return scores
```
Code adapted from: https://www.jianshu.com/p/1e498888f505
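A quick usage sketch (the toy corpus and query below are made up for illustration; documents must be pre-tokenized into lists of words):

```python
# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["probabilistic", "information", "retrieval"],
    ["bm25", "is", "a", "probabilistic", "ranking", "function"],
    ["vector", "space", "models", "use", "tf", "idf", "weights"],
]

bm25 = BM25(corpus)
query = ["probabilistic", "ranking"]
print(bm25.simall(query))  # one BM25 score per document; highest = best match
```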
Bayesian network approaches to information retrieval
The model decomposes into two parts: a document collection network and a query network.
The concepts are a thesaurus-based expansion of the terms appearing in the document.
The query network maps from query terms to query sub-expressions to the user's information need.