Introduction to Information Retrieval, Chapter 9 notes (in English): pseudo-relevance feedback - CSDN Blog

Article link: https://blog.csdn.net/qq_40742298/article/details/108006149

Relevance feedback and query expansion

Abstract

This chapter discusses methods for query refinement in an IR system, either fully automatically or with the user in the loop.

These methods fall into two major classes: global methods and local methods.

Global methods are techniques for expanding or reformulating query terms independent of the query and the results returned from it, so that changes in the query wording cause the new query to match other semantically similar terms.

Global methods:

Query expansion/reformulation with a thesaurus or WordNet
Query expansion via automatic thesaurus generation
Techniques like spelling correction

Local methods adjust a query relative to the documents that initially appear to match the query.

Relevance feedback
Pseudorelevance feedback, also known as blind relevance feedback
Indirect relevance feedback

Relevance feedback and pseudo relevance feedback

Relevance feedback

Relevance feedback involves the user in the retrieval process so as to improve the final result set.

Basic procedure

The user issues a (short, simple) query
The system returns an initial set of retrieval results
The user marks some returned documents as relevant or nonrelevant
The system computes a better representation of the information need based on the user feedback
The system displays a revised set of retrieval results

The Rocchio algorithm for relevance feedback

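We want to find a query vector, denoted $\overrightarrow{q}$, that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents.

If $C_r$ is the set of relevant documents and $C_{nr}$ is the set of nonrelevant documents, then we wish to find:

$$\overrightarrow{q}_{opt} = \arg\max_{\overrightarrow{q}} \left[ \mathrm{sim}(\overrightarrow{q}, C_r) - \mathrm{sim}(\overrightarrow{q}, C_{nr}) \right]$$

Under cosine similarity, the optimal query vector $\overrightarrow{q}_{opt}$ for separating the relevant and nonrelevant documents is:

$$\overrightarrow{q}_{opt} = \frac{1}{|C_r|} \sum_{\overrightarrow{d_j} \in C_r} \overrightarrow{d_j} - \frac{1}{|C_{nr}|} \sum_{\overrightarrow{d_j} \in C_{nr}} \overrightarrow{d_j}$$

That is, the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents.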


However, this optimal query cannot be computed directly: the purpose of searching is to find the relevant documents, so the full set of relevant documents is not known in advance.

In a real IR situation, we have a user query and only partial knowledge of which documents are relevant and nonrelevant. The algorithm proposes using the modified query $\overrightarrow{q_m}$:

$$\overrightarrow{q_m} = \alpha \overrightarrow{q_0} + \beta \frac{1}{|D_r|} \sum_{\overrightarrow{d_j} \in D_r} \overrightarrow{d_j} - \gamma \frac{1}{|D_{nr}|} \sum_{\overrightarrow{d_j} \in D_{nr}} \overrightarrow{d_j}$$

where $\overrightarrow{q_0}$ is the original query vector; $D_r$ and $D_{nr}$ are the sets of known relevant and nonrelevant documents, respectively; and $\alpha$, $\beta$, and $\gamma$ are weights attached to each term.

The modified query starts from $\overrightarrow{q_0}$, moves some distance toward the centroid of the relevant documents, and moves some distance away from the centroid of the nonrelevant documents.
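
The Rocchio update can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: documents and queries are assumed to be sparse term-to-weight dictionaries, the default weights (alpha=1.0, beta=0.75, gamma=0.15) are assumed values chosen for the example, and dropping terms whose weight becomes non-positive is a common convention that is also an assumption here. The names `centroid` and `rocchio` are hypothetical.

```python
from typing import Dict, List

Vector = Dict[str, float]  # sparse vector: term -> weight (assumed representation)


def centroid(docs: List[Vector]) -> Vector:
    """Average a list of sparse vectors; an empty list gives an empty vector."""
    total: Vector = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in total.items()}


def rocchio(q0: Vector, relevant: List[Vector], nonrelevant: List[Vector],
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> Vector:
    """Return alpha*q0 + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    rel_centroid = centroid(relevant)
    nonrel_centroid = centroid(nonrelevant)
    modified: Vector = {}
    for term in set(q0) | set(rel_centroid) | set(nonrel_centroid):
        weight = (alpha * q0.get(term, 0.0)
                  + beta * rel_centroid.get(term, 0.0)
                  - gamma * nonrel_centroid.get(term, 0.0))
        if weight > 0:  # assumption: terms with non-positive weight are dropped
            modified[term] = weight
    return modified


q0 = {"satellite": 1.0, "space": 1.0}
relevant = [{"satellite": 2.0, "nasa": 2.0}, {"satellite": 2.0, "launch": 4.0}]
nonrelevant = [{"space": 8.0, "bar": 2.0}]
qm = rocchio(q0, relevant, nonrelevant)
```

In this toy run the modified query gains the terms "nasa" and "launch" from the relevant documents, while "space" is pushed below zero by the nonrelevant document and disappears from the query.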

Summary:
Relevance feedback can improve both precision and recall.

Example of relevance feedback

Query:
New space satellite applications

The user marks some of the returned documents as relevant.
The system expands the query with weighted terms.
The system returns the top results for the expanded query.

Probabilistic relevance feedback

If we know some relevant and nonrelevant documents, we can build a classifier instead of reweighting the query vector. One way to do this is with a Naive Bayes probabilistic model, in which the probability of term t appearing in a document is estimated separately for relevant and nonrelevant documents.

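$$\hat{P}(x_t = 1 \mid R = 1) = \frac{|VR_t|}{|VR|}$$

$$\hat{P}(x_t = 1 \mid R = 0) = \frac{df_t - |VR_t|}{N - |VR|}$$

where $x_t = 1$ means that term $t$ occurs in the document, $R$ indicates whether the document is relevant, $N$ is the total number of documents, $df_t$ is the number of documents that contain $t$, $VR$ is the set of known relevant documents, and $VR_t$ is the subset of this set containing $t$.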
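
As a small illustration, these estimates can be computed directly from a toy collection. The sketch below assumes the estimates are |VR_t| / |VR| for relevant documents and (df_t - |VR_t|) / (N - |VR|) for nonrelevant ones, with no smoothing, so it needs at least one known relevant document and at least one document outside that set. The function name and the document representation (a dictionary from document id to a set of terms) are hypothetical.

```python
from typing import Dict, Set, Tuple


def estimate_term_probabilities(docs: Dict[str, Set[str]],
                                known_relevant: Set[str],
                                term: str) -> Tuple[float, float]:
    """Estimate P(term present | relevant) and P(term present | nonrelevant)."""
    n_docs = len(docs)                                          # N
    df_t = sum(1 for terms in docs.values() if term in terms)   # df_t
    vr = len(known_relevant)                                    # |VR|
    vr_t = sum(1 for doc_id in known_relevant if term in docs[doc_id])  # |VR_t|
    p_relevant = vr_t / vr
    # Every document outside VR is treated as nonrelevant (an approximation).
    p_nonrelevant = (df_t - vr_t) / (n_docs - vr)
    return p_relevant, p_nonrelevant


docs = {
    "d1": {"satellite", "space"},
    "d2": {"satellite"},
    "d3": {"space"},
    "d4": {"satellite", "launch"},
    "d5": {"launch"},
}
p_rel, p_nonrel = estimate_term_probabilities(docs, {"d1", "d2"}, "satellite")
```

Here "satellite" occurs in both known relevant documents and in one of the three remaining documents, so the first estimate is much larger than the second.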

When does relevance feedback work

Users need enough knowledge to create a good initial query.
Relevance feedback requires relevant documents to be similar to each other. The Rocchio relevance feedback model implicitly treats the relevant documents as a single cluster by computing the cluster centroid.
If the relevant documents include several different subclasses, that is, they form multiple clusters in the vector space, the Rocchio method will not work well.

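Cases where RF alone is not sufficient include:

Misspellings.
Cross-language information retrieval.
Mismatch between the vocabulary of the searcher and the vocabulary of the collection.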

Web search rarely uses relevance feedback, because most users want to complete their search in a single interaction.

Evaluation of relevance feedback strategies

First idea

Start with an initial query $q_0$ and compute a precision-recall graph. After one round of feedback from the user, compute the modified query $q_m$ and again compute a precision-recall graph.

However, the gains are partly due to the fact that known relevant documents (judged by the user) are now ranked higher.

Second idea

Use only the documents in the residual collection (the set of documents minus those the user has already assessed as relevant) for the second round of evaluation.

Unfortunately, the measured performance can then often be lower than for the original query. This is particularly the case if there are few relevant documents.
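
Evaluation on the residual collection can be sketched as follows. This is a hedged illustration: precision at k stands in for a full precision-recall graph, and the function name and the `assessed` parameter (the documents to remove from the collection before scoring) are hypothetical. Which documents count as assessed is left to the caller.

```python
from typing import List, Set


def residual_precision_at_k(ranking: List[str], relevant: Set[str],
                            assessed: Set[str], k: int) -> float:
    """Precision at k computed only over documents that were not assessed."""
    residual = [doc_id for doc_id in ranking if doc_id not in assessed]
    top = residual[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc_id in top if doc_id in relevant)
    # Divide by the number of documents actually inspected, which can be
    # smaller than k when the residual ranking is short.
    return hits / len(top)


relevant = {"d1", "d2", "d3"}
assessed = {"d1", "d4"}  # documents the user already looked at
ranking_q0 = ["d1", "d4", "d2", "d5", "d3", "d6"]  # ranking for the original query
ranking_qm = ["d1", "d2", "d3", "d4", "d5", "d6"]  # ranking for the modified query
before = residual_precision_at_k(ranking_q0, relevant, assessed, k=2)
after = residual_precision_at_k(ranking_qm, relevant, assessed, k=2)
```

Because the assessed documents are removed from both rankings, the modified query gets no credit for simply moving an already judged document to the top.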

Third idea

Have two collections, one that is used for the initial query and relevance judgments, and the second that is then used for comparative evaluation.
The performance of both q0 and qm can be validly compared on the second collection.

Pseudo relevance feedback

Pseudo relevance feedback, also known as blind relevance feedback, automates the manual part of relevance feedback.

Unlike ordinary relevance feedback, it therefore needs no additional interaction from the user.

First, normal retrieval is carried out to find an initial set of most relevant documents.
Then, the top k ranked documents are assumed to be relevant.
Finally, relevance feedback is carried out as before under this assumption.
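
The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: documents are sparse term-to-weight dictionaries, a plain dot product stands in for the real ranking function, the feedback step is a Rocchio-style update without a nonrelevant term (no document is marked nonrelevant), and the values of k, alpha and beta are arbitrary. All function names are hypothetical.

```python
from typing import Dict, List

Vector = Dict[str, float]  # sparse vector: term -> weight (assumed representation)


def score(query: Vector, doc: Vector) -> float:
    """Dot product between two sparse vectors (stand-in for a real scorer)."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())


def rank(query: Vector, docs: Dict[str, Vector]) -> List[str]:
    """Document ids by descending score; ties are broken by id for determinism."""
    return sorted(docs, key=lambda doc_id: (-score(query, docs[doc_id]), doc_id))


def pseudo_relevance_feedback(query: Vector, docs: Dict[str, Vector],
                              k: int = 2, alpha: float = 1.0,
                              beta: float = 0.75) -> List[str]:
    """Retrieve, assume the top k results are relevant, expand, retrieve again."""
    top_k = rank(query, docs)[:k]
    expanded = {term: alpha * weight for term, weight in query.items()}
    for doc_id in top_k:
        for term, weight in docs[doc_id].items():
            # Add beta times the centroid of the assumed-relevant documents.
            expanded[term] = expanded.get(term, 0.0) + beta * weight / len(top_k)
    return rank(expanded, docs)


docs = {
    "d1": {"jaguar": 1.0, "car": 1.0},
    "d2": {"jaguar": 1.0, "car": 1.0, "engine": 1.0},
    "d3": {"car": 1.0, "engine": 1.0},
    "d4": {"cat": 1.0, "zoo": 1.0},
}
first_pass = rank({"jaguar": 1.0}, docs)
second_pass = pseudo_relevance_feedback({"jaguar": 1.0}, docs)
```

In this toy collection, "d3" does not contain the query term at all, yet after the feedback round it scores above "d4" because it shares "car" and "engine" with the top-ranked documents. The same mechanism can cause query drift when the top k documents are not actually relevant.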

Indirect relevance feedback

Indirect relevance feedback uses indirect sources of evidence rather than explicit feedback as the basis for feedback. It is also called implicit relevance feedback.
The web search engine DirectHit introduced the idea of ranking a document higher the more often users chose to look at it.
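
A click-based ranking of this kind can be sketched very simply. The snippet below is an illustration of the general idea only and makes no claim about how DirectHit actually worked: it assumes a click log that is just a sequence of clicked document ids, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def rerank_by_clicks(results: List[str], click_log: Iterable[str]) -> List[str]:
    """Order results by how often each document was clicked, most clicked first."""
    clicks = Counter(click_log)  # missing documents count as zero clicks
    # sorted() is stable, so documents with equal click counts keep
    # their original relative order.
    return sorted(results, key=lambda doc_id: -clicks[doc_id])


results = ["a", "b", "c"]
click_log = ["c", "c", "b", "c", "x"]
reranked = rerank_by_clicks(results, click_log)
```

A real system would more likely combine such click evidence with the original retrieval score instead of replacing it.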

Global methods for query reformulation

This section covers three global methods for expanding a query: simply aiding the user in doing so, using a manual thesaurus, and building a thesaurus automatically.

What is query expansion?

Let us look at an example from Yahoo!:

[Figure: an example of query expansion in Yahoo!]

Methods for building a thesaurus for query expansion include the following:

Use of a controlled vocabulary that is maintained by human editors, for example query expansion via the PubMed thesaurus.
A manual thesaurus.
An automatically derived thesaurus.
Query reformulations based on query log mining.

Advantages of thesaurus-based query expansion

It does not require any user input. Query expansion generally increases recall and is widely used in many science and engineering fields.

Automatic thesaurus generation

As an alternative to the cost of a manual thesaurus, we could attempt to generate a thesaurus automatically by analyzing a collection of documents.

There are two main approaches.

One is simply to exploit word co-occurrence.
The other is to use a shallow grammatical analysis of the text and to exploit grammatical relations or grammatical dependencies.
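
The co-occurrence approach can be sketched with a small counting routine. This is a minimal illustration under simplifying assumptions: co-occurrence is counted at the document level, raw counts are used without any weighting or length normalization, and each term is linked to its `top_n` most frequent neighbours. The function name and parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List


def build_cooccurrence_thesaurus(docs: List[List[str]],
                                 top_n: int = 2) -> Dict[str, List[str]]:
    """Map each term to the terms it most often shares a document with."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        # Use the set of terms so each pair is counted once per document.
        for a, b in combinations(sorted(set(doc)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    thesaurus: Dict[str, List[str]] = {}
    for term, neighbours in counts.items():
        # Most frequent neighbours first; ties are broken alphabetically.
        ranked = sorted(neighbours, key=lambda other: (-neighbours[other], other))
        thesaurus[term] = ranked[:top_n]
    return thesaurus


docs = [
    ["car", "engine", "wheel"],
    ["car", "engine"],
    ["car", "wheel", "road"],
    ["cat", "zoo"],
]
thesaurus = build_cooccurrence_thesaurus(docs)
```

Terms that co-occur are related but not necessarily synonyms, which is one reason automatically generated thesauri tend to be noisier than manual ones.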

Example of an automatically generated thesaurus:

[Figure: an automatically generated thesaurus]

In summary, the most common form of query expansion relies on some kind of thesaurus. For each term t in the query, the query is automatically expanded with synonyms and related words of t from the thesaurus. Overall, query expansion is less successful than RF, although it may be as good as pseudo RF. It does, however, have the advantage of being much more understandable to the system user.
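
Thesaurus-based expansion itself is straightforward to sketch. The example below assumes the thesaurus is a plain dictionary from a term to its related terms, and it gives added terms a lower weight than the original query terms. The weight of 0.5, the toy thesaurus entries and the function name are all assumptions made for illustration.

```python
from typing import Dict, List


def expand_query(query_terms: List[str], thesaurus: Dict[str, List[str]],
                 related_weight: float = 0.5) -> Dict[str, float]:
    """Return term weights for the query plus thesaurus neighbours of each term."""
    expanded = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for related in thesaurus.get(term, []):
            # setdefault keeps the weight of a term that is already present,
            # so original query terms are never down-weighted.
            expanded.setdefault(related, related_weight)
    return expanded


toy_thesaurus = {  # hypothetical entries
    "cancer": ["neoplasms", "tumor"],
    "car": ["automobile"],
}
expanded = expand_query(["skin", "cancer"], toy_thesaurus)
```

The expanded term weights can then be handed to the ranking function in place of the original query.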