Introduction to Information Retrieval, Chapter 9 notes (in English)

Relevance feedback and query expansion

Abstract

This chapter discusses methods for query refinement in an IR system, either fully automatically or with the user in the loop.

The methods for tackling this problem fall into two major classes: global methods and local methods.

Global methods are techniques for expanding or reformulating query terms independent of the query and the results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms.

  • Global methods:
  1. Query expansion/reformulation with a thesaurus or WordNet
  2. Query expansion via automatic thesaurus generation
  3. Techniques like spelling correction

Local methods adjust a query relative to the documents that initially appear to match the query.

  • Local methods:
  1. Relevance feedback
  2. Pseudorelevance feedback, also known as blind relevance feedback
  3. Indirect relevance feedback

Relevance feedback and pseudo relevance feedback

  • Relevance feedback

The idea of relevance feedback is to involve the user in the retrieval process so as to improve the final result set.

  • basic procedure

  1. The user issues a (short, simple) query.
  2. The system returns an initial set of retrieval results.
  3. The user marks some returned documents as relevant or nonrelevant.
  4. The system computes a better representation of the information need based on the user feedback.
  5. The system displays a revised set of retrieval results.

  • The Rocchio algorithm for relevance feedback

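We want to find a query vector, denoted $\overrightarrow{q}$, that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents.

If $C_r$ is the set of relevant documents and $C_{nr}$ is the set of nonrelevant documents, then we wish to find:

$$\overrightarrow{q_{opt}} = \arg\max_{\overrightarrow{q}} \left[ \mathrm{sim}(\overrightarrow{q}, C_r) - \mathrm{sim}(\overrightarrow{q}, C_{nr}) \right]$$

Under cosine similarity, the optimal query vector $\overrightarrow{q_{opt}}$ for separating the relevant and nonrelevant documents is:

$$\overrightarrow{q_{opt}} = \frac{1}{|C_r|} \sum_{\overrightarrow{d_j} \in C_r} \overrightarrow{d_j} - \frac{1}{|C_{nr}|} \sum_{\overrightarrow{d_j} \in C_{nr}} \overrightarrow{d_j}$$

That is, the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents.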

However, this optimal query cannot be computed directly: the whole point of searching is to find the relevant documents, so the full set of relevant documents is not known in advance.

In a real IR situation, we have a user query and only partial knowledge of the relevant and nonrelevant documents. The algorithm proposes using the modified query $\overrightarrow{q_m}$:

$$\overrightarrow{q_m} = \alpha \overrightarrow{q_0} + \beta \frac{1}{|D_r|} \sum_{\overrightarrow{d_j} \in D_r} \overrightarrow{d_j} - \gamma \frac{1}{|D_{nr}|} \sum_{\overrightarrow{d_j} \in D_{nr}} \overrightarrow{d_j}$$

where $\overrightarrow{q_0}$ is the original query vector; $D_r$ and $D_{nr}$ are the sets of known relevant and nonrelevant documents, respectively; and $\alpha$, $\beta$, and $\gamma$ are weights attached to each term.

The modified query starts from $\overrightarrow{q_0}$, moves some distance toward the centroid of the relevant documents, and moves some distance away from the centroid of the nonrelevant documents.
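The modified query described above can be sketched in a few lines of plain Python. This is only an illustration, not a reference implementation: the function names `centroid` and `rocchio`, the toy three-term vectors, and the default weights (α = 1, β = 0.75, γ = 0.15) are assumptions chosen for the example. Negative term weights are set to zero, which is a common convention.

```python
def centroid(vectors, dim):
    """Mean of a list of equal-length vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Modified query: alpha*q0 + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    dim = len(q0)
    c_rel = centroid(relevant, dim)
    c_non = centroid(nonrelevant, dim)
    qm = [alpha * q0[i] + beta * c_rel[i] - gamma * c_non[i] for i in range(dim)]
    # Assumption of this sketch: negative term weights are clipped to zero.
    return [max(0.0, w) for w in qm]


# Toy example over a hypothetical three-term vocabulary.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
nonrelevant = [[0.0, 0.0, 2.0]]
qm = rocchio(q0, relevant, nonrelevant)
print(qm)
```

In this toy case the second term, which appears only in the relevant documents, gains weight, while the third term, which appears only in the nonrelevant document, would become negative and is clipped to zero.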

Summary:
Relevance feedback can improve both precision and recall.

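  • Example of relevance feedback

Query:
New space satellite applications

The user marks some of the returned documents as relevant:
[Figure: initial results with the documents marked relevant by the user]

The system expands the query with weighted terms:
[Figure: expanded query terms and their weights]

The system returns the top results for the revised query:
[Figure: top-ranked results after relevance feedback]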

Probabilistic relevance feedback

If we know some relevant and nonrelevant documents, we can build a classifier instead of reweighting the terms of the query vector. One way to do this is with a Naive Bayes probabilistic model, in which the probability of a term t appearing in a document is estimated separately for relevant and nonrelevant documents. Writing $x_t = 1$ when t occurs in a document and $R$ for the relevance of the document:

$$\hat{P}(x_t = 1 \mid R = 1) = \frac{|VR_t|}{|VR|}$$

$$\hat{P}(x_t = 1 \mid R = 0) = \frac{df_t - |VR_t|}{N - |VR|}$$

where $N$ is the total number of documents, $df_t$ is the number of documents that contain t, $VR$ is the set of known relevant documents, and $VR_t$ is the subset of this set containing t.
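As a small worked example, the two estimates can be computed directly from the four counts defined above: the share of known relevant documents containing t, and the share of all remaining documents containing t. The function name `term_probabilities` and the numbers are made up for illustration. The sketch treats every document outside the known relevant set as nonrelevant and applies no smoothing, which are simplifying assumptions.

```python
def term_probabilities(n_docs, df_t, vr_size, vrt_size):
    """Estimate P(x_t=1 | R=1) and P(x_t=1 | R=0) from document counts.

    n_docs   -- N, total number of documents
    df_t     -- number of documents containing term t
    vr_size  -- |VR|, number of known relevant documents
    vrt_size -- |VR_t|, number of known relevant documents containing t
    """
    if vr_size == 0 or n_docs == vr_size:
        raise ValueError("need at least one relevant and one other document")
    p_rel = vrt_size / vr_size
    # Assumption: all documents outside VR are treated as nonrelevant.
    p_nonrel = (df_t - vrt_size) / (n_docs - vr_size)
    return p_rel, p_nonrel


# Hypothetical counts: 1000 documents, t occurs in 50 of them,
# 10 documents are known to be relevant and 4 of those contain t.
p_rel, p_nonrel = term_probabilities(1000, 50, 10, 4)
print(p_rel, p_nonrel)
```

Here t is far more likely to occur in a relevant document (0.4) than in one of the others (46/990), so it is a useful term for separating the two classes.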

When does relevance feedback work?

  1. Users need enough knowledge to create a good initial query.
  2. Relevance feedback requires the relevant documents to be similar to one another. The Rocchio
    relevance feedback model implicitly treats the relevant documents as a single cluster, because it works with the cluster centroid.
    If the relevant documents fall into several distinct subclasses, that is, they form multiple clusters in the vector space, the Rocchio method will not work well.
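The second point can be illustrated with a toy two-dimensional example. The vectors below are invented for the illustration: the relevant documents form two separate clusters, and their centroid ends up pointing between the clusters, so under cosine similarity it matches a document lying between them better than any of the truly relevant ones.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two hypothetical clusters of relevant documents, each dominated by a different term.
cluster_a = [[1.0, 0.0], [0.9, 0.1]]
cluster_b = [[0.0, 1.0], [0.1, 0.9]]
relevant = cluster_a + cluster_b

# The single centroid that Rocchio would move the query toward.
centroid = [sum(d[i] for d in relevant) / len(relevant) for i in range(2)]

# A document that belongs to neither cluster but lies between them.
in_between = [1.0, 1.0]

best_relevant = max(cosine(centroid, d) for d in relevant)
print(best_relevant, cosine(centroid, in_between))
```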

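Cases where RF alone is not sufficient include:

  1. Misspellings: if the user spells a term differently from every document in the collection, feedback on the results is unlikely to help.
  2. Cross-language information retrieval: documents in another language are not nearby in a vector space based on term distribution.
  3. Mismatch between the searcher's vocabulary and the collection vocabulary, for example searching for "laptop" when the documents all say "notebook computer".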

Web search engines rarely offer relevance feedback, because most users want to complete their search in a single interaction.

Evaluation of relevance feedback strategies

  • First idea

Start with an initial query $q_0$ and compute a precision–recall graph. After one round of feedback from the user, compute the modified query $q_m$ and again compute a precision–recall graph.

However, this comparison is unfair: the gains are partly due to the fact that known relevant documents (judged by the user) are now ranked higher.

  • Second idea

Use only the documents in the residual collection (the collection minus the documents the user has already assessed as relevant) for the second round of evaluation.

Unfortunately, the measured performance can then often be lower than for the original query. This is particularly the case if there are few relevant documents.

  • Third idea

Have two collections: one that is used for the initial query and relevance judgments, and a second that is then used for comparative evaluation.
The performance of both $q_0$ and $q_m$ can be validly compared on the second collection.
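The difference between evaluating on the full collection and on the residual collection can be seen in a small sketch. The document IDs, the two rankings, and the helper names `precision_at_k` and `residual` are invented for this example, and precision at k stands in for the full precision–recall graph.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)


def residual(ranking, assessed_relevant):
    """Ranking restricted to the residual collection (assessed-relevant documents removed)."""
    return [d for d in ranking if d not in assessed_relevant]


relevant = {"d1", "d2", "d5", "d7"}   # all truly relevant documents (hypothetical)
assessed = {"d1", "d2"}               # documents the user marked relevant in round one

initial = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]   # ranking for the original query
revised = ["d1", "d2", "d5", "d7", "d4", "d3", "d6", "d8"]   # ranking for the modified query

# First idea: compare on the full collection.
full_initial = precision_at_k(initial, relevant, 3)
full_revised = precision_at_k(revised, relevant, 3)

# Second idea: compare on the residual collection only.
res_initial = precision_at_k(residual(initial, assessed), relevant, 3)
res_revised = precision_at_k(residual(revised, assessed), relevant, 3)

print(full_initial, full_revised, res_initial, res_revised)
```

In this toy case the residual figures for both queries are lower than the full-collection figures, because the already-known relevant documents no longer count toward either ranking.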

Pseudo relevance feedback

Pseudo relevance feedback, also known as blind relevance feedback, automates the manual part of relevance feedback.

Unlike ordinary relevance feedback, the user does not need to provide any additional interaction.

  1. First, run normal retrieval to obtain an initial ranked set of results.
  2. Then, assume that the top k ranked documents are relevant.
  3. Finally, carry out relevance feedback as before under this assumption.
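The three steps can be sketched as follows. This is a minimal illustration under several assumptions: documents and queries are small dense term-weight vectors, ranking uses cosine similarity, the feedback step is a Rocchio-style update with no nonrelevant term, and the names `pseudo_relevance_feedback`, `k`, `alpha`, and `beta` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Document indices sorted by decreasing cosine similarity to the query."""
    return sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)


def pseudo_relevance_feedback(q0, docs, k=2, alpha=1.0, beta=0.75):
    # Step 1: normal retrieval.
    initial = rank(q0, docs)
    # Step 2: assume the top k documents are relevant.
    top_k = [docs[i] for i in initial[:k]]
    if not top_k:
        return list(q0), initial
    # Step 3: move the query toward the centroid of the assumed-relevant documents.
    dim = len(q0)
    centroid = [sum(d[j] for d in top_k) / len(top_k) for j in range(dim)]
    qm = [alpha * q0[j] + beta * centroid[j] for j in range(dim)]
    return qm, rank(qm, docs)


# Hypothetical three-term vocabulary; the query uses only the first term.
docs = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
q0 = [1.0, 0.0, 0.0]
qm, reranked = pseudo_relevance_feedback(q0, docs)
print(qm, reranked)
```

The third document shares no term with the original query, so its initial similarity is zero. After feedback the query also carries weight on the second term, and that document now receives a positive score.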

Indirect relevance feedback

Indirect relevance feedback, also called implicit relevance feedback, uses indirect sources of evidence rather than explicit user judgments as the basis of feedback.
The web search engine DirectHit introduced the idea of ranking more highly the documents that users chose to look at more often.
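One way to picture click-based feedback is to add a bonus derived from click counts to each document's retrieval score. The sketch below is purely hypothetical and is not DirectHit's actual method, which these notes do not describe: the function `rerank_with_clicks`, the logarithmic damping, and the `weight` parameter are all assumptions made for illustration.

```python
import math


def rerank_with_clicks(scores, clicks, weight=0.1):
    """Re-rank documents by retrieval score plus a damped click-count bonus.

    scores -- dict mapping document ID to its original retrieval score
    clicks -- dict mapping document ID to how often users opened it
    """
    def combined(doc_id):
        # log1p dampens the bonus so that very popular pages do not dominate completely.
        return scores[doc_id] + weight * math.log1p(clicks.get(doc_id, 0))

    return sorted(scores, key=combined, reverse=True)


# Hypothetical scores and click counts.
scores = {"a": 0.50, "b": 0.48, "c": 0.20}
clicks = {"b": 100, "c": 3}
order = rerank_with_clicks(scores, clicks)
print(order)
```

Document "b" scores slightly lower than "a" on content alone, but it has been opened far more often, so it moves to the top.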

Global methods for query reformulation

This section covers three global methods for expanding a query: simply aiding the user in doing so, using a manual thesaurus, and building a thesaurus automatically.

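  • What is query expansion?

In query expansion, the system suggests or adds terms related to the words of the query, rather than asking the user to judge documents. Let's look at an example from Yahoo!

[Figure: an example of query expansion in Yahoo! web search]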

  • Methods for building a thesaurus for query expansion include the following:
  1. Use of a controlled vocabulary that is maintained by human editors. An example is query expansion via the PubMed thesaurus:

[Figure: an example of query expansion via the PubMed thesaurus]

  2. A manual thesaurus.
  3. An automatically derived thesaurus.
  4. Query reformulations based on query log mining.
  • Advantages (thesaurus-based query expansion)

Thesaurus-based query expansion does not require any user input. Query expansion generally increases recall and is widely used in many science and engineering fields.

Automatic thesaurus generation

As an alternative to the cost of a manual thesaurus, we could attempt to generate a thesaurus automatically by analyzing a collection of documents.

There are two main approaches.

  1. One is simply to exploit word co-occurrence.
  2. The other is to use a shallow grammatical analysis of the text and to exploit grammatical relations or grammatical dependencies.
  • Example of an automatically generated thesaurus

[Figure: an example of an automatically generated thesaurus]
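The co-occurrence approach can be sketched with a tiny term-document matrix. Each term is represented by its length-normalized row of per-document counts, and the similarity between two terms is the dot product of their rows, so terms that tend to appear in the same documents score highly. The function name `build_thesaurus`, the `top_n` parameter, and the four toy documents are assumptions made for this illustration.

```python
import math


def build_thesaurus(docs, top_n=2):
    """Map each term to its most similar terms by document co-occurrence.

    docs -- list of documents, each a list of tokens
    """
    vocab = sorted({t for d in docs for t in d})
    # One row per term: how often the term occurs in each document.
    rows = {t: [float(d.count(t)) for d in docs] for t in vocab}
    # Length-normalize each row so that frequent terms do not dominate.
    for t, row in rows.items():
        norm = math.sqrt(sum(x * x for x in row))
        rows[t] = [x / norm for x in row]

    thesaurus = {}
    for u in vocab:
        sims = [(sum(a * b for a, b in zip(rows[u], rows[v])), v)
                for v in vocab if v != u]
        # Highest similarity first; ties broken alphabetically.
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        thesaurus[u] = [v for s, v in sims[:top_n] if s > 0]
    return thesaurus


# Hypothetical toy collection.
docs = [
    ["car", "engine", "wheel"],
    ["automobile", "engine", "wheel"],
    ["banana", "fruit"],
    ["car", "automobile", "engine"],
]
thesaurus = build_thesaurus(docs)
print(thesaurus["car"], thesaurus["banana"])
```

Note that co-occurrence finds related terms rather than strict synonyms: in this toy collection "car" is most strongly associated with "engine".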

All in all, the most common form of query expansion is global analysis using some form of thesaurus. For each term t in the query, the query can be automatically expanded with synonyms and related words of t from the thesaurus. Overall, query expansion is less successful than relevance feedback, although it may be as good as pseudo relevance feedback. It does, however, have the advantage of being much more understandable to the system user.
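Thesaurus-based expansion itself is simple to sketch: look up each query term and add its related words, here with a lower weight than the original terms. The tiny thesaurus, the function name `expand_query`, and the weight of 0.5 are made up for the example.

```python
def expand_query(query_terms, thesaurus, weight=0.5):
    """Return a dict of term weights: original terms at 1.0, added terms at `weight`."""
    expanded = {t: 1.0 for t in query_terms}
    for t in query_terms:
        for related in thesaurus.get(t, []):
            # Do not overwrite a term that is already present in the query.
            expanded.setdefault(related, weight)
    return expanded


# Hypothetical hand-made thesaurus.
thesaurus = {
    "cancer": ["neoplasms", "tumor"],
    "skin": ["dermal"],
}
expanded = expand_query(["skin", "cancer"], thesaurus)
print(expanded)
```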

These are some of the notes I took while studying Manning's book.
Let's encourage each other to keep learning.
