Notes on Introduction to Information Retrieval, Chapter 7 (in English)

Abstract

  • Speed up the scoring algorithm by using heuristic methods.

This trades a small loss in the quality of the query-matching scores for a lower time complexity.

  • Start to consider the more general problem of computing scores in a search engine.

  • Outline a complete search engine, including the indexes and structures needed to support not only cosine scoring but also more general ranking factors, and show how these pieces fit together.

  • Discuss how the vector space model for free-text queries interacts with common query operators.

Efficient scoring and ranking

In the scoring and ranking methods so far, we compute the cosine score between the query and every document, and then extract the K documents with the highest scores, which is expensive.

query: q = jealous gossip

  • The unit vector v(q) has only two nonzero components.
  • In the absence of any weighting for query terms, these nonzero components are equal.

In fact, if there is an algorithm that only approximates the top K documents but has much lower complexity (it does not need to compute scores for all documents), we usually prefer it.
General method: first find a document subset A (far smaller than the full collection) that is likely to contain most of the candidate documents, and then compute the top K highest-scoring documents within A.

The formula:

score(q, d) = v(q) · v(d) = V(q) · V(d) / (|V(q)| |V(d)|),

where v(·) denotes the length-normalized (unit) vector.

So the fast cosine score algorithm:

[Figure: the FastCosineScore pseudocode from the book]
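A minimal Python sketch of such a term-at-a-time scheme, assuming a toy in-memory index of (doc_id, weight) postings (the function name and the data are my own illustration, not the book's code):

```python
import heapq
from collections import defaultdict

def fast_cosine_score(query_terms, postings, doc_lengths, k):
    """Term-at-a-time scoring: only documents that appear in some
    query term's postings list ever receive a score."""
    scores = defaultdict(float)
    for term in query_terms:
        # Each posting is (doc_id, weight), e.g. a tf-idf weight.
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight  # query term weights assumed equal
    # Normalize by document vector length to obtain the cosine score.
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    # A heap extracts the top K without fully sorting all documents.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Toy index: term -> [(doc_id, weight), ...]
postings = {
    "jealous": [(1, 2.0), (2, 0.5)],
    "gossip":  [(1, 1.0), (3, 1.5)],
}
doc_lengths = {1: 2.5, 2: 1.0, 3: 1.5}
print(fast_cosine_score(["jealous", "gossip"], postings, doc_lengths, 2))
```

Note how only documents appearing in some query term's postings list are ever touched; documents outside those lists get an implicit score of zero.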

In Manning's book, this is called inexact top-K document retrieval, with the following two-step scheme:

  1. Find a set A of documents that are contenders, where K < |A| << N. A does not necessarily contain the K top-scoring documents for the query, but is likely to have many documents with scores near those of the top K.
  2. Return the K top-scoring documents in A.

Index Elimination

  1. Only consider the postings of terms whose IDF exceeds a threshold.

Terms with a low IDF are usually stop words, and their postings lists are very long.
Skipping them greatly reduces the amount of computation, and since they do little to differentiate the documents' scores, there is little need to consider them.

  • In this case, if fewer than K documents survive the threshold, a tiered (hierarchical) index is a good way to solve the problem.
  • Tiered index: the inverted postings lists are split into layers.
    For example, postings with TF above 20 go into the first tier, and postings with TF between 10 and 20 go into the second tier. To find the top K documents, we first search only the first tier; if that does not yield K documents, we go down to the second tier to add more.

Therefore, the tiered index is a way to solve the problem that fewer than K documents may be returned.

  2. Only consider documents that contain many (or all) of the query terms; a sketch of both heuristics follows.
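A small sketch of both heuristics under illustrative assumptions (the threshold parameters and the postings layout are made up for the example):

```python
def candidate_docs(query_terms, postings, idf, idf_threshold, min_terms):
    """Index elimination: keep only high-IDF query terms, and keep only
    documents that contain at least `min_terms` of those terms."""
    kept = [t for t in query_terms if idf.get(t, 0.0) > idf_threshold]
    counts = {}
    for term in kept:
        for doc_id, _ in postings.get(term, []):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    # Subset A: documents that match enough of the surviving terms.
    return {doc_id for doc_id, c in counts.items() if c >= min_terms}
```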

Champion lists

  • Champion list

For each term t, the r documents with the highest TF values in t's postings list are extracted in advance; these form the champion list of t.

Given a query q, we only need to take the union of the champion lists of each term in q. This union is the document subset A of the general method above, and the cosine similarity is computed only for the documents in A.
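A sketch of building and querying champion lists, assuming the same (doc_id, weight) postings layout as above; r is the illustrative list size:

```python
def build_champion_lists(postings, r):
    """For each term, precompute the r postings with the highest weight."""
    return {term: sorted(plist, key=lambda p: p[1], reverse=True)[:r]
            for term, plist in postings.items()}

def candidate_set(query_terms, champions):
    """Subset A = the union of the champion lists of the query terms."""
    return {doc_id
            for term in query_terms
            for doc_id, _ in champions.get(term, [])}
```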

Static quality scores and ordering

Each document has a static score g(d), independent of the query, and the postings in the inverted index are arranged in descending order of g(d).

[Figure: a postings list with documents ordered by g(d)]

Obviously, g(3) > g(2) > g(1)

The final score is score(q, d) = g(d) + v(q) · v(d), i.e., the static score plus the cosine similarity.

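A minimal sketch of this combination, assuming a `cosine(query_terms, doc_id)` scoring function supplied from elsewhere (all names are illustrative):

```python
def topk_net_scores(query_terms, postings, g, cosine, k):
    """Rank by net-score(q, d) = g(d) + cosine(q, d).  Keeping every
    postings list in descending g(d) order lets a real implementation
    stop a traversal early; here we simply score all candidates."""
    candidates = {doc_id for t in query_terms
                  for doc_id in postings.get(t, [])}
    ranked = sorted(candidates,
                    key=lambda d: g[d] + cosine(query_terms, d),
                    reverse=True)
    return ranked[:k]
```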

Note:
PageRank is a static quality score, based on web link analysis.

  • One further idea

For each term t, two postings lists are maintained:

  1. the high list (the m documents with the highest TF values for t);
  2. the low list (the rest).

Both lists are sorted by g(d).

To get the K highest-scoring documents:
first compute scores using only the high lists; if these already yield the K highest-scoring documents, the process finishes.
Otherwise, the remaining documents are taken from the low lists.
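A sketch of this traversal, assuming `high` and `low` map each term to lists of document ids and `score` is the cosine scorer (all names are illustrative):

```python
def topk_high_low(query_terms, high, low, score, k):
    """Score documents from the high lists first; fall back to the low
    lists only if the high lists yield fewer than k candidates."""
    docs = {d for t in query_terms for d in high.get(t, [])}
    if len(docs) < k:
        docs |= {d for t in query_terms for d in low.get(t, [])}
    return sorted(docs, key=lambda d: score(query_terms, d),
                  reverse=True)[:k]
```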

Cluster pruning

basic concept:

  • leader:

Pick √N documents at random from the collection; call these leaders.

  • follower:

Each document that is not a leader is a follower; we compute its nearest leader and attach the document to it.

[Figure: leaders and their attached followers in cluster pruning]

Query Method:

  1. Given a query q, first compute its cosine similarity with each leader and find the nearest leader.
  2. The document subset A is that leader together with its followers.
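A compact sketch of cluster pruning over dense document vectors (the data structures and the plain-Python cosine are illustrative assumptions):

```python
import math
import random

def cos(u, v):
    """Cosine similarity of two dense vectors (assumed nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def build_clusters(doc_vectors):
    """Pick sqrt(N) random leaders; attach every other document
    (a follower) to its nearest leader."""
    ids = list(doc_vectors)
    leaders = random.sample(ids, max(1, math.isqrt(len(ids))))
    followers = {l: [] for l in leaders}
    for d in ids:
        if d not in followers:  # d is not itself a leader
            nearest = max(leaders,
                          key=lambda l: cos(doc_vectors[d], doc_vectors[l]))
            followers[nearest].append(d)
    return followers

def query_clusters(q_vec, doc_vectors, followers, k):
    """Subset A = nearest leader plus its followers; rank A by cosine."""
    leader = max(followers, key=lambda l: cos(q_vec, doc_vectors[l]))
    candidates = [leader] + followers[leader]
    return sorted(candidates,
                  key=lambda d: cos(q_vec, doc_vectors[d]),
                  reverse=True)[:k]
```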

Components of an information retrieval system

Tiered indexes

This was already mentioned in the “Static quality scores and ordering” section.

With a two-level tiered index, first compute scores over the first tier; if it already yields the K highest-scoring documents, the process finishes.
Otherwise, the remaining documents are taken from the second tier.

[Figure: a tiered index, with tiers defined by TF thresholds]
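A sketch of descending through the tiers until enough candidates are found (the tier layout, a list of inverted indexes, is an illustrative assumption):

```python
def tiered_candidates(query_terms, tiers, k):
    """tiers: a list of inverted indexes (term -> list of doc ids),
    highest-quality tier first.  Descend tier by tier until at least
    k candidate documents are collected."""
    candidates = set()
    for index in tiers:
        for term in query_terms:
            candidates.update(index.get(term, []))
        if len(candidates) >= k:
            break
    return candidates
```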

Query term proximity

We want the query terms to appear close to each other in a document; such a document is more likely to be relevant to the query, and ranking it higher is a mark of a good search engine.

  • basic concepts

Smallest window:
the width of the smallest window in a document d that contains all the query terms, measured in the number of words in the window.

Soft conjunctive:
the document does not need to contain all the query terms; it only needs to contain most of them.

  • Consider a query

Consider a query with two or more query terms, t1, t2, …, tk. Let ω be the width of the smallest window in a document d that contains all the query terms, measured in the number of words in the window.

Intuitively, the smaller ω is, the better d matches the query. In cases where the document does not contain all of the query terms, we can set ω to some enormous number.
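A sketch of computing ω with a simple positional scan, assuming the document is given as a token list (a straightforward, unoptimized approach):

```python
def smallest_window(doc_tokens, query_terms):
    """Width (in words) of the smallest window of doc_tokens that
    contains every query term; 'enormous' if some term is missing."""
    terms = set(query_terms)
    if not terms <= set(doc_tokens):
        return float("inf")  # the 'enormous number' for missing terms
    best = len(doc_tokens)
    for start, token in enumerate(doc_tokens):
        if token not in terms:
            continue
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in terms:
                seen.add(doc_tokens[end])
            if seen == terms:
                best = min(best, end - start + 1)
                break
    return best

print(smallest_window("the quality of mercy is not strained".split(),
                      ["mercy", "strained"]))  # -> 4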

Note:
the window width would presumably then be folded into the document's weights in the scoring function.

A Search engine

[Figure: the components of a complete search engine]

The indexer is used to generate the various indexes, such as the parametric index, zone index, k-gram index, and tiered index.

The vector space model differs from the Boolean retrieval model: the Boolean model only considers whether a term occurs in a document, without weights, and ignores how many times it appears.

Designing parsing

Sometimes, executing the user's query can entail multiple queries against the underlying indexes; for example, the query parser may issue a stream of queries:

  1. Run the user's query string as an exact phrase query.
  2. If fewer than K documents contain the phrase, run phrase queries over shorter subphrases of consecutive query terms.
  3. If there are still fewer than K results, run the vector space query consisting of the individual query terms.

Each of these steps (if invoked) may yield a list of scored documents, which demands an aggregate scoring function.
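A sketch of such a fallback cascade; `phrase_search` and `vector_search` are assumed helpers standing in for the underlying index queries, not a real library API:

```python
def parse_and_execute(query, phrase_search, vector_search, k):
    """Issue progressively looser queries until k results are found.
    phrase_search(terms) and vector_search(terms) are assumed to
    return lists of (score, doc_id) pairs."""
    terms = query.split()
    results = list(phrase_search(terms))            # exact phrase first
    if len(results) < k:
        for i in range(len(terms) - 1):             # shorter phrases
            results += phrase_search(terms[i:i + 2])
    if len(results) < k:
        results += vector_search(terms)             # free-text fallback
    # Deduplicate, keeping the best score seen for each document.
    best = {}
    for score, doc_id in results:
        best[doc_id] = max(best.get(doc_id, score), score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
```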

Scoring function

How the scores from these steps are combined depends on the setting. For example:

  1. The relative weights of the factors may need to change with the introduction of new document formats or document management systems, or after a merger with another company.
  2. It is increasingly common to use machine-learned scoring to address this, which is discussed further in Chapter 15.

These are my own notes and summary, drawing on Manning's original book and on CSDN bloggers.

Please forgive any errors in my English.
Let's keep learning together~~
