Abstract
According to the content of the past five chapters, we know how to find the documents that match a query. Nevertheless, for large document collections, the number of matching documents can far exceed what a human user could possibly sift through. The main problem of this chapter is how to compute a score for each matching document with respect to the query at hand, so that the results can be ranked.
Section 1 introduces parametric and zone indexes. Next, we develop the idea of weighting the importance of a term in a document, based on the statistics of the occurrences of the term. Then, from another view, vector space scoring is used to compute a score between a query and each document. In section 4, several variants of term weighting for the vector space model are developed.
Parametric and zone indexes
In addition to text, documents also have metadata, such as creation time, document title, and so on. Therefore, we can also restrict queries on these fields.
Parametric Index:
There are certain restrictions on the value of a field, such as the limit of the value range.
For example, the query results of documents must be published in 2020.
Find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick.
Zones are similar to fields, except the contents of a zone can be arbitrary free text.
For instance,
document titles and abstracts are generally treated as zones.
Note:
the dictionary for a parametric index comes from a fixed vocabulary, while the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.
- Reduce the space complexity
we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings.
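A minimal sketch of this idea, assuming a toy in-memory index where each posting carries the list of zones in which the term occurs (the structure and the helper `docs_with_term_in_zone` are illustrative, not from the book):

```python
# Sketch: instead of one dictionary entry per (term, zone) pair,
# each posting records the zones where the term occurs in that document.
postings = {
    "william": [
        (2, ["author", "title"]),  # docID 2: occurs in author and title zones
        (4, ["body"]),
        (8, ["author"]),
    ],
}

def docs_with_term_in_zone(term, zone):
    """Return the docIDs in which `term` occurs within `zone`."""
    return [doc_id for doc_id, zones in postings.get(term, []) if zone in zones]

print(docs_with_term_in_zone("william", "author"))  # [2, 8]
```

The dictionary now holds one entry per term rather than one per term-zone pair, which is the space saving the text describes.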
Weighted zone scoring
Suppose a document has zones f1, f2, f3, with corresponding weights w1, w2, w3.
Thus, the weighted zone score is defined to be:
S(q, d) = w1·s(q, f1) + w2·s(q, f2) + w3·s(q, f3)
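A minimal sketch of this formula, where s(q, fi) is taken to be the Boolean score: 1 if every query term occurs in zone fi, else 0 (the dict-based document representation is an assumption for illustration):

```python
def weighted_zone_score(query_terms, zones, weights):
    """zones: dict zone_name -> set of terms in that zone;
    weights: dict zone_name -> zone weight w_i."""
    score = 0.0
    for name, w in weights.items():
        # Boolean zone score: 1 if all query terms occur in this zone
        s = 1.0 if all(t in zones.get(name, set()) for t in query_terms) else 0.0
        score += w * s
    return score

doc = {"title": {"shakespeare"}, "body": {"alas", "poor", "yorick", "shakespeare"}}
w = {"title": 0.3, "body": 0.7}
print(weighted_zone_score(["shakespeare"], doc, w))  # 1.0 (both zones match)
print(weighted_zone_score(["yorick"], doc, w))       # 0.7 (body only)
```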
In the book:
Weighted zone scoring is sometimes referred to also as ranked Boolean retrieval.
- compute weighted zone scores directly from inverted indexes
The obvious difference is that when we find docID(p1) == docID(p2), we use the equation above to accumulate the weighted zone score, rather than simply reporting the match. Apart from that, this algorithm is closely similar to the postings-merge algorithm of the earlier chapter.
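The merge can be sketched as follows, assuming two sorted docID lists for the same query term in two zones (zone weights `w_a`, `w_b`); when a docID appears in both lists it receives the sum of the weights:

```python
def zone_score_merge(postings_a, postings_b, w_a, w_b):
    """postings_a/b: sorted docID lists for a term in two zones.
    Returns docID -> accumulated weighted zone score."""
    scores = {}
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            scores[postings_a[i]] = w_a + w_b  # term matches in both zones
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            scores[postings_a[i]] = w_a
            i += 1
        else:
            scores[postings_b[j]] = w_b
            j += 1
    for d in postings_a[i:]:  # drain the remainder of each list
        scores[d] = w_a
    for d in postings_b[j:]:
        scores[d] = w_b
    return scores

print(zone_score_merge([1, 3, 5], [2, 3, 6], 0.25, 0.75))
# {1: 0.25, 2: 0.75, 3: 1.0, 5: 0.25, 6: 0.75}
```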
Learning weights
We now consider a simple case of weighted zone scoring, where each doc has a title zone and a body zone.
So the score formula is: score(d, q) = g·sT(d, q) + (1 − g)·sB(d, q), where g ∈ [0, 1] is the weight of the title zone and (1 − g) that of the body zone.
- How to determine the constant g
A given training document dj and a given training query qj are assessed by a human editor, who delivers a relevance judgment r(dj, qj) that is either relevant or nonrelevant.
Define the error of the scoring function with weight g on a training example as ε(g, dj, qj) = (r(dj, qj) − score(dj, qj))², where we have quantized the editorial relevance judgment r(dj, qj) to 0 or 1.
The total error is the sum of ε(g, dj, qj) over all training examples j.
The problem of learning the constant g from the given training examples then reduces to picking the value of g that minimizes the total error.
- Optimization
For training examples with sT(dj, qj) = 0 and sB(dj, qj) = 1, the score is 1 − g, so their contribution to the total error is: n01r·g² + n01n·(1 − g)², where n01r (resp. n01n) counts such examples judged relevant (resp. nonrelevant).
By writing the remaining cases in similar fashion, the total error function is: (n10r + n01n)(1 − g)² + (n10n + n01r)g², plus a term that does not depend on g (from examples with sT = sB).
This problem has now become the traditional problem of finding the extremum of a function of one variable.
By differentiating the total error function with respect to g and setting the result to 0, it follows that the optimal value of g is: g = (n10r + n01n) / (n10r + n10n + n01r + n01n).
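The closed-form solution above can be computed directly from the four counts; a minimal sketch:

```python
def optimal_g(n10r, n10n, n01r, n01n):
    """Optimal title weight g minimizing the total squared error.
    n10r: examples with sT=1, sB=0 judged relevant; n10n: same, nonrelevant;
    n01r: examples with sT=0, sB=1 judged relevant; n01n: same, nonrelevant."""
    return (n10r + n01n) / (n10r + n10n + n01r + n01n)

# E.g. 3 relevant title-only matches, 1 nonrelevant; 2 relevant body-only, 4 nonrelevant:
print(optimal_g(3, 1, 2, 4))  # 0.7
```

Intuitively, g grows when title-only matches tend to be relevant (n10r) and body-only matches tend to be nonrelevant (n01n).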
Term Frequency
Generally, given a query, we iterate over each query term, calculate its score for each document, and then sum the scores of the query terms in a document to obtain the degree of match between the query and the document.
- The weight of a term in a document is the number of times it appears in the document.
- Bag of words model: the order of terms in the document is ignored; only the number of occurrences matters.
Compared with the Boolean retrieval model, this is a great improvement.
- TF: the number of occurrences of a term in a document
With raw TF, the weight of a term depends only on its number of occurrences. This causes problems: for example, the word “the” appears many times in most documents, so its TF is very high, but the word is not important at all.
NOTE:
For convenience, the TF value is generally stored in the postings of the inverted index, while the IDF value is stored in the dictionary.
If only TF is used as the weight, there would be some problems.
Therefore, we need to introduce other weight representation;
Inverse Document Frequency
DF (df_t): the number of documents in which term t appears
IDF (idf_t): defined as log(N / df_t); the more documents a term appears in, the lower its IDF value
CF (cf_t): the total number of occurrences of term t in the document collection
N: the total number of documents in the collection
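A minimal sketch of the IDF definition above (using base-10 logarithms, as the book commonly does):

```python
import math

def idf(df_t, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df_t)

# A rare term gets a high idf; a term in every document gets idf 0.
print(idf(1, 1000))     # 3.0
print(idf(1000, 1000))  # 0.0
```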
- TF-IDF
the weight of term t in document d is tf-idf(t, d) = tf(t, d) × idf(t), and the score of each document is obtained by summing the tf-idf weights of the query terms it contains.
The vector space model for scoring
- Main Idea
The document is regarded as a vector of term weights.
Form the term-document matrix, in which each value is the weight of a specific term corresponding to a specific document, indicating the importance of the term in the document.
Euclidean (length) normalization:
given v = (v1, v2, v3),
the normalized vector is (v1/‖v‖, v2/‖v‖, v3/‖v‖), where ‖v‖ = sqrt(v1² + v2² + v3²) is the Euclidean length of the vector; after normalization every vector has unit length, so long and short documents become comparable.
Similarity
- Construct each document into vector
- Do normalization respectively
- Vector dot product
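The three steps above amount to cosine similarity; a minimal sketch:

```python
import math

def l2_normalize(v):
    """Divide each component by the Euclidean length of the vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine_similarity(u, v):
    """Dot product of the two length-normalized vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

# Parallel vectors score 1, orthogonal vectors score 0.
print(round(cosine_similarity([1.0, 2.0, 2.0], [2.0, 4.0, 4.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # 0.0
```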
Queries as vectors
Construct the query as a vector, just as the document set constitutes the document vectors.
In the query vector, each value represents the weight of the term in the query, which can be the TF, DF, or TF-IDF weight.
[Figure: a worked query-document scoring example; image from baidu.com, omitted here]
From the figure we can see:
the query weights are normalized TF-IDF values;
the document vector weights are normalized TF values.
Finally, the dot product of the two vectors is the score of (q, d).
In the figure, the query vector (0, 1.3, 2.0, 3.0) is obtained by calculating these weights.
Note:
for example:
if the query is “auto car”, the final query weight may not be (1, 0, 1, 0);
if TF is not used and only DF is used, the weight of the query vector will be (2.3, 1.3, 2.0, 3.0);
- Computing vector space score
Variant TF-IDF functions
sublinear tf scaling
It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence.
A common modification is therefore to use the logarithm of the term frequency: replace tf by wf, defined as wf(t, d) = 1 + log tf(t, d) if tf(t, d) > 0, and 0 otherwise.
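A minimal sketch of sublinear tf scaling (base-10 logarithm assumed):

```python
import math

def wf(tf):
    """Sublinear tf scaling: wf = 1 + log10(tf) if tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

print(wf(1))   # 1.0
print(wf(20))  # ~2.3: twenty occurrences, far from twenty times the weight
print(wf(0))   # 0.0
```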
Maximum tf normalization
Here ntf(t, d) = a + (1 − a)·tf(t, d)/tf_max(d), where tf_max(d) is the largest raw tf of any term in d. The term a is a smoothing term whose role is to damp the contribution of the second term; the scheme may be viewed as scaling down tf by the largest tf value in d.
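A minimal sketch of maximum tf normalization (a = 0.4 is a commonly used smoothing value):

```python
def ntf(tf, tf_max, a=0.4):
    """Maximum tf normalization: ntf = a + (1 - a) * tf / tf_max,
    where tf_max is the largest raw tf of any term in the document."""
    return a + (1.0 - a) * tf / tf_max

print(ntf(10, 10))           # 1.0  (the most frequent term in the document)
print(round(ntf(1, 10), 2))  # 0.46 (rare terms are damped toward a)
```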
- Maximum tf normalization does suffer from the following issues:
- It is unstable in the following sense: a change in the stop word list can dramatically alter the term weights.
- A document may contain an outlier term with an unusually large number of occurrences of that term, which is not representative of its content.
- More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
Document and query weighting schemes
A weighting scheme is denoted by a mnemonic of the form ddd.qqq, where the first triplet gives the term weighting of the document vector and the second triplet gives the weighting in the query vector. The first letter in each triplet specifies the term frequency component of the weighting, the second the document frequency component, and the third the form of normalization used.
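A hypothetical helper decoding such a mnemonic, e.g. "lnc.ltc"; the tables below cover only a few common letters of the SMART notation, not the full set:

```python
# Partial letter tables (assumption: only the most common options are listed).
TF = {"n": "natural tf", "l": "logarithmic (1 + log tf)",
      "a": "augmented", "b": "boolean"}
DF = {"n": "no idf", "t": "idf = log(N/df)", "p": "prob idf"}
NORM = {"n": "none", "c": "cosine (L2) normalization"}

def decode(scheme):
    """Split a ddd.qqq mnemonic into document and query component descriptions."""
    d, q = scheme.split(".")
    def triplet(t):
        return (TF[t[0]], DF[t[1]], NORM[t[2]])
    return {"document": triplet(d), "query": triplet(q)}

print(decode("lnc.ltc"))
```

For instance, "lnc.ltc" means log tf, no idf, and cosine normalization for documents, but log tf, idf, and cosine normalization for queries.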
Pivoted normalized document length
Longer documents can broadly be lumped into two categories:
- verbose documents that essentially repeat the same content – in these, the length of the document does not alter the relative weights of different terms;
- documents covering multiple different topics, in which the relative weights of terms are quite different from those in a single short document that matches the query terms.
Notice the following aspects of the thick line representing pivoted length normalization:
These are my own notes and summary, drawing on Manning's original book and the CSDN blogger:
- iteye_17686: (link)
My English inevitably contains some mistakes; please bear with me.
Let's keep learning together~~