Abstract
According to the content of the past five chapters, we know how to find the documents that match a query. Nevertheless, for large document collections, the number of matching documents can far exceed what a human user could possibly sift through. The main problem of this chapter is how to compute a score for each matching document with respect to the query at hand, so that the results can be ranked.
Section 1 introduces parametric and zone indexes. Next, we develop the idea of weighting the importance of a term in a document, based on the statistics of the occurrences of the term. Then, from another view, vector space scoring is used to compute a score between a query and each document. In section 4, several variants of term weighting for the vector space model are developed.
Parametric and zone indexes
In addition to text, documents also have metadata, such as creation time, document title, and so on. Therefore, we can also restrict queries on these fields.
Parametric Index:
There are certain restrictions on the value of a field, such as the limit of the value range.
For example, the query results of documents must be published in 2020.
Find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick.
Zones are similar to fields, except the contents of a zone can be arbitrary free text.
For instance,
document titles and abstracts are generally treated as zones.
Note:
the dictionary for a parametric index comes from a fixed vocabulary, while the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.
- Reduce the space complexity
we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings.
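A minimal sketch of this idea, assuming a toy in-memory index where each posting carries the list of zones in which the term occurs (the structure and the helper `docs_with_term_in_zone` are illustrative, not from the book):

```python
# Sketch: instead of one dictionary entry per (term, zone) pair,
# each posting records the zones where the term occurs in that document.
postings = {
    "william": [
        (2, ["author", "title"]),  # docID 2: occurs in author and title zones
        (4, ["body"]),
        (8, ["author"]),
    ],
}

def docs_with_term_in_zone(term, zone):
    """Return the docIDs in which `term` occurs within `zone`."""
    return [doc_id for doc_id, zones in postings.get(term, []) if zone in zones]

print(docs_with_term_in_zone("william", "author"))  # [2, 8]
```

The dictionary now holds one entry per term rather than one per term-zone pair, which is the space saving the text describes.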
Weighted zone scoring
Suppose a document has zones f1, f2, f3, with corresponding weights w1, w2, w3.
Thus, the weighted zone score is defined to be:
S(q, d) = w1·s(q, f1) + w2·s(q, f2) + w3·s(q, f3)
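A minimal sketch of this formula, where s(q, fi) is taken to be the Boolean score: 1 if every query term occurs in zone fi, else 0 (the dict-based document representation is an assumption for illustration):

```python
def weighted_zone_score(query_terms, zones, weights):
    """zones: dict zone_name -> set of terms in that zone;
    weights: dict zone_name -> zone weight w_i."""
    score = 0.0
    for name, w in weights.items():
        # Boolean zone score: 1 if all query terms occur in this zone
        s = 1.0 if all(t in zones.get(name, set()) for t in query_terms) else 0.0
        score += w * s
    return score

doc = {"title": {"shakespeare"}, "body": {"alas", "poor", "yorick", "shakespeare"}}
w = {"title": 0.3, "body": 0.7}
print(weighted_zone_score(["shakespeare"], doc, w))  # 1.0 (both zones match)
print(weighted_zone_score(["yorick"], doc, w))       # 0.7 (body only)
```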
In the book:
Weighted zone scoring is sometimes referred to also as ranked Boolean retrieval.
- compute weighted zone scores directly from inverted indexes
The obvious difference is that when we find docID(p1) == docID(p2), we use the equation above to accumulate the weighted zone score, rather than simply reporting the match. Apart from that, this algorithm is closely similar to the postings-merge algorithm of the earlier chapter.
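The merge can be sketched as follows, assuming two sorted docID lists for the same query term in two zones (zone weights `w_a`, `w_b`); when a docID appears in both lists it receives the sum of the weights:

```python
def zone_score_merge(postings_a, postings_b, w_a, w_b):
    """postings_a/b: sorted docID lists for a term in two zones.
    Returns docID -> accumulated weighted zone score."""
    scores = {}
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            scores[postings_a[i]] = w_a + w_b  # term matches in both zones
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            scores[postings_a[i]] = w_a
            i += 1
        else:
            scores[postings_b[j]] = w_b
            j += 1
    for d in postings_a[i:]:  # drain the remainder of each list
        scores[d] = w_a
    for d in postings_b[j:]:
        scores[d] = w_b
    return scores

print(zone_score_merge([1, 3, 5], [2, 3, 6], 0.25, 0.75))
# {1: 0.25, 2: 0.75, 3: 1.0, 5: 0.25, 6: 0.75}
```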
Learning weights
We now consider a simple case of weighted zone scoring, where each doc has a title zone and a body zone.
So the score formula is: score(d, q) = g·sT(d, q) + (1 − g)·sB(d, q), where g ∈ [0, 1] is the weight of the title zone and (1 − g) that of the body zone.
- How to determine the constant g
A given training document dj and a given training query qj are assessed by a human editor, who delivers a relevance judgment r(dj, qj) that is either relevant or nonrelevant.
Define the error of the scoring function with weight g on a training example as ε(g, dj, qj) = (r(dj, qj) − score(dj, qj))², where we have quantized the editorial relevance judgment r(dj, qj) to 0 or 1.
The total error is the sum of ε(g, dj, qj) over all training examples j.
The problem of learning the constant g from the given training examples then reduces to picking the value of g that minimizes the total error.
- Optimization
For training examples with sT(dj, qj) = 0 and sB(dj, qj) = 1, the score is 1 − g, so their contribution to the total error is: n01r·g² + n01n·(1 − g)², where n01r (resp. n01n) counts such examples judged relevant (resp. nonrelevant).
By writing the remaining cases in similar fashion, the total error function is: (n10r + n01n)(1 − g)² + (n10n + n01r)g², plus a term that does not depend on g (from examples with sT = sB).
This problem has now become the traditional problem of finding the extremum of a function of one variable.
By differentiating the total error function with respect to g and setting the result to 0, it follows that the optimal value of g is: g = (n10r + n01n) / (n10r + n10n + n01r + n01n).
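The closed-form solution above can be computed directly from the four counts; a minimal sketch:

```python
def optimal_g(n10r, n10n, n01r, n01n):
    """Optimal title weight g minimizing the total squared error.
    n10r: examples with sT=1, sB=0 judged relevant; n10n: same, nonrelevant;
    n01r: examples with sT=0, sB=1 judged relevant; n01n: same, nonrelevant."""
    return (n10r + n01n) / (n10r + n10n + n01r + n01n)

# E.g. 3 relevant title-only matches, 1 nonrelevant; 2 relevant body-only, 4 nonrelevant:
print(optimal_g(3, 1, 2, 4))  # 0.7
```

Intuitively, g grows when title-only matches tend to be relevant (n10r) and body-only matches tend to be nonrelevant (n01n).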
Term Frequency
Generally, given a query, we iterate over each query term, calculate its score for each document, and then sum the scores of the query terms in a document to obtain the degree of match between the query and the document.
- The weight of a term in a document is the number of times it appears in the document.
- Bag of words model: the order of terms in the document is ignored; only the number of occurrences matters.
Compared with the Boolean retrieval model, this is a great improvement.
- TF: the number of occurrences of a term in a document
With raw TF, the weight of a term depends only on its number of occurrences. This causes problems: for example, the word “the” appears many times in most documents, so its TF is very high, but the word is not important at all.
NOTE:
For convenience, the TF value is generally stored in the postings of the inverted index, while the IDF value is stored in the dictionary.
If only TF is used as the weight, there would be some problems.
Therefore, we need to introduce other weight representation;
Inverse Document Frequency
DF (df_t): the number of documents in which term t appears
IDF (idf_t): defined as log(N / df_t); the more documents a term appears in, the lower its IDF value
CF (cf_t): the total number of occurrences of term t in the document collection
N: the total number of documents in the collection
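A minimal sketch of the IDF definition above (using base-10 logarithms, as the book commonly does):

```python
import math

def idf(df_t, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df_t)

# A rare term gets a high idf; a term in every document gets idf 0.
print(idf(1, 1000))     # 3.0
print(idf(1000, 1000))  # 0.0
```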
- TF-IDF
the weight of term t in document d is tf-idf(t, d) = tf(t, d) × idf(t), and the score of each document is obtained by summing the tf-idf weights of the query terms it contains.
The vector space model for scoring
- Main Idea
The document is regarded as a vector of term weights.
Form the term-document matrix, in which each value is the weight of a specific term corresponding to a specific document, indicating the importance of the term in the document.
Euclidean (length) normalization:
given v = (v1, v2, v3),
the normalized vector is (v1/‖v‖, v2/‖v‖, v3/‖v‖), where ‖v‖ = sqrt(v1² + v2² + v3²) is the Euclidean length of the vector; after normalization every vector has unit length, so long and short documents become comparable.
Similarity
- Construct each document into vector
- Do normalization respectively
- Vector dot product
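The three steps above amount to cosine similarity; a minimal sketch:

```python
import math

def l2_normalize(v):
    """Divide each component by the Euclidean length of the vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine_similarity(u, v):
    """Dot product of the two length-normalized vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

# Parallel vectors score 1, orthogonal vectors score 0.
print(round(cosine_similarity([1.0, 2.0, 2.0], [2.0, 4.0, 4.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # 0.0
```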
Queries as vectors
Construct the query as a vector, just as the document set constitutes the document vectors.
In the query vector, each value represents the weight of the term in the query, which can be the TF, DF, or TF-IDF weight.
[Figure: a worked query-document scoring example; image from baidu.com, omitted here]
From the figure we can see:
the query weights are normalized TF-IDF values;
the document vector weights are normalized TF values.
Finally, the dot product of the two vectors is the score of (q, d).
In the figure, the query vector (0, 1.3, 2.0, 3.0) is obtained by calculating these weights.
Note:
for example:
if the query is “auto car”, the final query weight may not be (1, 0, 1, 0);
if TF is not used and only DF is used, the weight of the query vector will be (2.3, 1.3, 2.0, 3.0);
- Computing vector space score
Variant TF-IDF functions
sublinear tf scaling
It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence.
A common modification is therefore to use the logarithm of the term frequency: replace tf by wf, defined as wf(t, d) = 1 + log tf(t, d) if tf(t, d) > 0, and 0 otherwise.
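A minimal sketch of sublinear tf scaling (base-10 logarithm assumed):

```python
import math

def wf(tf):
    """Sublinear tf scaling: wf = 1 + log10(tf) if tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

print(wf(1))   # 1.0
print(wf(20))  # ~2.3: twenty occurrences, far from twenty times the weight
print(wf(0))   # 0.0
```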
Maximum tf normalization
Here ntf(t, d) = a + (1 − a)·tf(t, d)/tf_max(d), where tf_max(d) is the largest raw tf of any term in d. The term a is a smoothing term whose role is to damp the contribution of the second term; the scheme may be viewed as scaling down tf by the largest tf value in d.
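A minimal sketch of maximum tf normalization (a = 0.4 is a commonly used smoothing value):

```python
def ntf(tf, tf_max, a=0.4):
    """Maximum tf normalization: ntf = a + (1 - a) * tf / tf_max,
    where tf_max is the largest raw tf of any term in the document."""
    return a + (1.0 - a) * tf / tf_max

print(ntf(10, 10))           # 1.0  (the most frequent term in the document)
print(round(ntf(1, 10), 2))  # 0.46 (rare terms are damped toward a)
```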
- Maximum tf normalization does suffer from the following issues:
- It is unstable in the following sense: a change in the stop word list can dramatically alter the term weights.
- A document may contain an outlier term with an unusually large number of occurrences of that term, which is not representative of its content.
- More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
Document and query weighting schemes
A weighting scheme is denoted by a mnemonic of the form ddd.qqq, where the first triplet gives the term weighting of the document vector and the second triplet gives the weighting in the query vector. The first letter in each triplet specifies the term frequency component of the weighting, the second the document frequency component, and the third the form of normalization used.
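A hypothetical helper decoding such a mnemonic, e.g. "lnc.ltc"; the tables below cover only a few common letters of the SMART notation, not the full set:

```python
# Partial letter tables (assumption: only the most common options are listed).
TF = {"n": "natural tf", "l": "logarithmic (1 + log tf)",
      "a": "augmented", "b": "boolean"}
DF = {"n": "no idf", "t": "idf = log(N/df)", "p": "prob idf"}
NORM = {"n": "none", "c": "cosine (L2) normalization"}

def decode(scheme):
    """Split a ddd.qqq mnemonic into document and query component descriptions."""
    d, q = scheme.split(".")
    def triplet(t):
        return (TF[t[0]], DF[t[1]], NORM[t[2]])
    return {"document": triplet(d), "query": triplet(q)}

print(decode("lnc.ltc"))
```

For instance, "lnc.ltc" means log tf, no idf, and cosine normalization for documents, but log tf, idf, and cosine normalization for queries.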
Pivoted normalized document length
Longer documents can broadly be lumped into two categories:
- verbose documents that essentially repeat the same content – in these, the length of the document does not alter the relative weights of different terms;
- documents covering multiple different topics, in which the relative weights of terms are quite different from those in a single short document that matches the query terms.
Notice the following aspects of the thick line representing pivoted length normalization:
These are my own notes and summary, drawing on Manning's original book and the CSDN blogger:
- iteye_17686: (link)
My English inevitably contains some mistakes; please bear with me.
Let's keep learning together~~