
# Score computation mechanism

I am learning Elasticsearch these days, so I’m really curious about how Elasticsearch computes the score of retrieved documents.

# Score Equation

The score formula given in Lucene’s Practical Scoring Function is as follows:

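$$\mathrm{score}(q,d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big) \tag{1}$$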

score(q,d) is computed separately on each field, and the per-field scores are then summed (this also depends on how you ask Elasticsearch to combine them).

• Term frequency
tf(t in d) = √frequency
The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.
In practice, tf is counted within a single field.

• Inverse document frequency
idf(t) = 1 + log ( numDocs / (docFreq + 1))

The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index, divided by the number of documents that contain the term.
IDF is likewise computed per field.

• Field-length norm
norm(d) = 1 / √numTerms

The field-length norm (norm) is the inverse square root of the number of terms in the field.
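
The three factors above can be sketched in a few lines of Python. This is only an illustration of the formulas as written here, not Lucene's actual code; the function names are mine, and the natural logarithm is assumed because it reproduces the idf values in the explain output later in this post.

```python
import math

def tf(freq):
    """Term frequency: square root of how often the term occurs in the field."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency, assuming the natural logarithm."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field-length norm before it is compressed for storage."""
    return 1.0 / math.sqrt(num_terms)

print(tf(2))          # about 1.4142
print(idf(2, 2))      # about 0.5945
print(field_norm(3))  # about 0.5774
```

With docFreq=2 and numDocs=2, `idf` gives roughly 0.5945, the same value that appears as `idf(docFreq=2, maxDocs=2)` in the explain output below.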

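Elasticsearch does not use the Vector Space Model on its own, because computing document vectors is relatively time-consuming. Instead, it computes the score with a combination of three models: the Boolean Model, the TF/IDF Model, and the Vector Space Model.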

The official explanation of t.getBoost():
In fact, reading the explain output is a little more complex than that. You won’t see the boost value or t.getBoost() mentioned in the explanation at all. Instead, the boost is rolled into the queryNorm that is applied to a particular term. Although we said that the queryNorm is the same for every term, you will see that the queryNorm for a boosted term is higher than the queryNorm for an unboosted term.

## Query Coordination

The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.

Imagine that we have a query for quick brown fox, and that the weight for each term is 1.5. Without the coordination factor, the score would just be the sum of the weights of the terms in a document. For instance:

Document with fox → score: 1.5
Document with quick fox → score: 3.0
Document with quick brown fox → score: 4.5
The coordination factor multiplies the score by the number of matching terms in the document, and divides it by the total number of terms in the query. With the coordination factor, the scores would be as follows:

Document with fox → score: 1.5 * 1 / 3 = 0.5
Document with quick fox → score: 3.0 * 2 / 3 = 2.0
Document with quick brown fox → score: 4.5 * 3 / 3 = 4.5
The coordination factor results in the document that contains all three terms being much more relevant than the document that contains just two of them.
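
The quick brown fox example can be replayed with a small Python sketch. The fixed weight of 1.5 per term comes from the text above; the helper names are only illustrative.

```python
TERM_WEIGHT = 1.5
QUERY_TERMS = ["quick", "brown", "fox"]

def score(doc_terms, use_coord=True):
    """Sum the weights of matching terms, optionally scaled by coord."""
    matched = [t for t in QUERY_TERMS if t in doc_terms]
    raw = TERM_WEIGHT * len(matched)
    if not use_coord:
        return raw
    # coord = matching terms / total query terms
    return raw * len(matched) / len(QUERY_TERMS)

for doc in (["fox"], ["quick", "fox"], ["quick", "brown", "fox"]):
    print(doc, score(doc, use_coord=False), score(doc))
```

The gap between the two-term and the three-term document grows from 3.0 vs 4.5 to 2.0 vs 4.5 once coord is applied.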

## Query Normalization Factor

The query normalization factor (queryNorm) is an attempt to normalize a query so that the results from one query may be compared with the results of another.
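The intended benefit of queryNorm(q) is to put the scores of different queries into the same space, so that even the results of different queries can be compared directly.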

Even though the intent of the query norm is to make results from different queries comparable, it doesn’t work very well. The only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries.

This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved, but a typical implementation is as follows:
queryNorm = 1 / √sumOfSquaredWeights
The sumOfSquaredWeights is calculated by adding together the IDF of each term in the query, squared.
The same query normalization factor is applied to every document, and you have no way of changing it. For all intents and purposes, it can be ignored.
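
As a minimal sketch of this "typical implementation" (ignoring boosts, which are covered in the next section), queryNorm can be computed from the IDF values of the query terms. The function name and the sample IDF values are made up for illustration.

```python
import math

def query_norm(term_idfs):
    """queryNorm = 1 / sqrt(sum of squared IDFs of the query terms)."""
    sum_of_squared_weights = sum(i * i for i in term_idfs)
    return 1.0 / math.sqrt(sum_of_squared_weights)

# Two query terms that both have idf = 1.0
print(query_norm([1.0, 1.0]))  # about 0.7071
```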

## queryNorm in Lucene

In TFIDFSimilarity, queryNorm is defined as follows:

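$$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$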

Lucene gives only the following explanation:
t.getBoost() is a search time boost of term t in the query q as specified in the query text (see query syntax), or as set by application calls to setBoost(). Notice that there is really no direct API for accessing a boost of one term in a multi term query, but rather multi terms are represented in a query as multi TermQuery objects, and so the boost of a term in the query is accessible by calling the sub-query getBoost().

GET /test/news/_search?explain
{
"query": {
"multi_match": {
"query": "apple iphone6",
"fields": ["title^3", "body^2"],
"type": "most_fields"
}
}
}

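For a multi_match query with boosted fields like the one above, the sumOfSquaredWeights used for a given field works out to:

$$\mathrm{sumOfSquaredWeights} = \left(\frac{1}{\mathrm{fieldBoost}}\right)^2 \cdot \sum_{t \in q} \; \sum_{\mathrm{field} \in \mathrm{searchFields}} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$

Here fieldBoost is the boost of the field being scored, and t.getBoost() is the boost of the field in which term t is searched.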

# Worked Example

PUT /test
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}

PUT /test/_mapping/news
{
"properties": {
"title": {
"type": "string",
"analyzer": "english"
},
"body": {
"type": "string",
"analyzer": "english"
},
"version": {
"type": "string",
"analyzer": "english"
}
}
}

PUT /test/news/1
{
"title": "apple released iphone",
"body": "last day, apple company has released their latest product iphone 6, which is the biggest ihpone in histroy"
}

PUT /test/news/2
{
"title": "microsoft suied apple",
"body": "microsoft told that apple has used many of their patents, apple need to pay for these patents for 12 billion"
}

GET /test/news/_search?explain
{
"query": {
"multi_match": {
"query": "apple iphone",
"fields": ["title^8", "body^3"],
"type": "most_fields"
}
}
}

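The JSON response is long, so only the parts relevant to our calculation are shown here (the full response is at the end of this post):
1. First, the score of apple in the title of document 1: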

{
"value": 0.14224225,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.4784993,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.80482966,
"description": "queryNorm"
}
]
},
{
"value": 0.29726744,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.5,
"description": "fieldNorm(doc=0)"
}
]
}
]
}

field norm = 1 / √3 = 0.5773502691896258, but Elasticsearch stores this norm in a single byte, so precision is lost and the value becomes 0.5.
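
To see where the 0.5 comes from, here is a Python port of the one-byte encoding as I understand it from Lucene's `SmallFloat.floatToByte315` and `byte315ToFloat` (3 mantissa bits, 5 exponent bits). Treat it as a sketch under that assumption, not as the exact code path of every Lucene version.

```python
import struct

OFFSET = (63 - 15) << 3  # exponent offset used by the 3/5-bit encoding

def float_to_byte315(f):
    """Compress a float into one byte (assumed port of SmallFloat.floatToByte315)."""
    if f <= 0.0:
        return 0
    bits = struct.unpack(">I", struct.pack(">f", f))[0]  # raw float32 bits
    small = bits >> (24 - 3)  # keep the exponent and the top 3 mantissa bits
    if small <= OFFSET:
        return 1      # underflow: smallest positive value
    if small >= OFFSET + 0x100:
        return 255    # overflow: largest value
    return small - OFFSET

def byte315_to_float(b):
    """Expand the byte back into a float (assumed port of byte315ToFloat)."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

norm = 3 ** -0.5  # 1 / sqrt(3), the title of document 1 has three terms
print(byte315_to_float(float_to_byte315(norm)))  # 0.5
```

Only three mantissa bits survive the round trip, so 0.577 falls back to the nearest lower representable value, 0.5.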

idf of apple in title (idf1): idf1 = 0.5945349
idf of apple in body (idf2): idf2 = 0.5945349
apple appears in both documents in both the title and body fields, so idf1 = idf2 = 1 + log(2/3)
idf of iphone in title (idf3): idf3 = 1 + log(2/2) = 1
idf of iphone in body (idf4): idf4 = 1
iphone appears in only one document in both the title and body fields, so idf3 = idf4 = 1 + log(2/2)

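For the title field (boost 8), sumOfSquaredWeights is:
1/8 * 1/8 * (idf1 * idf1 * 8 * 8 + idf2 * idf2 * 3 * 3 + idf3 * idf3 * 8 * 8 + idf4 * idf4 * 3 * 3) = 1.543803711784605
queryNorm = 1/Math.sqrt(1.543803711784605) = 0.8048296354648813

For the body field (boost 3), the same sum is divided by 3 * 3 instead:
1/3 * 1/3 * (idf1 * idf1 * 8 * 8 + idf2 * idf2 * 3 * 3 + idf3 * idf3 * 8 * 8 + idf4 * idf4 * 3 * 3) ≈ 10.978
queryNorm = 1/Math.sqrt(10.978) ≈ 0.3018, which matches the 0.30181113 reported for the body field in the explain output.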
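
The same arithmetic as a Python sketch. The docFreq values and numDocs are read off the explain output, the boosts come from the query (`title^8`, `body^3`), and the helper names are only illustrative.

```python
import math

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 2
FIELD_BOOSTS = {"title": 8.0, "body": 3.0}
# docFreq per (field, term), taken from the explain output
DOC_FREQS = {
    ("title", "apple"): 2, ("title", "iphone"): 1,
    ("body", "apple"): 2, ("body", "iphone"): 1,
}

def query_norm(field):
    """queryNorm for one field, with that field's boost rolled in."""
    total = sum(
        (idf(df, NUM_DOCS) * FIELD_BOOSTS[f]) ** 2
        for (f, _term), df in DOC_FREQS.items()
    )
    sum_of_squared_weights = total / FIELD_BOOSTS[field] ** 2
    return 1.0 / math.sqrt(sum_of_squared_weights)

print(query_norm("title"))  # about 0.8048
print(query_norm("body"))   # about 0.3018
```

Because the sum over all terms and fields is shared, the two norms differ only by the factor fieldBoost, which is why the boost never shows up as a separate entry in the explanation.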

## Comparing queryNorm

The ratio of the field boosts is 8/3.
The ratio of the queryNorm values is 0.80482966/0.30181113 = 8/3.

Haha, cool!

The complete response, for reference:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.6467803,
"hits": [
{
"_shard": 0,
"_node": "hwVl0ucyS_6Ps9-xQ2Ihbw",
"_index": "test",
"_type": "news",
"_id": "1",
"_score": 0.6467803,
"_source": {
"title": "apple released iphone",
"body": "last day, apple company has released their latest product iphone 6, which is the biggest ihpone in histroy"
},
"_explanation": {
"value": 0.6467803,
"description": "sum of:",
"details": [
{
"value": 0.5446571,
"description": "sum of:",
"details": [
{
"value": 0.14224225,
"description": "weight(title:appl in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.14224225,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.4784993,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.80482966,
"description": "queryNorm"
}
]
},
{
"value": 0.29726744,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.5,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
},
{
"value": 0.40241483,
"description": "weight(title:iphon in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.40241483,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.80482966,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)"
},
{
"value": 0.80482966,
"description": "queryNorm"
}
]
},
{
"value": 0.5,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)"
},
{
"value": 0.5,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
},
{
"value": 0.10212321,
"description": "sum of:",
"details": [
{
"value": 0.026670424,
"description": "weight(body:appl in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.026670424,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.17943723,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.30181113,
"description": "queryNorm"
}
]
},
{
"value": 0.14863372,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.25,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
},
{
"value": 0.07545278,
"description": "weight(body:iphon in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.07545278,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.30181113,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)"
},
{
"value": 0.30181113,
"description": "queryNorm"
}
]
},
{
"value": 0.25,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)"
},
{
"value": 0.25,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
}
]
}
},
{
"_shard": 0,
"_node": "hwVl0ucyS_6Ps9-xQ2Ihbw",
"_index": "test",
"_type": "news",
"_id": "2",
"_score": 0.08997996,
"_source": {
"title": "microsoft suied apple",
"body": "microsoft told that apple has used many of their patents, apple need to pay for these patents for 12 billion"
},
"_explanation": {
"value": 0.08997996,
"description": "sum of:",
"details": [
{
"value": 0.07112113,
"description": "product of:",
"details": [
{
"value": 0.14224225,
"description": "sum of:",
"details": [
{
"value": 0.14224225,
"description": "weight(title:appl in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.14224225,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.4784993,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.80482966,
"description": "queryNorm"
}
]
},
{
"value": 0.29726744,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.5,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
},
{
"value": 0.5,
"description": "coord(1/2)"
}
]
},
{
"value": 0.018858837,
"description": "product of:",
"details": [
{
"value": 0.037717674,
"description": "sum of:",
"details": [
{
"value": 0.037717674,
"description": "weight(body:appl in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.037717674,
"description": "score(doc=0,freq=2.0), product of:",
"details": [
{
"value": 0.17943723,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.30181113,
"description": "queryNorm"
}
]
},
{
"value": 0.21019982,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1.4142135,
"description": "tf(freq=2.0), with freq of:",
"details": [
{
"value": 2,
"description": "termFreq=2.0"
}
]
},
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.25,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
},
{
"value": 0.5,
"description": "coord(1/2)"
}
]
}
]
}
}
]
}
}
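
As a final check, both `_score` values can be rebuilt from the pieces in the explanation tree. This is a sketch that plugs in the queryNorm and fieldNorm values reported by Elasticsearch rather than deriving them; the function names are mine.

```python
import math

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def term_score(freq, doc_freq, query_norm, field_norm, num_docs=2):
    """score = queryWeight * fieldWeight, as laid out in the explain tree."""
    term_idf = idf(doc_freq, num_docs)
    query_weight = term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight

# queryNorm values copied from the explain output
QN_TITLE, QN_BODY = 0.80482966, 0.30181113

# Document 1 matches both terms in both fields, so no coord factor appears.
doc1 = (
    term_score(1, 2, QN_TITLE, 0.5) + term_score(1, 1, QN_TITLE, 0.5)
    + term_score(1, 2, QN_BODY, 0.25) + term_score(1, 1, QN_BODY, 0.25)
)

# Document 2 matches only "apple", so each field's sum is scaled by coord(1/2).
doc2 = 0.5 * term_score(1, 2, QN_TITLE, 0.5) + 0.5 * term_score(2, 2, QN_BODY, 0.25)

print(round(doc1, 6), round(doc2, 6))
```

The two results agree with the reported `_score` values of 0.6467803 and 0.08997996 up to float rounding.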