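1. Algorithm overview
The relevance score algorithm, put simply, computes how closely a piece of text in an index matches the search text.
Elasticsearch uses the term frequency / inverse document frequency algorithm, TF/IDF for short. (TF/IDF was the default before version 5.0; later versions default to BM25, which is built from the same three factors described here.)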
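Term frequency: how many times each term of the search text appears in the field text. The more occurrences, the more relevant the document.
Search request: hello world
doc1: hello you, and world is very good
doc2: hello, how are you
doc1 contains both terms, hello and world, while doc2 contains only hello, so doc1 is more relevant and scores higher.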
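As a rough illustration of the term-frequency idea, the sketch below counts how often each query term occurs in the two example documents. The regex tokenizer and the function names are assumptions made for this example; they are not how Elasticsearch's analyzers work internally.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(query, doc):
    # Count how many times each query term occurs in the document text.
    counts = Counter(tokenize(doc))
    return {term: counts[term] for term in tokenize(query)}

doc1 = "hello you, and world is very good"
doc2 = "hello, how are you"

tf1 = term_frequencies("hello world", doc1)
tf2 = term_frequencies("hello world", doc2)
print(tf1)  # {'hello': 1, 'world': 1}
print(tf2)  # {'hello': 1, 'world': 0}
```

doc1 has a non-zero count for both terms and doc2 for only one, which matches the ranking described above.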
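Inverse document frequency: how many times each term of the search text appears across all documents in the index. The more occurrences, the less that term contributes to relevance.
Search request: hello world
doc1: hello, today is very good
doc2: hi world, how are you
Suppose the index holds 10,000 documents, the word hello appears 1,000 times across all of them, and the word world appears 100 times.
doc2 is more relevant.
doc1 matches only hello, which is common in the index, while doc2 matches world, which is much rarer, so doc2 is more relevant and scores higher.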
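To make the numbers in this example concrete, here is a minimal sketch using the textbook formula idf = ln(N / df). Two assumptions: the occurrence counts above are treated as the number of documents containing each term, and the plain ln(N / df) form is used for illustration only; it is not the exact formula Lucene applies.

```python
import math

def idf(total_docs, docs_with_term):
    # Textbook inverse document frequency (an illustrative assumption).
    return math.log(total_docs / docs_with_term)

N = 10_000
idf_hello = idf(N, 1000)  # common term, ln(10)
idf_world = idf(N, 100)   # rarer term, ln(100)

print(round(idf_hello, 3), round(idf_world, 3))  # 2.303 4.605
```

The rarer term world gets roughly twice the weight of hello, which is why the document matching world ranks higher.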
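Field-length norm: the length of the field. The longer the field, the weaker the relevance.
Search request: hello world
doc1: { "title": "hello article", "content": "blablabla (10,000 words)" }
doc2: { "title": "my article", "content": "blablabla (10,000 words), hi world" }
Assume hello and world appear equally often across the whole index.
doc1 is more relevant, because its match is in the much shorter title field.
A short field has fewer words and therefore a lower chance of containing the term by accident, so a match there is more significant and doc1 scores higher.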
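The effect of field length can be sketched with the norm used by classic TF/IDF scoring, 1 / sqrt(number of terms in the field). The field sizes below (2 terms for the title, 10,002 for the content) are assumptions derived from the example documents, and the function name is hypothetical.

```python
import math

def field_length_norm(num_terms):
    # Classic TF/IDF style length norm: shorter fields get a larger value.
    return 1.0 / math.sqrt(num_terms)

title_norm = field_length_norm(2)         # doc1 title: "hello article"
content_norm = field_length_norm(10_002)  # doc2 content: 10,000 words + "hi world"

print(round(title_norm, 4))       # 0.7071
print(title_norm > content_norm)  # True
```

A match in the two-word title is weighted far more heavily than a match buried in a 10,000-word content field.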
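2. How _score is computed
Elasticsearch computes the score from the TF and IDF factors.
Adding explain=true to a search request shows the detailed score breakdown for every matching term. The description fields in the response spell out the idf and tf formulas that were applied.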
GET raven_index/_search?explain=true
{
"query": {
"match": {
"address": "中国陕西西安西南角色"
}
}
}
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 2.9424875,
"hits" : [
{
"_shard" : "[raven_index][0]",
"_node" : "rdTRZzVlQwKe0JWnYKyylA",
"_index" : "raven_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.9424875,
"_source" : {
"address" : "陕西西安",
"age" : 18,
"name" : "王宝强"
},
"_explanation" : {
"value" : 2.9424875,
"description" : "sum of:",
"details" : [
{
"value" : 0.9808292,
"description" : "weight(address:陕西 in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.9808292,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.98082924,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 3,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.9808292,
"description" : "weight(address:西西 in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.9808292,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.98082924,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 3,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.9808292,
"description" : "weight(address:西安 in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.9808292,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.98082924,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 3,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[raven_index][0]",
"_node" : "rdTRZzVlQwKe0JWnYKyylA",
"_index" : "raven_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.135697,
"_source" : {
"address" : "中国上海",
"age" : 20,
"name" : "王祖蓝"
},
"_explanation" : {
"value" : 1.135697,
"description" : "sum of:",
"details" : [
{
"value" : 1.135697,
"description" : "weight(address:中国 in 1) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.135697,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.98082924,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 3,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.5263158,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[raven_index][0]",
"_node" : "rdTRZzVlQwKe0JWnYKyylA",
"_index" : "raven_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.86312973,
"_source" : {
"address" : "广西南宁",
"age" : 30,
"name" : "王祖贤"
},
"_explanation" : {
"value" : 0.86312973,
"description" : "sum of:",
"details" : [
{
"value" : 0.86312973,
"description" : "weight(address:西南 in 2) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.86312973,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.98082924,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 3,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.4,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
]
}
}
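The values in this response can be reproduced by hand from the formulas printed in the description fields. The sketch below plugs in the N, n, freq, k1, b, dl and avgdl values shown above; the boost of 2.2 is copied from the output as-is, and the function names are hypothetical.

```python
import math

def bm25_idf(N, n):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def term_score(N, n, freq, dl, avgdl, boost=2.2):
    # Per-term score: product of boost, idf and tf, as in the explanation.
    return boost * bm25_idf(N, n) * bm25_tf(freq, dl, avgdl)

# _id 1 matches three terms, each with dl = 3; the total is their sum.
doc1 = 3 * term_score(N=3, n=1, freq=1.0, dl=3.0, avgdl=3.0)
# _id 2 and _id 3 match one term each, with dl = 2 and dl = 4.
doc2 = term_score(N=3, n=1, freq=1.0, dl=2.0, avgdl=3.0)
doc3 = term_score(N=3, n=1, freq=1.0, dl=4.0, avgdl=3.0)

print(doc1, doc2, doc3)
```

Up to floating-point rounding, the three results agree with the _score values 2.9424875, 1.135697 and 0.86312973 in the response. The only input that differs between the single-term matches is dl, so the shorter field scores higher.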
A detailed explanation by an expert: https://blog.csdn.net/molong1208/article/details/50623948/
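3. Analyzing how a document is matched
The _explain API shows how a specific document matches a query, or whether it matches at all.
In ES 7, specifying a type is officially discouraged, so use GET index_name/_explain/id.
In ES 5, use GET index_name/type/id/_explain.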
GET raven_index/_explain/1
{
"query": {
"match": {
"address": "中国陕西西安西南角色"
}
}
}
GET /test_index/test_type/6/_explain
{
"query": {
"match": {
"test_field": "test hello"
}
}
}
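The two URL shapes can be captured in a small helper. This is only a sketch: explain_request is a hypothetical function that builds the method, path and JSON body for the two forms shown above, and it does not send anything to a cluster.

```python
import json

def explain_request(index, doc_id, query, doc_type=None):
    # Hypothetical helper: build (method, path, body) for an _explain call.
    if doc_type is None:
        # Typeless form, as used with ES 7 above.
        path = f"/{index}/_explain/{doc_id}"
    else:
        # Typed form, as used with ES 5 above.
        path = f"/{index}/{doc_type}/{doc_id}/_explain"
    body = json.dumps({"query": query}, ensure_ascii=False)
    return "GET", path, body

es7 = explain_request("raven_index", 1, {"match": {"address": "中国陕西西安西南角色"}})
es5 = explain_request("test_index", 6, {"match": {"test_field": "test hello"}},
                      doc_type="test_type")

print(es7[1])  # /raven_index/_explain/1
print(es5[1])  # /test_index/test_type/6/_explain
```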