Scoring algorithms used by Elasticsearch

树叶要走风怎么挽留

Article link: https://blog.csdn.net/weixin_44993313/article/details/106224031

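1. Introduction to the algorithm

The relevance score algorithm, put simply, computes how closely the text in an index matches the search text.

Elasticsearch's scoring is based on the term frequency / inverse document frequency algorithm, TF/IDF for short. Since version 5.0 the default similarity is BM25, a refinement of TF/IDF, which is why the explain output in section 2 shows BM25 formulas.

Term frequency: how many times each term of the search text appears in the field text. The more occurrences, the more relevant the document.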

Search request: hello world

doc1: hello you, and world is very good
doc2: hello, how are you
doc1 contains both terms, hello and world, while doc2 contains only hello, so doc1 is more relevant and scores higher.
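As a rough illustration of the term-frequency idea above, the following Python sketch counts how often each query term occurs in the two example documents. The tokenizer (lowercasing plus a `\w+` regular expression) and the helper names `tokenize` and `term_frequencies` are assumptions made for this sketch; Elasticsearch's own analyzers are more elaborate.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified stand-in for an analyzer: lowercase, split on non-word characters.
    return re.findall(r"\w+", text.lower())


def term_frequencies(query, doc):
    # Count how many times each query term occurs in the document text.
    counts = Counter(tokenize(doc))
    return {term: counts[term] for term in tokenize(query)}


query = "hello world"
doc1 = "hello you, and world is very good"
doc2 = "hello, how are you"

tf_doc1 = term_frequencies(query, doc1)
tf_doc2 = term_frequencies(query, doc2)
print(tf_doc1, tf_doc2)
```

Summing the counts per document gives a crude ranking signal: the document with more query-term occurrences comes first.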

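Inverse document frequency: how many of the documents in the whole index each term of the search text appears in. The more documents contain a term, the less that term contributes to relevance.

Search request: hello world

doc1: hello, today is very good
doc2: hi world, how are you

Suppose the index holds 10,000 documents, and hello occurs 1,000 times across them while world occurs only 100 times.

doc2 is more relevant.
doc1 matches only hello, which is common across the index, whereas doc2 matches world, which is much rarer. A match on the rarer term counts for more, so doc2 gets the higher score.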
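To make the effect concrete, the sketch below plugs the numbers from this example into the idf formula that appears in the explain output in section 2, `log(1 + (N - n + 0.5) / (n + 0.5))`. It assumes that the counts 1000 and 100 are the numbers of documents containing each term, which is what that formula expects; the function name `idf` is chosen for this sketch.

```python
import math


def idf(n, N):
    # n: documents containing the term; N: total documents with the field.
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


N = 10_000
idf_hello = idf(1_000, N)  # common term
idf_world = idf(100, N)    # rarer term

print(f"idf(hello) = {idf_hello:.4f}")
print(f"idf(world) = {idf_world:.4f}")
```

The rarer term world receives the larger idf, which is why a match on it outweighs a match on hello.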

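Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match in it.

Search request: hello world

doc1: { "title": "hello article", "content": "babaaba (10,000 words)" }
doc2: { "title": "my article", "content": "blablabala (10,000 words), hi world" }

Assume hello and world occur equally often across the index.

doc1 is more relevant because its match is in the shorter title field.
A match in a short field such as title makes up a larger share of that field than a match buried in a 10,000-word content field, so doc1 gets the higher score.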
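The length effect can be seen in the tf formula printed in the explain output in section 2, `freq / (freq + k1 * (1 - b + b * dl / avgdl))`, where `dl` is the field length and `avgdl` the average field length. The sketch below evaluates it for a short and a long field. The field lengths and the `avgdl` value are made-up numbers for illustration, and `k1 = 1.2`, `b = 0.75` are the values shown in that output; the name `tf_norm` is chosen here.

```python
def tf_norm(freq, dl, avgdl, k1=1.2, b=0.75):
    # Term-frequency factor with length normalization, as printed by explain.
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


avgdl = 100.0  # assumed average field length, for illustration only

short_field = tf_norm(freq=1, dl=2, avgdl=avgdl)       # e.g. a two-word title
long_field = tf_norm(freq=1, dl=10_000, avgdl=avgdl)   # e.g. a very long body

print(short_field, long_field)
```

With the same single occurrence, the short field yields the larger factor, matching the intuition that a hit in a short title is worth more than one in a long body.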

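2. How _score is calculated

Elasticsearch computes the score of each hit with a TF/IDF-style similarity (BM25 in the output below).

Adding explain=true to a search request shows the detailed score breakdown for every term that matched.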

GET raven_index/_search?explain=true
{
  "query": {
    "match": {
      "address": "中国陕西西安西南角色"
    }
  }
}


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.9424875,
    "hits" : [
      {
        "_shard" : "[raven_index][0]",
        "_node" : "rdTRZzVlQwKe0JWnYKyylA",
        "_index" : "raven_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.9424875,
        "_source" : {
          "address" : "陕西西安",
          "age" : 18,
          "name" : "王宝强"
        },
        "_explanation" : {
          "value" : 2.9424875,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.9808292,
              "description" : "weight(address:陕西 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.9808292,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.9808292,
              "description" : "weight(address:西西 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.9808292,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.9808292,
              "description" : "weight(address:西安 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.9808292,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[raven_index][0]",
        "_node" : "rdTRZzVlQwKe0JWnYKyylA",
        "_index" : "raven_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.135697,
        "_source" : {
          "address" : "中国上海",
          "age" : 20,
          "name" : "王祖蓝"
        },
        "_explanation" : {
          "value" : 1.135697,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 1.135697,
              "description" : "weight(address:中国 in 1) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.135697,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.5263158,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[raven_index][0]",
        "_node" : "rdTRZzVlQwKe0JWnYKyylA",
        "_index" : "raven_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.86312973,
        "_source" : {
          "address" : "广西南宁",
          "age" : 30,
          "name" : "王祖贤"
        },
        "_explanation" : {
          "value" : 0.86312973,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.86312973,
              "description" : "weight(address:西南 in 2) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.86312973,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.4,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 4.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
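The three `_score` values in this response can be reproduced by hand from the formulas in the `description` fields. The Python sketch below is a minimal re-computation, assuming that each term's score is `boost * idf * tf` and that a document's score is the sum over its matching terms, as the explain tree indicates. The function names are chosen here and are not part of any Elasticsearch API.

```python
import math


def idf(n, N):
    # "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))"
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))"
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def term_score(freq, dl, avgdl, n, N, boost=2.2):
    # "score(freq=1.0), product of:" boost, idf, tf
    return boost * idf(n, N) * tf(freq, dl, avgdl)


# Values read from the explain output: N = 3, n = 1, avgdl = 3.0, boost = 2.2.
doc1 = 3 * term_score(freq=1, dl=3, avgdl=3, n=1, N=3)  # three matching terms
doc2 = term_score(freq=1, dl=2, avgdl=3, n=1, N=3)      # one matching term
doc3 = term_score(freq=1, dl=4, avgdl=3, n=1, N=3)      # one matching term

for name, score in (("doc1", doc1), ("doc2", doc2), ("doc3", doc3)):
    print(name, round(score, 4))
```

The only input that differs between the three documents is `dl`, plus the number of matching terms for document 1, which explains the ordering of the hits.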

A detailed walkthrough by another author: https://blog.csdn.net/molong1208/article/details/50623948/

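3. Analyzing how a document was matched

The explain API shows whether a specific document matches a query and, if so, how.

In ES 7, specifying a type is deprecated, so use GET index_name/_explain/id directly.

In ES 5, use GET index_name/type/id/_explain instead.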

GET raven_index/_explain/1
{
  "query": {
    "match": {
      "address": "中国陕西西安西南角色"
    }
  }
}

GET /test_index/test_type/6/_explain
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}
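The two URL shapes above differ only in whether a type segment is present. As a small sketch, the hypothetical Python helper below (not part of any Elasticsearch client) builds the explain path for either style from the index name, document id and optional type.

```python
def explain_path(index, doc_id, doc_type=None):
    # Typeless form: /{index}/_explain/{id}
    if doc_type is None:
        return f"/{index}/_explain/{doc_id}"
    # Typed form used by older versions: /{index}/{type}/{id}/_explain
    return f"/{index}/{doc_type}/{doc_id}/_explain"


print(explain_path("raven_index", 1))
print(explain_path("test_index", 6, doc_type="test_type"))
```

The request body is the same query object in both cases; only the path changes.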