elasticsearch评分所用到的算法

1、算法介绍

relevance /ˈreləvəns/ score算法,简单来说,就是计算出,一个索引中的文本,与搜索文本,他们之间的关联匹配程度

Elasticsearch使用的是 term frequency /ˈfriːkwənsi/ /inverse document frequency算法,简称为TF/IDF算法

Term frequency:搜索文本中的各个词条在field文本中出现了多少次,出现次数越多,就越相关

搜索请求:hello world

doc1:hello you, and world is very good
doc2:hello, how are you
doc1 中 满足 hello,word 俩个词条, doc2中仅满足hello 所以doc1越相关。分数越高

Inverse document frequency:搜索文本中的各个词条在整个索引的所有文档中出现了多少次,出现的次数越多,就越不相关

搜索请求:hello world

doc1:hello, today is very good
doc2:hi world, how are you

比如说,在index中有1万条document,hello这个单词在所有的document中,一共出现了1000次;world这个单词在所有的document中,一共出现了100次

doc2更相关
因为doc1中满足条件的是 hello 而hello出现的概率较大,world在整个文档中出现的概率更小一些,所以doc2更加相关 doc2的分就越高

Field-length norm:field长度,field越长,相关度越弱

搜索请求:hello world

doc1:{ “title”: “hello article”, “content”: “babaaba 1万个单词” }
doc2:{ “title”: “my article”, “content”: “blablabala 1万个单词,hi world” }

hello world在整个index中出现的次数是一样多的

doc1更相关,title field更短
同理,因为doc1中title字段中的文字更少,满足条件的概率也就更低,相关性就越高,所以doc1的分越高

2、_score是如何被计算出来的

elasticsearch中通过TF&IDF算法算出相应的评分信息

通过explain=true 可以查看满足搜索条件的词条详细得分情况

GET raven_index/_search?explain=true
{
  "query": {
    "match": {
      "address": "中国陕西西安西南角色"
    }
  }
}


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.9424875,
    "hits" : [
      {
        "_shard" : "[raven_index][0]",
        "_node" : "rdTRZzVlQwKe0JWnYKyylA",
        "_index" : "raven_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.9424875,
        "_source" : {
          "address" : "陕西西安",
          "age" : 18,
          "name" : "王宝强"
        },
        "_explanation" : {
          "value" : 2.9424875,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.9808292,
              "description" : "weight(address:陕西 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.9808292,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.9808292,
              "description" : "weight(address:西西 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.9808292,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.9808292,
              "description" : "weight(address:西安 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.9808292,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[raven_index][0]",
        "_node" : "rdTRZzVlQwKe0JWnYKyylA",
        "_index" : "raven_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.135697,
        "_source" : {
          "address" : "中国上海",
          "age" : 20,
          "name" : "王祖蓝"
        },
        "_explanation" : {
          "value" : 1.135697,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 1.135697,
              "description" : "weight(address:中国 in 1) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.135697,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.5263158,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[raven_index][0]",
        "_node" : "rdTRZzVlQwKe0JWnYKyylA",
        "_index" : "raven_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.86312973,
        "_source" : {
          "address" : "广西南宁",
          "age" : 30,
          "name" : "王祖贤"
        },
        "_explanation" : {
          "value" : 0.86312973,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.86312973,
              "description" : "weight(address:西南 in 2) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.86312973,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.4,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 4.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 3.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

大神详解 :https://blog.csdn.net/molong1208/article/details/50623948/

3、分析一个document是如何被匹配上的

通过查看指定的文档是如何被匹配上的,是否会被匹配

es.7版本 官方已经不建议指定type 所以直接使用 get 索引名/_explain/id 进行判断

es 5版本 则通过 get 索引名/类型/id/_explain进行查看

GET raven_index/_explain/1
{
  "query": {
    "match": {
      "address": "中国陕西西安西南角色"
    }
  }
}

GET /test_index/test_type/6/_explain
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}
  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值