Elasticsearch Learning Pitfalls: Why Identical Data Gets Different Scores

水晶果冻1125

Category: Elasticsearch. Tags: Elasticsearch, relevance scoring, same data different scores, dfs_query_then_fetch

Article link: https://blog.csdn.net/m0_37617778/article/details/102579643

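I have recently been learning Elasticsearch. While running a match query, I noticed that documents with identical field content came back with different scores, which puzzled me. Here is the example, starting with all the documents in the index: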

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "Li",
          "last_name" : "Haijing",
          "age" : "35",
          "about" : "I like to shopping foods",
          "interests" : [
            "forestry"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "Douglas",
          "last_name" : "Fir",
          "age" : 35,
          "about" : "I like to build cabinets",
          "interests" : [
            "forestry"
          ]
        }
      }
    ]
  }
}

The documents with id 1 and id 2 both have "last_name" set to "Smith", so I searched on last_name:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

The result is as follows:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

As the result shows, the same field value comes back with different scores:

id 2 0.6931472

id 1 0.2876821

This is clearly not what we would expect, so I set out to find the cause. The explanation below draws on the following article:

https://www.jianshu.com/p/c7529b98993e

Default search type: `query then fetch`

By default, ES uses a search type called Query then fetch. It works as follows:

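1. The query is sent to each shard.
2. Each shard finds all matching documents and scores them using its local TF-IDF information.
3. A priority queue is built for the results (sorting, pagination, and so on).
4. Metadata about the results is returned to the requesting node. Note that the actual documents are not sent yet, only the scores.
5. The scores from all shards are merged and sorted on the requesting node, and documents are selected according to the query criteria.
6. Finally, the actual documents are retrieved from the individual shards where they reside.
7. The results are returned to the user.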
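The steps above can be sketched as a small simulation. This is only an illustration, not how ES is implemented: the shard layout in `SHARDS` is hypothetical, the function names are made up for this example, and the scoring assumes the BM25 formula with the default parameters used by ES 5.x/6.x rather than classic TF-IDF.

```python
import math

K1, B = 1.2, 0.75  # assumed BM25 defaults (Lucene 7 / ES 6.x)

# Hypothetical layout: shard id -> {doc id: last_name}. The real placement is unknown.
SHARDS = {
    0: {"1": "Smith"},
    1: {"2": "Smith", "4": "Haijing"},
    2: {"3": "Fir"},
    3: {},
    4: {},
}

def bm25(freq, doc_len, avg_len, doc_count, doc_freq):
    """BM25 score of one term in one field."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = freq * (K1 + 1) / (freq + K1 * (1 - B + B * doc_len / avg_len))
    return idf * tf_norm

def query_phase(shard_docs, term):
    """Steps 2-4: score matches using only this shard's own statistics."""
    tokens = {doc_id: text.lower().split() for doc_id, text in shard_docs.items()}
    doc_count = len(tokens)
    doc_freq = sum(1 for t in tokens.values() if term in t)
    if doc_freq == 0:
        return []
    avg_len = sum(len(t) for t in tokens.values()) / doc_count
    return [
        (bm25(t.count(term), len(t), avg_len, doc_count, doc_freq), doc_id)
        for doc_id, t in tokens.items()
        if term in t
    ]

def search(term, size=10):
    # Step 1 and 4: every shard is queried and returns only (score, id) pairs.
    candidates = []
    for shard_id, docs in SHARDS.items():
        candidates += [(score, doc_id, shard_id) for score, doc_id in query_phase(docs, term)]
    # Step 5: the requesting node merges and sorts the scores.
    top = sorted(candidates, key=lambda c: -c[0])[:size]
    # Steps 6-7: fetch the actual documents from the shards that hold them.
    return [(doc_id, round(score, 7), SHARDS[shard_id][doc_id]) for score, doc_id, shard_id in top]

results = search("smith")
print(results)
```

With this layout, document 2 is scored against a shard holding two documents and document 1 against a shard holding one, so the two "Smith" hits end up with different scores even though the field values are identical.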

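This system generally works well. In most cases your index has enough documents to even out the local TF-IDF statistics. So although each shard does not necessarily have complete frequency information for the whole cluster, the results are still good enough, because the frequencies are roughly similar everywhere.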

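In the query example from the beginning of this post, however, the default search type falls short. (Note: my index holds only four documents but has 5 shards, so every search produces skewed scores.)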

dfs query then fetch

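The index here uses 5 shards (the default in ES versions before 7.0), so each shard holds at most one or two documents (ES routes documents by hash to spread them evenly). When we ask ES to compute scores, each shard has only a very narrow view of the four documents, so the scores are inaccurate.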
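A worked calculation shows where the two scores could come from. It rests on assumptions that the source does not state: that the cluster uses the default BM25 similarity of ES 5.x/6.x (the text speaks of TF-IDF), that the term-frequency factor equals 1 because every last_name is a single token, and that document 2 shares its shard with one other document while document 1 sits alone on its shard. Here N is the number of documents on the shard and n is the number of those containing "smith".

```latex
\mathrm{idf}(N, n) = \ln\!\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)

\text{shard of doc 2 } (N = 2,\ n = 1):\quad
\ln\!\left(1 + \frac{1.5}{1.5}\right) = \ln 2 \approx 0.6931472

\text{shard of doc 1 } (N = 1,\ n = 1):\quad
\ln\!\left(1 + \frac{0.5}{1.5}\right) = \ln\tfrac{4}{3} \approx 0.2876821
```

Under these assumptions the difference comes entirely from how many other documents happen to live on the same shard, not from the documents themselves.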

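Fortunately, ES does not leave you stranded. If you run into this kind of scoring skew, ES provides a search type called "DFS Query Then Fetch". The process is essentially the same as Query Then Fetch, except that it first runs a pre-query to compute the frequencies across all documents.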

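1. A pre-query asks each shard for its term frequencies and document frequencies.
2. The query is sent to each shard.
3. Each shard finds all matching documents and scores them using the global Term Frequency/Inverse Document Frequency information.
4. A priority queue is built for the results (sorting, pagination, and so on).
5. Metadata about the results is returned to the requesting node. Note that the actual documents are not sent yet, only the scores.
6. The scores from all shards are merged and sorted on the requesting node, and documents are selected according to the query criteria.
7. Finally, the actual documents are retrieved from the individual shards where they reside.
8. The results are returned to the user.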
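The extra pre-query step can be sketched as follows. As before, this is a simplified illustration under stated assumptions: the shard layout in `SHARDS` is hypothetical, the helper names are invented for this example, and scoring uses BM25 with the ES 5.x/6.x default parameters.

```python
import math

K1, B = 1.2, 0.75  # assumed BM25 defaults (Lucene 7 / ES 6.x)

# Hypothetical layout: shard id -> {doc id: last_name}. The real placement is unknown.
SHARDS = {
    0: {"1": "Smith"},
    1: {"2": "Smith", "4": "Haijing"},
    2: {"3": "Fir"},
    3: {},
    4: {},
}

def local_stats(shard_docs, term):
    """Pre-query: one shard reports its raw statistics for the term."""
    token_lists = [text.lower().split() for text in shard_docs.values()]
    return {
        "doc_count": len(token_lists),
        "doc_freq": sum(1 for tokens in token_lists if term in tokens),
        "total_len": sum(len(tokens) for tokens in token_lists),
    }

def global_stats(term):
    """The requesting node sums the per-shard statistics."""
    total = {"doc_count": 0, "doc_freq": 0, "total_len": 0}
    for docs in SHARDS.values():
        for key, value in local_stats(docs, term).items():
            total[key] += value
    return total

def score(text, term, stats):
    """BM25 score computed from the shared, index-wide statistics."""
    tokens = text.lower().split()
    freq = tokens.count(term)
    if freq == 0:
        return 0.0
    avg_len = stats["total_len"] / stats["doc_count"]
    idf = math.log(1 + (stats["doc_count"] - stats["doc_freq"] + 0.5) / (stats["doc_freq"] + 0.5))
    tf_norm = freq * (K1 + 1) / (freq + K1 * (1 - B + B * len(tokens) / avg_len))
    return idf * tf_norm

stats = global_stats("smith")
dfs_scores = {}
for docs in SHARDS.values():
    for doc_id, text in docs.items():
        value = score(text, "smith", stats)
        if value > 0:
            dfs_scores[doc_id] = round(value, 7)
print(dfs_scores)
```

Because every shard now scores against the same totals (4 documents, 2 of them containing the term), both matching documents receive the same score regardless of which shard they live on. The price is one extra round trip to every shard before the query itself runs.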

If we use this new search type, the returned scores are identical:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}
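The post does not show the request that produced the response above. A minimal sketch of how such a request might be built is given here, assuming the search type is passed as the `search_type=dfs_query_then_fetch` URL parameter. The base URL `http://localhost:9200` and the helper name `build_dfs_search` are assumptions for illustration; the index, type, and field names come from the earlier query.

```python
import json
import urllib.request

def build_dfs_search(base_url, index, doc_type, field, value):
    """Build (but do not send) a match query that uses dfs_query_then_fetch."""
    url = f"{base_url}/{index}/{doc_type}/_search?search_type=dfs_query_then_fetch"
    body = json.dumps({"query": {"match": {field: value}}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed local cluster address; adjust to your environment.
request = build_dfs_search("http://localhost:9200", "megacorp", "employee", "last_name", "Smith")
print(request.full_url)

# To send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The request body is the same match query as before; only the URL parameter changes, which makes it easy to switch the search type on for small indices and leave the default in place elsewhere.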