Elasticsearch Official Reference Translation Series (22): Top Hits Aggregation

_silverBlack

Original article: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

This is a translation of the official Elasticsearch reference, kept partly as a personal memo. The Elasticsearch version is 7.5.

The translated text follows. The original is at:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

==================================================================================================

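Top Hits Aggregation

A top_hits metric aggregator keeps track of the most relevant documents being aggregated. It is intended to be used as a sub-aggregator, so that the top matching documents can be aggregated per bucket.

The top_hits aggregator can effectively be used to group a result set by certain fields via a bucket aggregator. One or more bucket aggregators determine by which properties the result set is sliced.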

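Options

from: the offset from the first result you want to fetch.

size: the maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.

sort: how the top matching hits should be sorted. By default the hits are sorted by the score of the main query.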
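To make the three options concrete, here is a minimal Python sketch that assembles the body of a top_hits aggregation. The helper name top_hits_options is hypothetical, and the date field used in the sort is only an assumed field name; the defaults mirror the ones described above (offset 0, three hits, no explicit sort so that the query score is used).

```python
def top_hits_options(from_=0, size=3, sort=None):
    """Build a top_hits aggregation body from the from/size/sort options.

    This is an illustrative helper, not part of any Elasticsearch client.
    """
    body = {"from": from_, "size": size}
    # Leaving sort out keeps the default ordering by main-query score.
    if sort is not None:
        body["sort"] = sort
    return {"top_hits": body}


# Skip the first two hits of each bucket, then return the next two, newest first.
page_two = top_hits_options(from_=2, size=2, sort=[{"date": {"order": "desc"}}])
```

The resulting dictionary can be placed under any aggregation name inside a bucket aggregation's aggs section.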

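Supported per-hit features

The top_hits aggregation returns regular search hits, so it supports many per-hit features. (Most of them are query-related and will be translated later in this series; a few are script-related and are not covered here.)

1) Highlighting

2) Explain

3) Named filters and queries

4) Source filtering

5) Stored fields

6) Script fields

7) Doc value fields

8) Include versions

9) Include sequence numbers and primary terms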

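Example

In the following example we group the sales by type and, for each type, show the last sale. For each sale only the date and price fields are included in the source.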

curl -X POST 'http://host_ip:host_port/sales/_search?pretty' \
-H 'content-type: application/json' \
-d '{
    "aggs": {
        "top_tags": {
            "terms": {
                "field": "type",
                "size": 3
            },
            "aggs": {
                "top_sales_hits": {
                    "top_hits": {
                        "sort": [
                            {
                                "date": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "_source": {
                            "includes": ["date", "price"]
                        },
                        "size": 1
                    }
                }
            }
        }
    }
}'

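Let's break down the request above. It contains two aggregations: a terms aggregation and a top_hits aggregation. The terms aggregation is a bucket aggregation, covered later in this series; the top_hits aggregation is the subject of this article and is a metric aggregation.

Recall the statement from the beginning: the top_hits metric aggregator is intended to be used as a sub-aggregator so that the top matching documents can be aggregated per bucket, and one or more bucket aggregators determine by which properties the result set is sliced.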

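Applied to this request: the terms aggregation first groups the documents in the sales index by the type field, and its size parameter limits the result to three buckets. The top_hits sub-aggregation then runs inside each of those buckets and returns the date and price fields of the single top document (size=1), where "top" is decided by sorting on date in descending order, so the most recent sale comes first.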
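The same grouping logic can be sketched in plain Python over an in-memory list. This is only a model of what the two aggregations compute, not how Elasticsearch executes them; the function name latest_sale_per_type and the sample records are made up for illustration, and breaking bucket ties by key is a simplifying assumption.

```python
from collections import defaultdict


def latest_sale_per_type(sales, bucket_count=3):
    """Model terms(type, size=bucket_count) with a top_hits(size=1, date desc) sub-aggregation."""
    buckets = defaultdict(list)
    for doc in sales:
        buckets[doc["type"]].append(doc)

    # terms step: keep the buckets holding the most documents
    # (ties broken by key here, which is an assumption of this sketch).
    ordered = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:bucket_count]

    result = []
    for key, docs in ordered:
        # top_hits step: newest document only, restricted to date and price.
        newest = max(docs, key=lambda d: d["date"])
        result.append({
            "key": key,
            "doc_count": len(docs),
            "top_hit": {"date": newest["date"], "price": newest["price"]},
        })
    return result


# Made-up sample data; dates in this format sort correctly as strings.
sales = [
    {"type": "hat", "date": "2015/01/01", "price": 120},
    {"type": "hat", "date": "2015/03/01", "price": 200},
    {"type": "bag", "date": "2015/01/01", "price": 150},
    {"type": "t-shirt", "date": "2015/02/01", "price": 10},
    {"type": "t-shirt", "date": "2015/03/01", "price": 175},
]
summary = latest_sale_per_type(sales)
```

Each entry in summary corresponds to one bucket and carries the single newest sale for that type.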

The curl request above returns a response similar to the following:

{
  ...
  "aggregations": {
    "top_tags": {
       "doc_count_error_upper_bound": 0,
       "sum_other_doc_count": 0,
       "buckets": [
          {
             "key": "hat",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total" : {
                       "value": 3,
                       "relation": "eq"
                   },
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmauCQpcRyxw6ChK",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 200
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "t-shirt",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total" : {
                       "value": 3,
                       "relation": "eq"
                   },
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmauCQpcRyxw6ChL",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 175
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "bag",
             "doc_count": 1,
             "top_sales_hits": {
                "hits": {
                   "total" : {
                       "value": 1,
                       "relation": "eq"
                   },
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmatCQpcRyxw6ChH",
                         "_source": {
                            "date": "2015/01/01 00:00:00",
                            "price": 150
                         },
                         "sort": [
                            1420070400000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          }
       ]
    }
  }
}

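The response indeed contains only three types of sales (three buckets: hat, t-shirt and bag), with 3, 3 and 1 sales records respectively. The top_hits sub-aggregation shows that the latest hat, t-shirt and bag sales are dated 2015/03/01 00:00:00, 2015/03/01 00:00:00 and 2015/01/01 00:00:00, with prices of 200, 175 and 150.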
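Reading those values out of the response by hand gets tedious, so here is a small Python sketch that walks the bucket structure. The function name summarize_buckets is hypothetical, and the sample dictionary is a trimmed copy of the response above that keeps only the keys the function reads.

```python
def summarize_buckets(response, terms_name="top_tags", hits_name="top_sales_hits"):
    """Return (key, doc_count, date, price) for each bucket of the terms aggregation."""
    rows = []
    for bucket in response["aggregations"][terms_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        # With size=1 there is at most one hit per bucket.
        source = hits[0]["_source"] if hits else {}
        rows.append((bucket["key"], bucket["doc_count"], source.get("date"), source.get("price")))
    return rows


# Trimmed sample of the response shown above (two of the three buckets).
response = {
    "aggregations": {
        "top_tags": {
            "buckets": [
                {
                    "key": "hat",
                    "doc_count": 3,
                    "top_sales_hits": {"hits": {"hits": [
                        {"_source": {"date": "2015/03/01 00:00:00", "price": 200}}
                    ]}},
                },
                {
                    "key": "bag",
                    "doc_count": 1,
                    "top_sales_hits": {"hits": {"hits": [
                        {"_source": {"date": "2015/01/01 00:00:00", "price": 150}}
                    ]}},
                },
            ]
        }
    }
}
rows = summarize_buckets(response)
```

The aggregation names are passed as parameters because they are chosen by whoever writes the request.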

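Field collapse example

Field collapsing, or result grouping, is a feature that logically groups a result set and returns the top documents of each group. The order of the groups is determined by the relevance of the first document in each group. In Elasticsearch this can be implemented with a bucket aggregator that wraps a top_hits aggregator as a sub-aggregator.

In the example below we search across crawled webpages. For each webpage we store the body and the domain the page belongs to. By defining a terms aggregator on the domain field we group the result set of webpages by domain. The top_hits aggregator is then defined as a sub-aggregator, so that the top matching hits are collected per bucket.

The order option of the terms aggregator uses the max aggregator defined alongside it to return the buckets ordered by the relevance of the most relevant document in each bucket.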

curl -X POST 'http://host_ip:host_port/sales/_search?pretty' \
-H 'content-type: application/json' \
-d '{
    "query": {
        "match": {
            "body": "elections"
        }
    },
    "aggs": {
        "top_sites": {
            "terms": {
                "field": "domain",
                "order": {
                    "top_hit": "desc"
                }
            },
            "aggs": {
                "top_tag_hits": {
                    "top_hits": {}
                },
                "top_hit": {
                    "max": {
                        "script": {
                            "source": "_score"
                        }
                    }
                }
            }
        }
    }
}'

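At the moment the max (or min) aggregator is needed to make sure the buckets from the terms aggregator are ordered by the score of the most relevant webpage per domain. Unfortunately, the top_hits aggregator cannot yet be used in the order option of the terms aggregator.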
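The effect of ordering buckets by a per-bucket max score can be modelled in a few lines of Python. This is a sketch of the grouping semantics only; the function name collapse_by_domain, the score values and the example domains are all invented for illustration.

```python
from collections import defaultdict


def collapse_by_domain(hits, per_bucket=3):
    """Group scored hits by domain and order the groups by their best score."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["domain"]].append(hit)

    buckets = []
    for domain, docs in groups.items():
        # top_hits step: most relevant documents of the bucket first.
        docs.sort(key=lambda h: h["score"], reverse=True)
        buckets.append({
            "key": domain,
            "max_score": docs[0]["score"],  # plays the role of the max sub-aggregation
            "top_hits": docs[:per_bucket],
        })

    # terms "order" step: buckets sorted by the max score, descending.
    buckets.sort(key=lambda b: b["max_score"], reverse=True)
    return buckets


# Invented sample hits with made-up relevance scores.
pages = [
    {"domain": "a.example", "url": "/1", "score": 0.4},
    {"domain": "b.example", "url": "/2", "score": 0.9},
    {"domain": "a.example", "url": "/3", "score": 0.7},
    {"domain": "c.example", "url": "/4", "score": 0.1},
]
collapsed = collapse_by_domain(pages)
```

The separate max_score value is what makes the bucket ordering possible, which is the job the max aggregator does in the request above.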

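top_hits support in a nested or reverse_nested aggregator

If the top_hits aggregator is wrapped in a nested or reverse_nested aggregator, nested hits are returned. Nested hits are in a sense hidden mini documents that are part of a regular document whose mapping configures a nested field type. The top_hits aggregator can un-hide these documents when it is wrapped in a nested or reverse_nested aggregator. See the nested type mapping documentation for more about nested fields.

If a nested type has been configured, a single document is actually indexed as multiple Lucene documents that share the same id. To determine the identity of a nested hit, more than the id is needed, which is why nested hits also include their nested identity. The nested identity is kept under the _nested field in the search hit and includes the array field and the offset in the array field that the nested hit belongs to. The offset is zero-based.

Let's look at a real example. Consider the following mapping (comments is an array that holds nested documents under the product object):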

curl -X PUT 'http://host_ip:host_port/sales/_mappings?pretty' \
-H 'Content-Type: application/json' \
-d '
{
    "properties" : {
        "tags" : { 
            "type" : "keyword" 
        },
        "comments" : { 
            "type" : "nested",
            "properties" : {
                "username" : { 
                    "type" : "keyword" 
                },
                "comment" : { 
                    "type" : "text" 
                }
            }
        }
    }
}'

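The mapping shows that each document in the sales index has a tags field of type keyword and a comments field of type nested; comments in turn contains a username field of type keyword and a comment field of type text. The difference between keyword and text is analysis: keyword values are indexed as-is without being analyzed, while text values are analyzed (tokenized). Whether a field is analyzed affects query results.

Index a document: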

curl -X PUT 'http://host_ip:host_port/sales/_doc/1?refresh&pretty' \
-H 'Content-Type: application/json' \
-d '
{
    "tags": ["car", "auto"],
    "comments": [
        {
            "username": "baddriver007", 
            "comment": "This car could have better brakes"
        },
        {
            "username": "dr_who", 
            "comment": "Where\u0027s the autopilot? Can\u0027t find it"
        },
        {
            "username": "ilovemotorbikes", 
            "comment": "This car has two extra wheels"
        }
    ]
}'

Now we can execute the following top_hits aggregation (wrapped in a nested aggregation):

curl -X POST 'http://host_ip:host_port/sales/_search?pretty' \
-H 'content-type: application/json' \
-d '{
    "query": {
        "term": {
            "tags": "car"
        }
    },
    "aggs": {
        "by_sale": {
            "nested": {
                "path": "comments"
            },
            "aggs": {
                "by_user": {
                    "terms": {
                        "field": "comments.username",
                        "size": 1
                    },
                    "aggs": {
                        "by_nested": {
                            "top_hits": {}
                        }
                    }
                }
            }
        }
    }
}'

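Breaking down the request above: a term query first matches documents whose tags field contains car. These documents then go through a nested aggregation named by_sale (the path option names comments as the nested field). Inside by_sale there is a terms aggregation named by_user, which groups on the comments.username field and returns a single bucket, and within that bucket the top_hits aggregation named by_nested is run.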
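As a rough mental model, the nested aggregation turns each element of the comments array into its own hit and remembers where it came from. The Python sketch below illustrates that idea on the document indexed earlier; it skips the term query and scoring entirely, the function name top_nested_hits is hypothetical, and breaking ties between equally sized buckets by key is an assumption of the sketch.

```python
from collections import defaultdict


def top_nested_hits(docs, path="comments", key_field="username", bucket_count=1):
    """Model nested(path) > terms(key_field, size=bucket_count) > top_hits."""
    buckets = defaultdict(list)
    for doc in docs:
        # nested step: every array element becomes a separate hit that
        # records its parent id plus the array field and offset it came from.
        for offset, nested in enumerate(doc.get(path, [])):
            buckets[nested[key_field]].append({
                "_id": doc["_id"],
                "_nested": {"field": path, "offset": offset},
                "_source": nested,
            })

    # terms step: largest buckets first, ties broken by key (an assumption).
    ordered = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:bucket_count]
    return [{"key": key, "doc_count": len(hits), "hits": hits} for key, hits in ordered]


doc = {
    "_id": "1",
    "tags": ["car", "auto"],
    "comments": [
        {"username": "baddriver007", "comment": "This car could have better brakes"},
        {"username": "dr_who", "comment": "Where's the autopilot? Can't find it"},
        {"username": "ilovemotorbikes", "comment": "This car has two extra wheels"},
    ],
}
result = top_nested_hits([doc])
```

With one bucket requested, only the first username survives, and its hit carries both the parent document id and the position of the comment in the array.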

Below is a snippet of the top hits response, containing a nested hit that resides in the first slot of the array field comments:

{
  ...
  "aggregations": {
    "by_sale": {
      "by_user": {
        "buckets": [
          {
            "key": "baddriver007",
            "doc_count": 1,
            "by_nested": {
              "hits": {
                "total" : {
                   "value": 1,
                   "relation": "eq"
                },
                "max_score": 0.3616575,
                "hits": [
                  {
                    "_index": "sales",
                    "_type" : "_doc",
                    "_id": "1",
                    "_nested": {
                      "field": "comments",  
                      "offset": 0 
                    },
                    "_score": 0.3616575,
                    "_source": {
                      "comment": "This car could have better brakes", 
                      "username": "baddriver007"
                    }
                  }
                ]
              }
            }
          }
          ...
        ]
      }
    }
  }
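}

In the response above, "field": "comments" is the name of the array field containing the nested hit; "offset": 0 is the position of the nested hit in the containing array; and "_source" is the source of the nested hit.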

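If _source is requested, only the part of the source belonging to the nested object is returned, not the entire source of the document. Stored fields at the nested inner object level are also accessible via a top_hits aggregator residing in a nested or reverse_nested aggregator.

Only nested hits have a _nested field in the response; non-nested (regular) hits do not.

If _source is not enabled, the information in _nested can also be used to parse the original source somewhere else.

If multiple levels of nested object types are defined in the mapping, the _nested information can also be hierarchical in order to express the identity of nested hits that are two or more levels deep.

In the example below, a nested hit resides in the first slot of the field nested_grand_child_field, which in turn resides in the second slot of the field nested_child_field: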

...
"hits": {
 "total" : {
     "value": 2565,
     "relation": "eq"
 },
 "max_score": 1,
 "hits": [
   {
     "_index": "a",
     "_type": "b",
     "_id": "1",
     "_score": 1,
     "_nested" : {
       "field" : "nested_child_field",
       "offset" : 1,
       "_nested" : {
         "field" : "nested_grand_child_field",
         "offset" : 0
       }
     },
     "_source": ...
   },
   ...
 ]
}
...
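Since the _nested identity is just a chain of field names and offsets, it can be followed back into a parent document's source. The Python sketch below shows one way to do that for both the single-level and the hierarchical case. The function name resolve_nested and the sample parent document are hypothetical, and the sketch assumes each field value is a direct key of the enclosing object (dotted paths are not handled).

```python
def resolve_nested(source, nested):
    """Follow a (possibly hierarchical) _nested identity into the parent source.

    Assumes every "field" is a direct key of the current object.
    """
    current = source
    while nested is not None:
        # Step into the array field, then pick the element at the zero-based offset.
        current = current[nested["field"]][nested["offset"]]
        nested = nested.get("_nested")
    return current


# Hypothetical parent document with two levels of nesting.
parent = {
    "nested_child_field": [
        {"nested_grand_child_field": [{"value": "a"}]},
        {"nested_grand_child_field": [{"value": "b"}, {"value": "c"}]},
    ]
}
# Same shape as the _nested block in the snippet above.
identity = {
    "field": "nested_child_field",
    "offset": 1,
    "_nested": {"field": "nested_grand_child_field", "offset": 0},
}
grand_child = resolve_nested(parent, identity)
```

For a regular hit without a _nested field, passing hit.get("_nested") yields None and the function simply returns the source unchanged.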
