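Elasticsearch Aggregations (Part 4)
Bucket aggregations
Geo-distance
The geo-distance aggregation is a multi-bucket aggregation that works on geo_point fields and is conceptually very similar to the range aggregation. The user defines an origin point and a set of distance ranges, one per bucket. The aggregation computes the distance from the origin to each document's value and assigns the document to the bucket whose distance range contains that distance. Put more intuitively: pick a point on a map; documents 0-1 km from that point go into one bucket, documents 1-2 km away into a second, and documents 2-3 km away into a third.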
Index the test data:
curl -X PUT "localhost:9200/museums?pretty" -H 'Content-Type: application/json' -d'
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
'
curl -X POST "localhost:9200/museums/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d\u0027Orsay"}
'
Example query:
curl -X POST "localhost:9200/museums/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
"aggs": {
"rings_around_amsterdam": {
"geo_distance": {
"field": "location",
"origin": "52.3760, 4.894",
"ranges": [
{ "to": 100000 },
{ "from": 100000, "to": 300000 },
{ "from": 300000 }
]
}
}
}
}
'
Response:
{
...
"aggregations": {
"rings_around_amsterdam": {
"buckets": [
{
"key": "*-100000.0",
"from": 0.0,
"to": 100000.0,
"doc_count": 3
},
{
"key": "100000.0-300000.0",
"from": 100000.0,
"to": 300000.0,
"doc_count": 1
},
{
"key": "300000.0-*",
"from": 300000.0,
"doc_count": 2
}
]
}
}
}
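The doc_count values in the response above can be reproduced with a minimal Python sketch. It assumes a haversine great-circle distance on a spherical Earth, which only approximates whatever distance calculation Elasticsearch performs internally; the function names are illustrative and the coordinates are the six museums from the test data.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def ring_counts(origin, points, ranges):
    """Count the points whose distance from origin lies in each [from, to) range."""
    counts = [0] * len(ranges)
    for lat, lon in points:
        distance = haversine_m(origin[0], origin[1], lat, lon)
        for i, (lo, hi) in enumerate(ranges):
            if lo <= distance < hi:
                counts[i] += 1
    return counts


origin = (52.3760, 4.894)
museums = [
    (52.374081, 4.912350),  # NEMO Science Museum
    (52.369219, 4.901618),  # Museum Het Rembrandthuis
    (52.371667, 4.914722),  # Nederlands Scheepvaartmuseum
    (51.222900, 4.405200),  # Letterenhuis
    (48.861111, 2.336389),  # Musée du Louvre
    (48.860000, 2.327000),  # Musée d'Orsay
]
ranges = [(0, 100_000), (100_000, 300_000), (300_000, math.inf)]

print(ring_counts(origin, museums, ranges))  # [3, 1, 2]
```

The three Amsterdam museums are roughly 1 km from the origin, Antwerp is about 130 km away, and Paris about 430 km, which is why the buckets come out as 3, 1, and 2.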
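The specified field must be of type geo_point. It can also hold an array of geo_point values, in which case all of them are taken into account during aggregation. The origin accepts every format supported by the geo_point type:
Object format: { "lat": 52.3760, "lon": 4.894 }, the most explicit about which value is the latitude and which is the longitude.
String format: "52.3760, 4.894", where the first number is the latitude and the second is the longitude.
Array format: [4.894, 52.3760], which follows the GeoJSON convention of longitude first, then latitude.
By default the distance unit is m (meters), but mi (miles), in (inches), yd (yards), km (kilometers), cm (centimeters), and mm (millimeters) are also accepted. For example, using kilometers: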
curl -X POST "localhost:9200/museums/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
"aggs": {
"rings": {
"geo_distance": {
"field": "location",
"origin": "52.3760, 4.894",
"unit": "km",
"ranges": [
{ "to": 100 },
{ "from": 100, "to": 300 },
{ "from": 300 }
]
}
}
}
}
'
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array:
curl -X POST "localhost:9200/museums/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
"aggs": {
"rings_around_amsterdam": {
"geo_distance": {
"field": "location",
"origin": "52.3760, 4.894",
"ranges": [
{ "to": 100000 },
{ "from": 100000, "to": 300000 },
{ "from": 300000 }
],
"keyed": true
}
}
}
}
'
Response:
{
...
"aggregations": {
"rings_around_amsterdam": {
"buckets": {
"*-100000.0": {
"from": 0.0,
"to": 100000.0,
"doc_count": 3
},
"100000.0-300000.0": {
"from": 100000.0,
"to": 300000.0,
"doc_count": 1
},
"300000.0-*": {
"from": 300000.0,
"doc_count": 2
}
}
}
}
}
You can also customize the key for each range:
curl -X POST "localhost:9200/museums/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
"aggs": {
"rings_around_amsterdam": {
"geo_distance": {
"field": "location",
"origin": "52.3760, 4.894",
"ranges": [
{ "to": 100000, "key": "first_ring" },
{ "from": 100000, "to": 300000, "key": "second_ring" },
{ "from": 300000, "key": "third_ring" }
],
"keyed": true
}
}
}
}
'
Response:
{
...
"aggregations": {
"rings_around_amsterdam": {
"buckets": {
"first_ring": {
"from": 0.0,
"to": 100000.0,
"doc_count": 3
},
"second_ring": {
"from": 100000.0,
"to": 300000.0,
"doc_count": 1
},
"third_ring": {
"from": 300000.0,
"doc_count": 2
}
}
}
}
}
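Global aggregation
Defines a single bucket containing all documents within the search execution context. This context is defined by the indices and document types you are searching, but it is not affected by the search query itself.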
GET my-index-000001/_search
{
"query": {
"match": {
"name": "li"
}
},
"aggs": {
"all_avg": {
"global": {},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
}
}
},
"query_age_avg":{
"avg": {
"field": "age"
}
}
}
}
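In the example above, the all_avg aggregation is not affected by the query clause because it declares "global": {}, so its age_avg sub-aggregation averages over every document in the index. The query_age_avg aggregation, by contrast, is affected by the query because it does not declare "global": {}, so it averages only over the matching documents.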
Response:
{
...
"aggregations" : {
"query_age_avg" : {
"value" : 27.5
},
"all_avg" : {
"doc_count" : 4,
"age_avg" : {
"value" : 29.5
}
}
}
}
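The difference between the two averages can be sketched in plain Python. The documents below are hypothetical (names and ages are invented so that the numbers line up with the response above), and matches() is only a crude stand-in for a match query, not how Elasticsearch analyzes text.

```python
from statistics import mean

# Hypothetical documents; names and ages are illustrative only.
docs = [
    {"name": "li lei", "age": 26},
    {"name": "li hua", "age": 29},
    {"name": "wang fang", "age": 29},
    {"name": "zhao min", "age": 34},
]


def matches(doc, term):
    """Crude stand-in for a match query: whole-token match on the name field."""
    return term in doc["name"].split()


def search(all_docs, term):
    """Return one average scoped to the query hits and one scoped globally."""
    hits = [d for d in all_docs if matches(d, term)]
    return {
        # Normal aggregation: only sees documents matched by the query.
        "query_age_avg": {"value": mean(d["age"] for d in hits)},
        # Global aggregation: ignores the query and sees every document.
        "all_avg": {
            "doc_count": len(all_docs),
            "age_avg": {"value": mean(d["age"] for d in all_docs)},
        },
    }


result = search(docs, "li")
print(result["query_age_avg"]["value"])       # 27.5
print(result["all_avg"]["age_avg"]["value"])  # 29.5
```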
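Histogram aggregation
A multi-bucket value-source aggregation that can be applied to numeric values or numeric range values extracted from documents. It dynamically builds fixed-size (interval) buckets over the values. For example, if documents have a numeric price field, the aggregation can be configured to build buckets with an interval of 5. When the aggregation runs, the price of each document is evaluated and rounded down to its closest bucket; with an interval of 5, a price of 32 falls into the bucket with key 30.
For range values, a document can fall into multiple buckets. The first bucket is computed from the lower bound of the range in the same way as the bucket for a single value. The last bucket is computed in the same way from the upper bound of the range, and the range is counted in every bucket between and including those two.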
interval must be a positive decimal, while offset must be a decimal in [0, interval), that is, greater than or equal to 0 and less than interval.
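The rounding rule can be written as bucket_key = floor((value - offset) / interval) * interval + offset. The following minimal Python sketch applies it and fills the gaps between the smallest and largest key with empty buckets; the ages are hypothetical and the helper names are illustrative.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0):
    """Round value down to the key of the bucket that contains it."""
    return math.floor((value - offset) / interval) * interval + offset


def histogram(values, interval, offset=0):
    """Return (key, doc_count) pairs, including empty buckets between min and max key."""
    counts = Counter(bucket_key(v, interval, offset) for v in values)
    if not counts:
        return []
    lo, hi = min(counts), max(counts)
    steps = round((hi - lo) / interval)
    return [(lo + i * interval, counts.get(lo + i * interval, 0)) for i in range(steps + 1)]


ages = [26, 29, 29, 34]  # hypothetical ages

print(bucket_key(32, 5))   # 30
print(histogram(ages, 3))  # [(24, 1), (27, 2), (30, 0), (33, 1)]
```

The sketch uses integer intervals; with fractional intervals the exact-key lookup would need a tolerance for floating-point error.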
The following example buckets documents by age at an interval of 3:
GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "interval": 3
      }
    }
  }
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age" : {
"buckets" : [
{
"key" : 24.0,
"doc_count" : 1
},
{
"key" : 27.0,
"doc_count" : 2
},
{
"key" : 30.0,
"doc_count" : 0
},
{
"key" : 33.0,
"doc_count" : 1
}
]
}
}
}
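Minimum document count
The response above shows that no document has an age in the range [30, 33). By default, the response fills such gaps in the histogram with empty buckets. You can change this with the min_doc_count parameter: a bucket is returned only if it contains at least min_doc_count documents, and buckets with fewer documents are omitted.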
GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "interval": 3,
        "min_doc_count": 1
      }
    }
  }
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age" : {
"buckets" : [
{
"key" : 24.0,
"doc_count" : 1
},
{
"key" : 27.0,
"doc_count" : 2
},
{
"key" : 33.0,
"doc_count" : 1
}
]
}
}
}
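As a rough illustration of how min_doc_count prunes buckets, the sketch below filters the four buckets of the earlier histogram response. The function name is illustrative; in Elasticsearch the filtering happens inside the aggregation, not on the client.

```python
def apply_min_doc_count(buckets, min_doc_count):
    """Keep only the buckets holding at least min_doc_count documents."""
    return [b for b in buckets if b["doc_count"] >= min_doc_count]


# Buckets as returned with the default min_doc_count of 0 (empty bucket included).
buckets = [
    {"key": 24.0, "doc_count": 1},
    {"key": 27.0, "doc_count": 2},
    {"key": 30.0, "doc_count": 0},
    {"key": 33.0, "doc_count": 1},
]

print([b["key"] for b in apply_min_doc_count(buckets, 1)])  # [24.0, 27.0, 33.0]
print([b["key"] for b in apply_min_doc_count(buckets, 2)])  # [27.0]
```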
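By default, the histogram returns every bucket within the range of the data itself: the document with the smallest value determines the minimum bucket (the one with the smallest key), and the document with the largest value determines the maximum bucket (the one with the largest key).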
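IP range aggregation
Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IP addresses. For example: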
GET my-index-000001/_search
{
"size": 0,
"aggs": {
"address_range": {
"ip_range": {
"field": "address",
"ranges": [
{
"from": "192.168.251.1",
"to": "192.168.251.11"
}
]
}
}
}
}
Note: ip_range must operate on a field that is mapped as the ip type.
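A minimal sketch of the bucketing, using Python's standard ipaddress module. It assumes the same boundary semantics as the range aggregation (from inclusive, to exclusive); the addresses are hypothetical and the function name is illustrative.

```python
import ipaddress


def ip_range_counts(addresses, ranges):
    """Count addresses per range; None means the range is open on that side."""
    parsed = [ipaddress.ip_address(a) for a in addresses]
    counts = {}
    for lo, hi in ranges:
        key = f"{lo or '*'}-{hi or '*'}"
        lo_ip = ipaddress.ip_address(lo) if lo else None
        hi_ip = ipaddress.ip_address(hi) if hi else None
        counts[key] = sum(
            1
            for ip in parsed
            # from is inclusive, to is exclusive (assumed, as for range).
            if (lo_ip is None or ip >= lo_ip) and (hi_ip is None or ip < hi_ip)
        )
    return counts


# Hypothetical addresses.
addresses = ["192.168.251.1", "192.168.251.10", "192.168.251.11", "192.168.251.189"]

print(ip_range_counts(addresses, [("192.168.251.1", "192.168.251.11")]))
# {'192.168.251.1-192.168.251.11': 2}
```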
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array:
curl -X GET "localhost:9200/ip_addresses/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"ip_ranges": {
"ip_range": {
"field": "ip",
"ranges": [
{ "to": "10.0.0.5" },
{ "from": "10.0.0.5" }
],
"keyed": true
}
}
}
}
'
Response:
{
...
"aggregations": {
"ip_ranges": {
"buckets": {
"*-10.0.0.5": {
"to": "10.0.0.5",
"doc_count": 10
},
"10.0.0.5-*": {
"from": "10.0.0.5",
"doc_count": 260
}
}
}
}
}
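Missing aggregation
A field-data-based single-bucket aggregation that creates one bucket containing all documents in the current document set context that are missing a field value (effectively, the field is absent or holds the configured NULL value). This aggregator is often used together with other field-data bucket aggregators (such as range) to return information about all the documents that could not be placed in any other bucket because they lack a field value.
Example: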
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
"aggs": {
"products_without_a_price": {
"missing": { "field": "price" }
}
}
}
'
In the example above, we get the total number of products that have no price.
{
  ...
  "aggregations": {
    "products_without_a_price": {
      "doc_count": 0
    }
  }
}
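Conceptually, the missing aggregation counts documents for which the field yields no value. A minimal sketch with hypothetical documents, treating both an absent field and an explicit null as missing:

```python
def missing_count(docs, field):
    """Count documents where the field is absent or explicitly null."""
    return sum(1 for doc in docs if doc.get(field) is None)


# Hypothetical product documents.
products = [
    {"name": "hat", "price": 200},
    {"name": "bag"},                   # field absent
    {"name": "scarf", "price": None},  # explicit null
]

print(missing_count(products, "price"))  # 2
```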
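Multi terms aggregation
A multi-bucket value-source aggregation in which buckets are built dynamically, one per unique combination of values. The multi_terms aggregation is very similar to the terms aggregation, but in most cases it is slower and consumes more memory. Therefore, if the same set of fields is used constantly, it is more efficient to index a combined key for those fields as a separate field and run the terms aggregation on that field.
The multi_terms aggregation is most useful when you need to sort by document count or by a metric aggregation on a composite key and retrieve the top N results. If sorting is not required and all values are expected to be retrieved, nested terms aggregations or composite aggregations are a faster and more memory-efficient solution. For example: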
curl -X GET "localhost:9200/products/_search?pretty" -H 'Content-Type: application/json' -d'
{
"aggs": {
"genres_and_products": {
"multi_terms": {
"terms": [{
"field": "genre"
}, {
"field": "product"
}]
}
}
}
}
'
Response:
{
...
"aggregations" : {
"genres_and_products" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : [
"rock",
"Product A"
],
"key_as_string" : "rock|Product A",
"doc_count" : 2
},
{
"key" : [
"electronic",
"Product B"
],
"key_as_string" : "electronic|Product B",
"doc_count" : 1
},
{
"key" : [
"jazz",
"Product B"
],
"key_as_string" : "jazz|Product B",
"doc_count" : 1
},
{
"key" : [
"rock",
"Product B"
],
"key_as_string" : "rock|Product B",
"doc_count" : 1
}
]
}
}
}
Note: by default, the multi_terms aggregation returns the buckets for the top ten terms ordered by doc_count. This default can be changed with the size parameter.
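The grouping behind multi_terms can be sketched with a Counter keyed by tuples. The documents are hypothetical, chosen to reproduce the bucket counts shown above, and breaking ties by ascending key is an assumption made to get a deterministic order.

```python
from collections import Counter


def multi_terms(docs, fields, size=10):
    """Group docs by the tuple of field values and return the top `size` buckets."""
    counts = Counter()
    for doc in docs:
        key = tuple(doc.get(f) for f in fields)
        if None in key:
            continue  # a document missing any key component is ignored by default
        counts[key] += 1
    # Order by doc_count descending; ties broken by key ascending (assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"key": list(k), "key_as_string": "|".join(map(str, k)), "doc_count": n}
        for k, n in ordered[:size]
    ]


# Hypothetical documents.
docs = [
    {"genre": "rock", "product": "Product A"},
    {"genre": "rock", "product": "Product A"},
    {"genre": "electronic", "product": "Product B"},
    {"genre": "jazz", "product": "Product B"},
    {"genre": "rock", "product": "Product B"},
]

for bucket in multi_terms(docs, ["genre", "product"]):
    print(bucket["key_as_string"], bucket["doc_count"])
```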
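Aggregation parameters
The following parameters are supported. See the terms aggregation for a more detailed explanation of each.
Parameter | Description |
---|---|
size | Optional. Defines how many term buckets should be returned out of the overall terms list. Defaults to 10. |
shard_size | Optional. The higher the requested size, the more accurate the results, but also the more expensive it is to compute the final result. The default shard_size is (size * 1.5 + 10). |
show_term_doc_count_error | Optional. Calculates the document count error on a per-term basis. Defaults to false. |
order | Optional. Specifies the order of the buckets. Defaults to the number of documents per bucket. |
min_doc_count | Optional. The minimum number of documents a bucket must contain to be returned. Defaults to 1. |
shard_min_doc_count | Optional. The minimum number of documents a bucket must contain on each shard to be returned. Defaults to min_doc_count. |
collect_mode | Optional. Specifies the data collection strategy. The depth_first and breadth_first modes are supported. Defaults to breadth_first. |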
Script
Generating terms with a script (here through a runtime field):
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
"runtime_mappings": {
"email_length_runtime": {
"type": "long",
"script": "emit(doc['\''email'\''].value.length())"
}
},
"aggs": {
"email_lengt_runtime_agg": {
"multi_terms": {
"terms": [
{
"field": "email_length_runtime"
},
{
"field": "address"
}
]
}
}
}
}'
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"email_lengt_runtime_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : [
12,
"192.168.251.10"
],
"key_as_string" : "12|192.168.251.10",
"doc_count" : 1
},
{
"key" : [
13,
"192.168.251.11"
],
"key_as_string" : "13|192.168.251.11",
"doc_count" : 1
},
{
"key" : [
13,
"192.168.251.189"
],
"key_as_string" : "13|192.168.251.189",
"doc_count" : 1
}
]
}
}
}
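Missing value
The missing parameter defines how documents that are missing a value should be treated. By default, if any of the key components is missing, the entire document is ignored, but the missing parameter lets such documents be treated as if they had a value.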
curl -X GET "localhost:9200/products/_search?pretty" -H 'Content-Type: application/json' -d'
{
"aggs": {
"genres_and_products": {
"multi_terms": {
"terms": [
{
"field": "genre"
},
{
"field": "product",
"missing": "Product Z"
}
]
}
}
}
}
'
Response:
{
...
"aggregations" : {
"genres_and_products" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : [
"rock",
"Product A"
],
"key_as_string" : "rock|Product A",
"doc_count" : 2
},
{
"key" : [
"electronic",
"Product B"
],
"key_as_string" : "electronic|Product B",
"doc_count" : 1
},
{
"key" : [
"electronic",
"Product Z"
],
"key_as_string" : "electronic|Product Z",
"doc_count" : 1
},
{
"key" : [
"jazz",
"Product B"
],
"key_as_string" : "jazz|Product B",
"doc_count" : 1
},
{
"key" : [
"rock",
"Product B"
],
"key_as_string" : "rock|Product B",
"doc_count" : 1
}
]
}
}
}
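The effect of the missing parameter can be sketched as a per-field default applied before grouping. The documents are hypothetical, and the (field, default) pair format is just a convenience of this sketch, not the request syntax.

```python
from collections import Counter


def multi_terms_with_missing(docs, terms):
    """terms is a list of (field, missing_default); a default of None means no substitute."""
    counts = Counter()
    for doc in docs:
        key = []
        for field, default in terms:
            value = doc.get(field)
            if value is None:
                value = default  # substitute the configured missing value, if any
            key.append(value)
        if None in key:
            continue  # still missing a component: the document is ignored
        counts[tuple(key)] += 1
    return dict(counts)


# Hypothetical documents; the last one has no product.
docs = [
    {"genre": "rock", "product": "Product A"},
    {"genre": "electronic", "product": "Product B"},
    {"genre": "electronic"},
]

print(multi_terms_with_missing(docs, [("genre", None), ("product", None)]))
print(multi_terms_with_missing(docs, [("genre", None), ("product", "Product Z")]))
```

Without a default, the third document is dropped; with "Product Z" as the default it lands in the ("electronic", "Product Z") bucket.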
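Mixing field types
When aggregating across multiple indices, the type of the aggregated field may not be the same in all of them. Some types are compatible with each other (integer and long, or float and double), but when the types are a mix of decimal and non-decimal numbers, the aggregation promotes the non-decimal numbers to decimal numbers. This can result in a loss of precision in the bucket values.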
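The precision loss is a general property of 64-bit floating point rather than anything specific to Elasticsearch: a double has a 53-bit significand, so not every 64-bit integer survives the promotion. A short Python illustration (Python floats are IEEE 754 doubles):

```python
big = 2**53 + 1         # fits in a 64-bit long: 9007199254740993
as_double = float(big)  # what happens when the value is promoted to a double

print(as_double == float(2**53))  # True: two distinct longs collapse to one double
print(int(as_double) == big)      # False: the original value is not recoverable
```

Two distinct long keys near this magnitude could therefore end up as the same bucket key after promotion.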
Sub-aggregations and sorting
multi_terms supports sub-aggregations and ordering the buckets by a metric sub-aggregation:
GET my-index-000001/_search
{
"size": 0,
"runtime_mappings": {
"email_length_runtime": {
"type": "long",
"script": "emit(doc['email'].value.length())"
}
},
"aggs": {
"email_lengt_runtime_agg": {
"multi_terms": {
"terms": [
{
"field": "email_length_runtime"
},
{
"field": "birthday"
}
],
"order": {
"age_total": "desc"
}
},
"aggs": {
"age_total": {
"sum": {
"field": "age"
}
}
}
}
}
}
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"email_lengt_runtime_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : [
13,
"2015-11-01T05:30:00.000Z"
],
"key_as_string" : "13|2015-11-01T05:30:00.000Z",
"doc_count" : 2,
"age_total" : {
"value" : 28.0
}
},
{
"key" : [
12,
"2015-11-01T05:30:00.000Z"
],
"key_as_string" : "12|2015-11-01T05:30:00.000Z",
"doc_count" : 1,
"age_total" : {
"value" : 12.0
}
}
]
}
}
}
Note: the order above is applied after the metric sub-aggregation has been computed; the buckets are sorted by the metric's result.
Nested aggregation
A special single-bucket aggregation that enables aggregating nested documents.
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
"size": 0,
"query": {
"nested": {
"path": "address",
"query": {
"term": {
"address.country": {
"value": "中国"
}
}
}
}
},
"aggs": {
"nested_agg_test": {
"nested": {
"path": "address"
},
"aggs": {
"shouru_count": {
"multi_terms": {
"terms": [
{
"field": "address.country"
},
{
"field": "address.city"
}
]
},
"aggs": {
"shouru_count": {
"sum": {
"field": "address.shouru"
}
}
}
}
}
}
}
}'
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"nested_agg_test" : {
"doc_count" : 3,
"shouru_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : [
"中国",
"北京"
],
"key_as_string" : "中国|北京",
"doc_count" : 2,
"shouru_count" : {
"value" : 600.0
}
},
{
"key" : [
"中国",
"石家庄"
],
"key_as_string" : "中国|石家庄",
"doc_count" : 1,
"shouru_count" : {
"value" : 200.0
}
}
]
}
}
}
}
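Range aggregation
A multi-bucket value-source aggregation that lets the user define a set of ranges, each of which represents a bucket. During aggregation, the value extracted from each document is checked against every range and the document is placed in the buckets whose range contains it. Each range takes two parameters, from and to, and is defined as [from, to): it includes the from value and excludes the to value. For example: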
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "age_range_test": {
      "range": {
        "field": "age",
        "ranges": [
          { "to": 10 },
          { "from": 10, "to": 20 },
          { "from": 20, "to": 30 },
          { "from": 30 }
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'
Response:
{
  ...
  "aggregations" : {
    "age_range_test" : {
      "buckets" : [
        {
          "key" : "*-10.0",
          "to" : 10.0,
          "doc_count" : 0,
          "age_avg" : {
            "value" : null
          }
        },
        {
          "key" : "10.0-20.0",
          "from" : 10.0,
          "to" : 20.0,
          "doc_count" : 1,
          "age_avg" : {
            "value" : 18.0
          }
        },
        {
          "key" : "20.0-30.0",
          "from" : 20.0,
          "to" : 30.0,
          "doc_count" : 1,
          "age_avg" : {
            "value" : 20.0
          }
        },
        {
          "key" : "30.0-*",
          "from" : 30.0,
          "doc_count" : 1,
          "age_avg" : {
            "value" : 30.0
          }
        }
      ]
    }
  }
}
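The half-open [from, to) rule is easiest to see at the boundaries. The sketch below assumes the three ages 18, 20, and 30, which is what the single-document averages in the response imply; the function name is illustrative. An age of exactly 20 goes into the 20-30 bucket and an age of exactly 30 into the open-ended last bucket.

```python
def range_buckets(values, ranges):
    """Bucket values into [from, to) ranges; None leaves that side unbounded."""
    buckets = []
    for lo, hi in ranges:
        members = [
            v for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        ]
        key = f"{'*' if lo is None else float(lo)}-{'*' if hi is None else float(hi)}"
        buckets.append({
            "key": key,
            "doc_count": len(members),
            # avg sub-aggregation; null (None) when the bucket is empty
            "age_avg": sum(members) / len(members) if members else None,
        })
    return buckets


ages = [18, 20, 30]  # assumed from the averages in the response
ranges = [(None, 10), (10, 20), (20, 30), (30, None)]

for bucket in range_buckets(ages, ranges):
    print(bucket["key"], bucket["doc_count"], bucket["age_avg"])
```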
Keyed response
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array:
curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "age_range_test": {
      "range": {
        "field": "age",
        "keyed": true,
        "ranges": [
          { "to": 10 },
          { "from": 10, "to": 20 },
          { "from": 20, "to": 30 },
          { "from": 30 }
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'
Response:
{
...
"aggregations" : {
"age_range_test" : {
"buckets" : {
"*-10.0" : {
"to" : 10.0,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
"10.0-20.0" : {
"from" : 10.0,
"to" : 20.0,
"doc_count" : 1,
"age_avg" : {
"value" : 18.0
}
},
"20.0-30.0" : {
"from" : 20.0,
"to" : 30.0,
"doc_count" : 1,
"age_avg" : {
"value" : 20.0
}
},
"30.0-*" : {
"from" : 30.0,
"doc_count" : 1,
"age_avg" : {
"value" : 30.0
}
}
}
}
}
}