文章目录
按条件聚合
例如:查询 30 岁以下的人的 “age” 统计信息
- 使用 query 条件进行过滤
{
"size": 0,
"query":{
"range":{
"age":{
"lte":30
}
}
},
"aggs" :{
"age_aggs":{
"stats":{
"field":"age"
}
}
}
}
结果:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 514,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"age_aggs": {
"count": 514,
"min": 10.0,
"max": 30.0,
"avg": 20.445525291828794,
"sum": 10509.0
}
}
}
- 使用 filter 进行过滤
{
"size": 0,
"aggs" :{
"filter_aggs":{
"filter":{
"range":{
"age":{
"lte":30
}
}
},
"aggs":{
"age_stat":{ "stats":{"field":"age"}}
}
}
}
}
注:此示例出现了 aggs 嵌套,这个我们在下面再说,此处先关注 filter 的使用
结果:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1003,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"filter_aggs": {
"doc_count": 514,
"age_stat": {
"count": 514,
"min": 10.0,
"max": 30.0,
"avg": 20.445525291828794,
"sum": 10509.0
}
}
}
}
filters 多条件聚合
列如:查询 10 ~ 20 岁 和 20 ~ 30 岁的文档个数
{
"size": 0,
"aggs" :{
"state_f":{
"filters":{
"filters":{
"a1020":{
"range":{"age":{"gte":"10","lt":"20"}}
},
"a2030":{
"range":{"age":{"gte":"20","lt":"30"}}
}
}
}
}
}
}
查询结果:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1003,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"state_f": {
"buckets": {
"a1020": {
"doc_count": 219
},
"a2030": {
"doc_count": 268
}
}
}
}
}
other_bucket_key 统计其他
列如:查询 10 ~ 20 岁 和 20 ~ 30 岁 和 其他情况的文档个数
{
"size": 0,
"aggs" :{
"state_f":{
"filters":{
"other_bucket_key": "others",
"filters":{
"a1020":{
"range":{"age":{"gte":"10","lt":"20"}}
},
"a2030":{
"range":{"age":{"gte":"20","lt":"30"}}
}
}
}
}
}
}
查询结果
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1003,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"state_f": {
"buckets": {
"a1020": {
"doc_count": 219
},
"a2030": {
"doc_count": 268
},
"others": {
"doc_count": 516
}
}
}
}
}
嵌套聚合
列如:查询 10 ~ 20 岁 和 20 ~ 30 岁 和 其他年龄段的年龄统计数据(max、min、sum、avg、count)
{
"size": 0,
"aggs" :{
"state_f":{
"filters":{
"other_bucket_key": "others",
"filters":{
"a1020":{
"range":{"age":{"gte":"10","lt":"20"}}
},
"a2030":{
"range":{"age":{"gte":"20","lt":"30"}}
}
}
},
"aggs":{
"state_age":{"stats":{"field":"age"}}
}
}
}
}
注意json对象的层数,以及对应的字段所在层数。
按聚合结果进行排序(桶排序)
列如:查询 10 ~ 20 岁 和 20 ~ 30 岁 和 其他年龄段的年龄统计数据(max、min、sum、avg、count)并按统计求和数进行降序
{
"size": 0,
"aggs" :{
"state_f":{
"filters":{
"other_bucket_key": "others",
"filters":{
"a1020":{
"range":{"age":{"gte":"10","lt":"20"}}
},
"a2030":{
"range":{"age":{"gte":"20","lt":"30"}}
}
}
},
"aggs":{
"stats_age":{"stats":{"field":"age"}},
"sum_age_sort":{
"bucket_sort":{
"sort":[{
"stats_age.sum": { "order": "desc" }
}]
}
}
}
}
}
}
因为 stats_age 返回是一个对象,所以在桶排序中sort的字段为stats_age.sum。
参数说明
sum_age_sort 是我们自定义的排序的名称,bucket_sort 表示桶排序,bucket_sort 有如下几个参数:
- sort:排序规则列表(与我们 查询进阶 一章的排序规则一样)
- from:排序的桶的开始位置。默认为0
- size:要返回的排序桶的数量。默认为所有桶。(即:默认对所有桶进行排序)
- gap_policy:在数据中发现差距时应用的策略。默认是 skip
数据的差距
我们的数据通常充满噪音或间隙,常见的原因:
- 落入桶中的文档不包含必填字段
- 没有与一个或多个存储桶的查询相匹配的文档
- 正在计算的指标无法生成值,可能是因为另一个依赖存储桶缺少值。或者要计算的值不满足特定的要求等等。
数据差距策略
- skip:将丢失的数据视为存储桶不存在。它将跳过存储桶并使用下一个可用值继续计算。
- insert_zeros(插入0):此选项将用零 ( 0) 替换缺失值,并且管道聚合计算将正常进行。
- keep_values(保留值):此选项与跳过类似,除非指标提供非空、非 NaN 值,否则使用该值,否则将跳过空存储桶。
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1003,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"state_f": {
"buckets": {
"others": {
"doc_count": 516,
"stats_age": {
"count": 516,
"min": 30.0,
"max": 50.0,
"avg": 39.9031007751938,
"sum": 20590.0
}
},
"a2030": {
"doc_count": 268,
"stats_age": {
"count": 268,
"min": 20.0,
"max": 29.0,
"avg": 24.473880597014926,
"sum": 6559.0
}
},
"a1020": {
"doc_count": 219,
"stats_age": {
"count": 219,
"min": 10.0,
"max": 19.0,
"avg": 14.337899543378995,
"sum": 3140.0
}
}
}
}
}
}
以聚合结果进行再聚合(管道聚合)
查询 10 ~ 20 岁 和 20 ~ 30 岁 和 其他年龄段的年龄统计数据(max、min、sum、avg、count),并查询平均年龄最小的桶
min_bucket 最小桶
{
"size": 0,
"aggs" :{
"state_f":{
"filters":{
"other_bucket_key": "others",
"filters":{
"a1020":{
"range":{"age":{"gte":"10","lt":"20"}}
},
"a2030":{
"range":{"age":{"gte":"20","lt":"30"}}
}
}
},
"aggs":{
"state_age":{"stats":{"field":"age"}}
}
},
"sss":{
"min_bucket":{
"buckets_path": "state_f > state_age.avg"
}
}
}
}
sss 的结果即是我们的最小桶
示例 buckets_path 说明
state_f > state_age.avg ,表示按 state_f(聚合的名称)的结果指标 state_age 的 avg 的值来度量。如果是 min_bucket 则取最小的桶。注意:> 不是小于符号
结果
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1003,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"state_f": {
"buckets": {
"a1020": {
"doc_count": 219,
"state_age": {
"count": 219,
"min": 10.0,
"max": 19.0,
"avg": 14.337899543378995,
"sum": 3140.0
}
},
"a2030": {
"doc_count": 268,
"state_age": {
"count": 268,
"min": 20.0,
"max": 29.0,
"avg": 24.473880597014926,
"sum": 6559.0
}
},
"others": {
"doc_count": 516,
"state_age": {
"count": 516,
"min": 30.0,
"max": 50.0,
"avg": 39.9031007751938,
"sum": 20590.0
}
}
}
},
"sss": {
"value": 14.337899543378995,
"keys": [
"a1020"
]
}
}
}
min_bucket 是取最小桶,同理 max_bucket 取最大桶。
sum_bucket 桶求和
查询 10 ~ 20 岁 和 20 ~ 30 岁 和 其他年龄段的年龄统计数据(max、min、sum、avg、count),并计算平均年龄之和
{
"size": 0,
"aggs" :{
"state_f":{
"filters":{
"other_bucket_key": "others",
"filters":{
"a1020":{
"range":{"age":{"gte":"10","lt":"20"}}
},
"a2030":{
"range":{"age":{"gte":"20","lt":"30"}}
}
}
},
"aggs":{
"state_age":{"stats":{"field":"age"}}
}
},
"sss":{
"sum_bucket":{
"buckets_path": "state_f > state_age.avg"
}
}
}
}
结果
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1003,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"state_f": {
"buckets": {
"a1020": {
"doc_count": 219,
"state_age": {
"count": 219,
"min": 10.0,
"max": 19.0,
"avg": 14.337899543378995,
"sum": 3140.0
}
},
"a2030": {
"doc_count": 268,
"state_age": {
"count": 268,
"min": 20.0,
"max": 29.0,
"avg": 24.473880597014926,
"sum": 6559.0
}
},
"others": {
"doc_count": 516,
"state_age": {
"count": 516,
"min": 30.0,
"max": 50.0,
"avg": 39.9031007751938,
"sum": 20590.0
}
}
}
},
"sss": {
"value": 78.71488091558771
}
}
}
注意:min_bucket 、max_bucket 、sum_bucket 都是聚合同级的桶。
bucket_script 聚合(桶)脚本
示例:查询男女的最大年龄差。即:男最大年龄 - 男最小年龄;女最大年龄 - 最小年龄;
{
"size": 0,
"aggs" :{
"group_sex":{
"terms":{
"field":"sex"
},
"aggs":{
"stats_age":{
"stats":{
"field":"age"
}
},
"sex-percentage": {
"bucket_script": {
"buckets_path": {
"age_min": "stats_age.min",
"age_max": "stats_age.max"
},
"script": "params.age_max - params.age_min"
}
}
}
}
}
}
这里用到了 bucket_script 来进行相关的计算返回,使用是注意 bucket_script 只能用在子聚合中,注意层次,其次需要注意 bucket_script 的计算结果只能是数字类型。关于脚本的详细讲解,我们会在后面具体章节来说。这里 bucket_script 把 buckets_path 定义的字段作为参数在 script 脚本中进行计算
执行结果:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1003,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"group_sex": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 511,
"stats_age": {
"count": 511,
"min": 10.0,
"max": 50.0,
"avg": 30.031311154598825,
"sum": 15346.0
},
"sex-percentage": {
"value": 40.0
}
},
{
"key": 0,
"doc_count": 492,
"stats_age": {
"count": 492,
"min": 10.0,
"max": 50.0,
"avg": 30.371951219512194,
"sum": 14943.0
},
"sex-percentage": {
"value": 40.0
}
}
]
}
}
}
由于我们数据的特殊原因,此处计算出的男女的年龄差都是40
bucket_selector 过滤桶聚合
查询 10 ~ 20 岁 和 20 ~ 30 岁 和 其他年龄段的年龄统计数据(max、min、sum、avg、count),并筛选平均年龄大于20的桶
{
"size": 0,
"aggs" :{
"state_f":{
"filters":{
"other_bucket_key": "others",
"filters":{
"a1020":{
"range":{"age":{"gte":"10","lt":"20"}}
},
"a2030":{
"range":{"age":{"gte":"20","lt":"30"}}
}
}
},
"aggs":{
"state_age":{"stats":{"field":"age"}},
"age_bucket_filter":{
"bucket_selector":{
"buckets_path":{
"bucket_avg_age":"state_age.avg"
},
"script": "params.bucket_avg_age > 20"
}
}
}
}
}
}
使用 bucket_selector 来实现分组的过滤,此方式类似于 sql 的 having 查询,script 脚本中的 20 可以定义为脚本参数,我们甚至可以预定义脚本,这都是我们后续脚本编写一章中再说。
结果
{
"took": 182,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1003,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"state_f": {
"buckets": {
"a2030": {
"doc_count": 268,
"state_age": {
"count": 268,
"min": 20.0,
"max": 29.0,
"avg": 24.473880597014926,
"sum": 6559.0
}
},
"others": {
"doc_count": 516,
"state_age": {
"count": 516,
"min": 30.0,
"max": 50.0,
"avg": 39.9031007751938,
"sum": 20590.0
}
}
}
}
}
}
注意我们的示例其实没有实际的意义,只为了讲解 bucket_selector 的用法而已,实际中,可能是计算商品总金额超过800w的某些类型的商品。