聚合查询
Elasticsearch 聚合
Elasticsearch 聚合2
Elasticsearch三种聚合 Metrics Aggregations、Bucket Aggregations、Pipeline Aggregations中的常用聚合。
- Metrics Aggregations 度量聚合
如Count、Sum、Min、Max、Avg、Count(Distinct)就是度量。 - Bucket Aggregations 分桶聚合
如 Group by country,每个country就是一个桶,也可以叫做一个分组。可对每个分组内的数据进行聚合。 - Pipeline Aggregations 管道聚合
管道聚合,基于现有的聚合结果,再进行聚合。
1.数据准备
创建索引
PUT user_logs
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
},
"mappings": {
"recharge_log": {
"properties": {
"uid": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age": {
"type": "integer"
},
"country": {
"type": "keyword"
},
"payTime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"payWay": {
"type": "integer"
},
"money": {
"type": "integer"
}
}
}
}
}
插入数据
POST /user_logs/recharge_log/_bulk
{"index": {}}
{"uid": "1", "country": "US", "age": 26, "payWay": 2, "money": 30, "payTime": "2016-08-25 08:05:16", "name": "Rose petal"}
{"index": {}}
{"uid": "1", "country": "US", "age": 26, "payWay": 1, "money": 20, "payTime": "2016-08-26 08:05:16", "name": "Rose petal"}
{"index": {}}
{"uid": "1", "country": "US", "age": 26, "payWay": 1, "money": 30, "payTime": "2016-08-27 10:05:16", "name": "Rose petal"}
{"index": {}}
{"uid": "2", "country": "CN", "age": 23, "payWay": 2, "money": 20, "payTime": "2016-08-25 08:05:16", "name": "Belen Rose"}
{"index": {}}
{"uid": "2", "country": "CN", "age": 23, "payWay": 2, "money": 20, "payTime": "2016-08-26 08:05:16", "name": "Belen Rose"}
{"index": {}}
{"uid": "2", "country": "CN", "age": 23, "payWay": 1, "money": 20, "payTime": "2016-08-27 10:05:16", "name": "Belen Rose"}
{"index": {}}
{"uid": "3", "country": "CN", "age": 29, "payWay": 2, "money": 20, "payTime": "2016-08-25 08:05:16", "name": "Rose petal"}
{"index": {}}
{"uid": "3", "country": "CN", "age": 29, "payWay": 2, "money": 20, "payTime": "2016-08-26 08:05:16", "name": "Rose petal"}
{"index": {}}
{"uid": "3", "country": "CN", "age": 29, "payWay": 1, "money": 10, "payTime": "2016-08-27 10:05:16", "name": "Rose petal"}
只返回聚合结果
设置size=0,只返回聚合结果,不返回搜索结果。
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"avg_money": {
"avg": {
"field": "money"
}
}
}
}
#返回
{
"took": 895,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"avg_money": {
"value": 21.11111111111111
}
}
}
同时返回聚合类型
添加typed_keys参数,在返回聚合结果时,可同时返回聚合类型。
如下avg#avg_money,在聚合名称(avg_money)前添加聚合类型(avg)前缀,并一并返回。
#查询
GET user_logs/recharge_log/_search?size=0&typed_keys
{
"query": {
"match_all": {}
},
"aggs": {
"avg_money": {
"avg": {
"field": "money"
}
}
}
}
#返回
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"avg#avg_money": {
"value": 21.11111111111111
}
}
}
2.度量聚集
2.1 max/min/sum/avg
查询结果中某字段的最大值。
#1.查询:单笔充值最高值
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"max_money":{
"max": {
"field": "money"
}
}
}
}
max_money是我们自定义的名称,用来接收聚合结果
#2.查询:最小充值金额
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"min_money":{
"min": {
"field": "money"
}
}
}
}
#3.查询:总充值金额
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"sum_money":{
"sum": {
"field": "money"
}
}
}
}
##4.查询:平均充值金额=总充值金额/充值订单数
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"avg_money":{
"avg": {
"field": "money"
}
}
}
}
2.2 Value Count
相似于count(列)
值计数聚合。计算聚合文档中某个值(可以是特定的数值型字段,也可以通过脚本计算而来)的个数。该聚合一般域其它 single-value 聚合联合使用,比如在计算一个字段的平均值的时候,可能还会关注这个平均值是由多少个值计算而来。
#查询:文档总数
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"value_count_docs": {
"value_count": {
"field": "_id"
}
}
}
}
2.3 去重计数聚合(Cardinality Aggregation)
计算查询结果中某字段值的去重计数。
去重计数聚合基于HyperLogLog++ 算法,这个算法基于散列值,可通过precision_threshold控制精度。
precision_threshold默认值3000,最大值40000。
#查询:充值uv
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"recharge_uv":{
"cardinality": {
"field": "uid",
"precision_threshold": 40000
}
}
}
}
#返回
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"recharge_uv": {
"value": 3
}
}
}
最终的值有3个,即[1,2,3]
2.4 统计信息聚合(Stats Aggregation)
Stats Aggregation,统计聚合。它属于multi-value,基于文档的某个值(可以是特定的数值型字段,也可以通过脚本计算而来),计算出一些统计信息(min、max、sum、count、avg5个值)。
计算查询结果中某字段的统计数据。
#查询:充值金额统计
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"stats_money":{
"stats": {
"field": "money"
}
}
}
}
#返回
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"stats_money": {
"count": 9,
"min": 10,
"max": 30,
"avg": 21.11111111111111,
"sum": 190
}
}
}
2.5 百分位数聚合(Percentiles Aggregation)
计算查询结果中某字段的百分位数。可用来评估字段值分布。
如用来评估请求SLA延迟是否达标。每条请求日志都包含请求延时字段,可以查看请求延时99百分位数,如果99百分位数的值在100ms以内则说明请求在100ms内响应的占比99%,SLA达标,否则SLA不达标。
如下,返回结果50.0:20,意味着,单笔充值金额在20以内的占比50%。
默认百分位数1,5,25,50,75,95,99。
注意:
百分位数也可通过numpy取得。如 print numpy.percentile(numpy.array([5,10,18]),99)
常用平均值度量,但平均值容易受异常最大最小值影响,
#查询:
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"percentiles_money":{
"percentiles": {
"field": "money",
"percents": [
50,
99
]
}
}
}
}
#返回
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"percentiles_money": {
"values": {
"50.0": 20, //单笔充值金额在20以内的占比50%
"99.0": 30 //单笔充值金额在30以内的占比99%
}
}
}
}
3.分桶聚合(Bucket Aggregations)
它执行的是对文档分组的操作(与sql中的group by类似),把满足相关特性的文档分到一个桶里,即桶分,输出结果往往是一个个包含多个文档的桶(一个桶就是一个group)。
它有一个关键字(field、script),以及一些桶分(分组)的判断条件。执行聚合操作时候,文档会判断每一个分组条件,如果满足某个,该文档就会被分为该组(fall in)。
它不进行权值的计算,他们对文档根据聚合请求中提供的判断条件(比如:{“from”:0, “to”:100})来进行分组(桶分)。桶聚合还会额外返回每一个桶内文档的个数。
3.1 Terms Aggregation 词聚合
Terms Aggregation,词聚合。基于某个field,该 field 内的每一个【唯一词元】为一个桶,并计算每个桶内文档个数。默认返回顺序是按照文档个数多少排序。它属于multi-bucket
当不返回所有 buckets 的情况(它size控制),文档个数可能不准确。
#每种充值方式对应的充值总额
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"terms_payWay": {
"terms": {
"field": "payWay"
}
}
}
}
##返回
{
"took": 379,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"terms_payWay": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 2,
"doc_count": 5
},
{
"key": 1,
"doc_count": 4
}
]
}
}
}
##扩展
{
"aggs" : {
"age_terms" : {
"terms" : {
"field" : "age",
"size" : 10, //size用来定义需要返回多个 buckets(防止太多),默认会全部返回。
"order" : { "_count" : "asc" }, //根据文档计数排序,根据分组值排序({ "_key" : "asc" })
"min_doc_count": 10, //只返回文档个数不小于该值的 buckets
"include" : ".*sport.*", //包含过滤
"exclude" : "water_.*", //排除过滤
"missing": "N/A"
}
}
}
}
3.2 单过滤聚合(Filter Aggregation)
对Quey结果单次Filter后形成的新的Bucket进行聚合。是单Bucket聚合。
POST user_logs/recharge_log/_search?size=0
{
"query": {"match_all": {}},
"aggs": {
"payWay2_agg":{
"filter": {"term": {"payWay": 2}}, //先过滤出来payWay为2的数据
"aggs": {"sum_money_payWay2": {"sum": {"field":"money"}}} //对过滤出的数据,根据money字段进行分组
}
}
}
(1)先过滤出来payWay为2的数据,显示有5条数据
(2)对过滤出来的5条数据,根据money字段进行分组
#返回
"aggregations": {
"payWay2_agg": {
"doc_count": 5,
"sum_money_payWay2": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 23,
"doc_count": 2
},
{
"key": 29,
"doc_count": 2
},
{
"key": 26,
"doc_count": 1
}
]
}
}
}
进阶:
{
"query": {
"match_all": {}
},
"aggs": {
"payWay2_agg": {
"filter": {
"term": {
"payWay": 2 //1.过滤出payWay为2的数据,一共有5条数据,显示payWay2_agg的doc_count为5
}
},
"aggs": {
"sum_money_payWay2": {
"terms": {
"field": "age" //2.对这5条数据,以age分组,一共分为3组,23有2个.26有1个,29有2个
},
"aggs": {
"filter_CN":{
"filter": {
"term": {
"country":"CN" //3.对这三个组中的数据进行过滤,在每个桶中筛选出country为CN的数据,23的filter_CN的doc_count是2,26的filter_CN的doc_count是0,29的filter_CN的doc_count是2
}
},
"aggs": {
"uid_count": {
"terms": {
"field": "uid" //4.三个桶都过滤出了数据,此时对每个桶中的数据进行分组
}
}
}
}
}
}
}
}
}
}
3.3 多过滤聚合(Filters Aggregation)
对Quey结果多次Filter后形成的多个Bucket分别聚合。是多Bucket聚合。
POST user_logs/recharge_log/_search?size=0
{
"query": {"match_all": {}},
"aggs": {
"flters_agg": {
"filters": {
"filters": {
"uid1": {"term": {"uid": 1}},
"uid2":{"term": {"uid": 2}}
}
},
"aggs": {
"payWay": {
"terms": {
"field": "payWay"
}
}
}
}
}
}
总结:Filter,Filters,terms的区别
## 1.单过滤聚合
{
"query": {"match_all": {}},
"aggs": {
"flters_agg": {
"filter":{
"term": {"uid": 1}
}
}
}
}
## 结果:过滤出了uid为1的数据有3个
{
"aggregations": {
"flters_agg": {
"doc_count": 3
}
}
}
## 2.多过滤聚合
{
"query": {"match_all": {}},
"aggs": {
"flters_agg": {
"filters":{
"filters":{
"uid1": {"term": {"uid": 1}},
"uid2":{"term": {"uid": 2}}
}
}
}
}
}
## 结果:过滤出了uid为1的有3个,uid为2的也有3个(等于有两个桶了,本质上就是桶)
{
"aggregations": {
"flters_agg": {
"buckets": {
"uid1": {
"doc_count": 3
},
"uid2": {
"doc_count": 3
}
}
}
}
}
## 3.唯一值聚合
{
"query": {"match_all": {}},
"aggs": {
"terms_agg":{
"terms":{
"field":"uid"
}
}
}
}
## 结果,我们看到返回结果和多过滤聚合的返回结果是差不多的,不过多过滤的buckets是对象(对象有属性叫uid1和uid2),而唯一值聚合的buckets是数组
{
"aggregations": {
"terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 3
},
{
"key": "2",
"doc_count": 3
},
{
"key": "3",
"doc_count": 3
}
]
}
}
}
3.4 范围聚合(Range Aggregation)
自定义一系列范围,每个范围代表一个分桶。
GET user_logs/recharge_log/_search?size=0
{
"query": {
"range": {
"age": {
"gte": 20,
"lte": 36
}
}
},
"aggs": {
"age_range_agg": {
"range": {
"field": "age",
"ranges": [
{
"to": 30
},
{
"from": 25,
"to": 30
}
]
}
}
}
}
##结果如下:
{
"aggregations": {
"age_range_agg": {
"buckets": [
{
"key": "*-30.0", //表示0-30的有9个
"to": 30,
"doc_count": 9
},
{
"key": "25.0-30.0", //表示25-30的有6个
"from": 25,
"to": 30,
"doc_count": 6
}
]
}
}
}
##扩展:
{
"query": {
"range": {
"age": {
"gte": 20,
"lte": 36
}
}
},
"aggs": {
"age_range_agg": {
"range": {
"field": "age",
"ranges": [{
"to": 30
},
{
"from": 25,
"to": 30
}
]
},
"aggs": {
"country": {
"terms": {
"field": "country"
}
}
}
}
}
}
#结果如下
{
"aggregations": {
"age_range_agg": {
"buckets": [
{
"key": "*-30.0",
"to": 30,
"doc_count": 9,
"xxx": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "CN",
"doc_count": 6
},
{
"key": "US",
"doc_count": 3
}
]
}
},
{
"key": "25.0-30.0",
"from": 25,
"to": 30,
"doc_count": 6,
"xxx": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "CN",
"doc_count": 3
},
{
"key": "US",
"doc_count": 3
}
]
}
}
]
}
}
}
3.5 日期范围聚合
对日期字段按日期范围聚合。是多Bucket聚合。
同范围聚合(Range Aggregation)相比,日期范围聚合可用ES 内置日期格式或JodaDate日期表达式来表示日期范围。
format指定输入和输出日期格式。
GET /user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"date_range_payTime": {
"date_range": {
"field": "payTime",
"format": "epoch_second",
"ranges": [
{
"from": 1472083200000,
"to": 1472169600000
},
{
"from": 1472169600000,
"to": 1472256000000
}
]
}
}
}
}
## 返回值
"aggregations": {
"date_range_payTime": {
"buckets": [
{
"key": "1472083200-1472169600", //2016-08-25 08:00:00 ~ 2016-08-26 08:00:00 的文档有3个
"from": 1472083200000,
"from_as_string": "1472083200",
"to": 1472169600000,
"to_as_string": "1472169600",
"doc_count": 3
},
{
"key": "1472169600-1472256000", //2016-08-26 08:00:00 ~ 2016-08-27 08:00:00 的文档有三个
"from": 1472169600000,
"from_as_string": "1472169600",
"to": 1472256000000,
"to_as_string": "1472256000",
"doc_count": 3
}
]
}
}
## 扩展 使用JodaDate日期表达式
{
"query": {
"match_all": {}
},
"aggs": {
"date_range_payTime": {
"date_range": {
"field": "payTime",
"format": "yyyy-MM-dd",
"ranges": [
{
"to": "now-3y" //当前时间2019/10/15日 往前推3年,即2016/10/15
},
{
"from": "now-3y"
}
]
}
}
}
}
结果如下
"aggregations": {
"date_range_payTime": {
"buckets": [
{
"key": "*-2016-10-15", //2016-10-15以前的文档有9个
"to": 1476516729585,
"to_as_string": "2016-10-15",
"doc_count": 9
},
{
"key": "2016-10-15-*", //2016-10-15以后的文档有0个
"from": 1476516729585,
"from_as_string": "2016-10-15",
"doc_count": 0
}
]
}
}
如果想让时间设定为月初开始,即2016/10/1
{
"query": {
"match_all": {}
},
"aggs": {
"date_range_payTime": {
"date_range": {
"field": "payTime",
"format": "yyyy-MM-dd",
"ranges": [
{
"to": "now-3y/M" //当前时间2019/10/15日 往前推3年的当月第一天,即2016/10/1
},
{
"from": "now-3y/M"
}
]
}
}
}
}
可参考:date range
3.6 直方图聚合(Histogram Aggregation)
#value:字段值
#interval:间隔
#rem:取余
#举例:
#如 interval=10
#当 value=10 时,rem=0,则bucket_key=10
rem = value % interval
if (rem < 0) {
rem += interval
}
bucket_key = value - rem
#查询:每个充值区间的订单数
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"money_histogram": {
"histogram": {
"field": "money", //必须为数值类型
"interval": 10, //分桶间距
"min_doc_count" : 1, //最少文档数桶过滤,只有不少于这么多文档的桶才会返回
"extended_bounds" : { //范围扩展
"min" : 0,
"max" : 500
},
"order" : { "_count" : "desc" },//对桶排序,如果 histogram 聚合有一个权值聚合类型的"直接"子聚合,那么排序可以使用子聚合中的结果
"keyed":true, //hash结构返回,默认以数组形式返回每一个桶
"missing":0 //配置缺省默认值
}
}
}
}
3.7 全局聚合(Global Aggregation)
全局聚合不受Query的影响。是单Bucket聚合。
#全部年龄充值总金额和query年龄段充值总金额。
{
"query": {"range": {"age": {"gte": 10,"lte": 25}}},
"aggs": {
"all_age_sum_money": {
"global": {}, //all_age_sum_money使用全局聚合
"aggs": {
"sum_money": {
"sum": {
"field": "money"
}
}
}
},
"this_age_sum_money":{
"sum": { //this_age_sum_money不使用全局聚合
"field": "money"
}
}
}
}
#返回结果
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"all_age_sum_money": { //全局聚合结果-不受Query影响
"doc_count": 9,
"sum_money": {
"value": 190
}
},
"this_age_sum_money": { //Query聚合结果
"value": 60
}
}
}
3.8 缺失值聚合(Missing Aggregation)
对Query结果中某缺失字段统计。是单Bucket聚合。
#统计没有payWay字段或payWay字段为NULL的文档数。
GET user_logs/recharge_log/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"missing_payWay": {
"missing": {
"field": "payWay"
}
}
}
}
#返回结果
{
"took": 23,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"missing_payWay": {
"doc_count": 0
}
}
}
管道聚合
管道聚合分为两类:
-
Sibling
基于同级聚合结果再进行聚合。 -
Parent
基于父级聚合结果再进行聚合。
1.桶均值聚合(Avg Bucket Aggregation)
同级管道聚合,计算同级聚合中所有桶指定度量的均值。
先对日期进行分组,每一天的数据一个桶,之后计算每个桶的充值金额总数
#每天充值金额与平均每天充值金额
GET user_logs/recharge_log/_search?size=0
{
"query": {"match_all": {}},
"aggs": {
"date_agg": { //对日期进行聚合,此处的day表示天
"date_histogram": {
"field": "payTime",
"interval": "day"
},
"aggs": { //同级聚合,和date_histogram同级,日期分桶后,计算桶中的金额总数
"day_avg_money": {
"avg": {
"field": "money"
}
}
}
},
"pipeline_avg_money":{ //平均每天充值金额
"avg_bucket": {
"buckets_path": "date_agg>day_sum_money"
}
}
}
}
## 返回结果
{
"took": 182,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 9,
"max_score": 0,
"hits": []
},
"aggregations": {
"date_agg": {
"buckets": [
{
"key_as_string": "2016-08-25 00:00:00",
"key": 1472083200000,
"doc_count": 3,
"day_sum_money": {
"value": 70
}
},
{
"key_as_string": "2016-08-26 00:00:00",
"key": 1472169600000,
"doc_count": 3,
"day_sum_money": {
"value": 60
}
},
{
"key_as_string": "2016-08-27 00:00:00",
"key": 1472256000000,
"doc_count": 3,
"day_sum_money": {
"value": 60
}
}
]
},
"pipeline_avg_money": {
"value": 63.333333333333336
}
}
}