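Definition
Besides search, Elasticsearch (ES) can also run statistical analysis over the data, with near-real-time results.
Bucket aggregation
A bucket aggregation groups the documents that satisfy a given condition into buckets. The following request buckets flights by destination country: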
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "dest": {
      "terms": {
        "field": "DestCountry"
      }
    }
  }
}
The output is shown below; the aggregations section lists the number of flights to each destination country:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"dest" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3187,
"buckets" : [
{
"key" : "IT",
"doc_count" : 2371
},
{
"key" : "US",
"doc_count" : 1987
},
{
"key" : "CN",
"doc_count" : 1096
},
{
"key" : "CA",
"doc_count" : 944
},
{
"key" : "JP",
"doc_count" : 774
},
{
"key" : "RU",
"doc_count" : 739
},
{
"key" : "CH",
"doc_count" : 691
},
{
"key" : "GB",
"doc_count" : 449
},
{
"key" : "AU",
"doc_count" : 416
},
{
"key" : "PL",
"doc_count" : 405
}
]
}
}
}
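To make the bucketing step concrete, here is a minimal Python sketch of what a terms aggregation computes. The sample documents and the helper name terms_buckets are assumptions for illustration only; they are not part of Elasticsearch or of the sample data set.

```python
from collections import Counter

# Hypothetical sample documents; only the field name DestCountry
# comes from the request above.
flights = [
    {"DestCountry": "IT"},
    {"DestCountry": "US"},
    {"DestCountry": "IT"},
    {"DestCountry": "CN"},
    {"DestCountry": "IT"},
    {"DestCountry": "US"},
]


def terms_buckets(docs, field, size=10):
    """Group docs by the value of `field` and return the `size` largest buckets."""
    counts = Counter(doc[field] for doc in docs)
    # Sort by doc_count descending, then by key ascending to break ties.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]


buckets = terms_buckets(flights, "DestCountry")
print(buckets)
```

Each bucket pairs a distinct field value with the number of documents that carry it, which is the shape of the buckets array in the response above.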
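Metric aggregation
Metric aggregations provide mathematical operations for statistical analysis of document fields. The following request computes the average, maximum, and minimum ticket price for each destination country: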
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "dest": {
      "terms": {
        "field": "DestCountry"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "AvgTicketPrice"
          }
        },
        "max_price": {
          "max": {
            "field": "AvgTicketPrice"
          }
        },
        "min_price": {
          "min": {
            "field": "AvgTicketPrice"
          }
        }
      }
    }
  }
}
The output is shown below, with the average, maximum, and minimum ticket price of flights to each country:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"dest" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3187,
"buckets" : [
{
"key" : "IT",
"doc_count" : 2371,
"max_price" : {
"value" : 1195.3363037109375
},
"min_price" : {
"value" : 100.57646942138672
},
"avg_price" : {
"value" : 586.9627099618385
}
},
{
"key" : "US",
"doc_count" : 1987,
"max_price" : {
"value" : 1199.72900390625
},
"min_price" : {
"value" : 100.14596557617188
},
"avg_price" : {
"value" : 595.7743908825026
}
},
{
"key" : "CN",
"doc_count" : 1096,
"max_price" : {
"value" : 1198.4901123046875
},
"min_price" : {
"value" : 102.90382385253906
},
"avg_price" : {
"value" : 640.7101617033464
}
},
{
"key" : "CA",
"doc_count" : 944,
"max_price" : {
"value" : 1198.8525390625
},
"min_price" : {
"value" : 100.5572509765625
},
"avg_price" : {
"value" : 648.7471090413757
}
},
{
"key" : "JP",
"doc_count" : 774,
"max_price" : {
"value" : 1199.4913330078125
},
"min_price" : {
"value" : 103.97209930419922
},
"avg_price" : {
"value" : 650.9203447346847
}
},
{
"key" : "RU",
"doc_count" : 739,
"max_price" : {
"value" : 1196.7423095703125
},
"min_price" : {
"value" : 101.0040054321289
},
"avg_price" : {
"value" : 662.9949632162009
}
},
{
"key" : "CH",
"doc_count" : 691,
"max_price" : {
"value" : 1196.496826171875
},
"min_price" : {
"value" : 101.3473129272461
},
"avg_price" : {
"value" : 575.1067587028537
}
},
{
"key" : "GB",
"doc_count" : 449,
"max_price" : {
"value" : 1197.78564453125
},
"min_price" : {
"value" : 111.34574890136719
},
"avg_price" : {
"value" : 650.5326856005696
}
},
{
"key" : "AU",
"doc_count" : 416,
"max_price" : {
"value" : 1197.6326904296875
},
"min_price" : {
"value" : 102.2943115234375
},
"avg_price" : {
"value" : 669.5588319668403
}
},
{
"key" : "PL",
"doc_count" : 405,
"max_price" : {
"value" : 1185.43701171875
},
"min_price" : {
"value" : 104.28328704833984
},
"avg_price" : {
"value" : 662.4497233072917
}
}
]
}
}
}
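The per-bucket metrics can be sketched in plain Python as follows. The prices are made-up values and price_stats_by_country is a hypothetical helper; the sketch only mirrors the logic of nesting avg, max, and min under a terms bucket.

```python
from collections import defaultdict

# Hypothetical documents; the field names match the request above,
# the values are invented for illustration.
flights = [
    {"DestCountry": "IT", "AvgTicketPrice": 100.0},
    {"DestCountry": "IT", "AvgTicketPrice": 300.0},
    {"DestCountry": "US", "AvgTicketPrice": 250.0},
    {"DestCountry": "US", "AvgTicketPrice": 750.0},
    {"DestCountry": "US", "AvgTicketPrice": 500.0},
]


def price_stats_by_country(docs):
    """Bucket by DestCountry, then compute avg/max/min of AvgTicketPrice per bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["DestCountry"]].append(doc["AvgTicketPrice"])
    return {
        country: {
            "doc_count": len(prices),
            "avg_price": sum(prices) / len(prices),
            "max_price": max(prices),
            "min_price": min(prices),
        }
        for country, prices in groups.items()
    }


stats = price_stats_by_country(flights)
print(stats)
```

The metrics are evaluated once per bucket, over only the documents that fell into that bucket.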
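Pipeline aggregation
A pipeline aggregation runs a second round of aggregation over the results of other aggregations.
The ticket-price statistics above nest aggregations inside a bucket. Strictly speaking that is a sub-aggregation rather than a pipeline aggregation, because it runs on the documents in each bucket and not on the output of another aggregation. The following example is of the same kind and counts the destination weather conditions within each country: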
GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "dest": {
      "terms": {
        "field": "DestCountry"
      },
      "aggs": {
        "weather": {
          "terms": {
            "field": "DestWeather"
          }
        }
      }
    }
  }
}
The result is as follows:
{
"took" : 18,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"dest" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3187,
"buckets" : [
{
"key" : "IT",
"doc_count" : 2371,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Clear",
"doc_count" : 428
},
{
"key" : "Sunny",
"doc_count" : 424
},
{
"key" : "Rain",
"doc_count" : 417
},
{
"key" : "Cloudy",
"doc_count" : 414
},
{
"key" : "Heavy Fog",
"doc_count" : 182
},
{
"key" : "Damaging Wind",
"doc_count" : 173
},
{
"key" : "Hail",
"doc_count" : 169
},
{
"key" : "Thunder & Lightning",
"doc_count" : 164
}
]
}
},
{
"key" : "US",
"doc_count" : 1987,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Rain",
"doc_count" : 371
},
{
"key" : "Clear",
"doc_count" : 346
},
{
"key" : "Sunny",
"doc_count" : 345
},
{
"key" : "Cloudy",
"doc_count" : 330
},
{
"key" : "Heavy Fog",
"doc_count" : 157
},
{
"key" : "Thunder & Lightning",
"doc_count" : 155
},
{
"key" : "Hail",
"doc_count" : 142
},
{
"key" : "Damaging Wind",
"doc_count" : 141
}
]
}
},
{
"key" : "CN",
"doc_count" : 1096,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Sunny",
"doc_count" : 209
},
{
"key" : "Rain",
"doc_count" : 207
},
{
"key" : "Clear",
"doc_count" : 192
},
{
"key" : "Cloudy",
"doc_count" : 173
},
{
"key" : "Thunder & Lightning",
"doc_count" : 86
},
{
"key" : "Hail",
"doc_count" : 81
},
{
"key" : "Heavy Fog",
"doc_count" : 79
},
{
"key" : "Damaging Wind",
"doc_count" : 69
}
]
}
},
{
"key" : "CA",
"doc_count" : 944,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Clear",
"doc_count" : 197
},
{
"key" : "Rain",
"doc_count" : 173
},
{
"key" : "Cloudy",
"doc_count" : 156
},
{
"key" : "Sunny",
"doc_count" : 148
},
{
"key" : "Damaging Wind",
"doc_count" : 80
},
{
"key" : "Thunder & Lightning",
"doc_count" : 69
},
{
"key" : "Heavy Fog",
"doc_count" : 62
},
{
"key" : "Hail",
"doc_count" : 59
}
]
}
},
{
"key" : "JP",
"doc_count" : 774,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Rain",
"doc_count" : 152
},
{
"key" : "Sunny",
"doc_count" : 138
},
{
"key" : "Clear",
"doc_count" : 130
},
{
"key" : "Cloudy",
"doc_count" : 123
},
{
"key" : "Damaging Wind",
"doc_count" : 66
},
{
"key" : "Heavy Fog",
"doc_count" : 58
},
{
"key" : "Thunder & Lightning",
"doc_count" : 57
},
{
"key" : "Hail",
"doc_count" : 50
}
]
}
},
{
"key" : "RU",
"doc_count" : 739,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Cloudy",
"doc_count" : 149
},
{
"key" : "Rain",
"doc_count" : 128
},
{
"key" : "Clear",
"doc_count" : 122
},
{
"key" : "Sunny",
"doc_count" : 117
},
{
"key" : "Thunder & Lightning",
"doc_count" : 62
},
{
"key" : "Hail",
"doc_count" : 56
},
{
"key" : "Damaging Wind",
"doc_count" : 55
},
{
"key" : "Heavy Fog",
"doc_count" : 50
}
]
}
},
{
"key" : "CH",
"doc_count" : 691,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Cloudy",
"doc_count" : 135
},
{
"key" : "Sunny",
"doc_count" : 134
},
{
"key" : "Clear",
"doc_count" : 128
},
{
"key" : "Rain",
"doc_count" : 115
},
{
"key" : "Heavy Fog",
"doc_count" : 51
},
{
"key" : "Hail",
"doc_count" : 46
},
{
"key" : "Damaging Wind",
"doc_count" : 41
},
{
"key" : "Thunder & Lightning",
"doc_count" : 41
}
]
}
},
{
"key" : "GB",
"doc_count" : 449,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Rain",
"doc_count" : 93
},
{
"key" : "Sunny",
"doc_count" : 81
},
{
"key" : "Clear",
"doc_count" : 77
},
{
"key" : "Cloudy",
"doc_count" : 71
},
{
"key" : "Heavy Fog",
"doc_count" : 34
},
{
"key" : "Hail",
"doc_count" : 32
},
{
"key" : "Damaging Wind",
"doc_count" : 31
},
{
"key" : "Thunder & Lightning",
"doc_count" : 30
}
]
}
},
{
"key" : "AU",
"doc_count" : 416,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Rain",
"doc_count" : 80
},
{
"key" : "Cloudy",
"doc_count" : 75
},
{
"key" : "Clear",
"doc_count" : 73
},
{
"key" : "Sunny",
"doc_count" : 57
},
{
"key" : "Hail",
"doc_count" : 38
},
{
"key" : "Thunder & Lightning",
"doc_count" : 34
},
{
"key" : "Heavy Fog",
"doc_count" : 32
},
{
"key" : "Damaging Wind",
"doc_count" : 27
}
]
}
},
{
"key" : "PL",
"doc_count" : 405,
"weather" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Clear",
"doc_count" : 74
},
{
"key" : "Rain",
"doc_count" : 71
},
{
"key" : "Cloudy",
"doc_count" : 67
},
{
"key" : "Sunny",
"doc_count" : 66
},
{
"key" : "Thunder & Lightning",
"doc_count" : 37
},
{
"key" : "Damaging Wind",
"doc_count" : 30
},
{
"key" : "Hail",
"doc_count" : 30
},
{
"key" : "Heavy Fog",
"doc_count" : 30
}
]
}
}
]
}
}
}
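The next example is a true pipeline aggregation. It uses buckets_path to point at the aggregation whose output it consumes (here avg_salary inside jobs), and min_bucket picks the bucket with the minimum value: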
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "min_salary_by_job": {
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}
The request above finds the job type with the lowest average salary. The result appears at the end of the output:
"min_salary_by_job" : {
"value" : 19250.0,
"keys" : [
"Javascript Programmer"
]
}
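Replacing min_bucket with max_bucket returns the highest value instead; avg_bucket returns the mean across the buckets; stats_bucket returns the minimum, maximum, average, and related statistics.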
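As a rough illustration of what min_bucket and max_bucket do, the following Python sketch first computes the average salary per job and then selects the extreme bucket from those averages. The (job, salary) pairs are copied from the employees bulk request that appears later in this article; the helper names are assumptions made for this sketch.

```python
from collections import defaultdict

# (job, salary) pairs copied from the employees bulk request in this article.
employees = [
    ("Product Manager", 35000), ("Dev Manager", 50000),
    ("Web Designer", 18000), ("Web Designer", 22000),
    ("QA", 18000), ("QA", 25000), ("QA", 20000),
    ("Java Programmer", 20000), ("Java Programmer", 22000),
    ("Java Programmer", 9000), ("Java Programmer", 38000),
    ("Java Programmer", 32000), ("Java Programmer", 30000),
    ("Javascript Programmer", 25000), ("Java Programmer", 28000),
    ("Javascript Programmer", 16000), ("Javascript Programmer", 16000),
    ("Javascript Programmer", 20000), ("DBA", 30000), ("DBA", 20000),
]


def avg_salary_by_job(rows):
    """First stage: terms bucket on job with an avg sub-aggregation on salary."""
    groups = defaultdict(list)
    for job, salary in rows:
        groups[job].append(salary)
    return {job: sum(s) / len(s) for job, s in groups.items()}


def extreme_bucket(bucket_values, pick):
    """Second stage: apply `pick` (min or max) to the per-bucket values
    and report every bucket key that attains that value."""
    value = pick(bucket_values.values())
    keys = sorted(k for k, v in bucket_values.items() if v == value)
    return {"value": value, "keys": keys}


averages = avg_salary_by_job(employees)
min_salary_by_job = extreme_bucket(averages, min)
max_salary_by_job = extreme_bucket(averages, max)
print(min_salary_by_job)
print(max_salary_by_job)
```

The second stage never touches the documents; it only reads the numbers produced by the first stage, which is what distinguishes a pipeline aggregation from a sub-aggregation.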
Matrix aggregation
A matrix aggregation operates on multiple fields at once and produces a result matrix.
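In Elasticsearch this role is played by the matrix_stats aggregation, which reports per-field statistics together with the covariance and correlation between fields. The sketch below only illustrates the idea of a result matrix by computing a Pearson correlation matrix in plain Python. The rows are a handful of age and salary values taken from the employees data in this article, and the use of sample covariance (dividing by n - 1) is an assumption of the sketch, not a statement about Elasticsearch internals.

```python
import math

# A few (age, salary) rows taken from the employees data in this article.
docs = [
    {"age": 32, "salary": 35000},
    {"age": 41, "salary": 50000},
    {"age": 25, "salary": 18000},
    {"age": 26, "salary": 22000},
    {"age": 20, "salary": 9000},
]


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (assumption: n - 1 divisor)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation_matrix(rows, fields):
    """Return a nested dict: matrix[a][b] is the Pearson correlation of fields a and b."""
    columns = {f: [row[f] for row in rows] for f in fields}
    matrix = {}
    for a in fields:
        matrix[a] = {}
        for b in fields:
            std_a = math.sqrt(covariance(columns[a], columns[a]))
            std_b = math.sqrt(covariance(columns[b], columns[b]))
            matrix[a][b] = covariance(columns[a], columns[b]) / (std_a * std_b)
    return matrix


matrix = correlation_matrix(docs, ["age", "salary"])
print(matrix)
```

The output has one row and one column per field, so adding a third field would turn the 2 x 2 matrix into a 3 x 3 one.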
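Scope of aggregations
By default an aggregation runs over the result set of the query. The scope can also be changed with Filter, Post Filter, and Global.
First, recreate the employees index with the following mapping: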
DELETE /employees
PUT /employees/
{
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "gender": {
        "type": "keyword"
      },
      "job": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 50
          }
        }
      },
      "name": {
        "type": "keyword"
      },
      "salary": {
        "type": "integer"
      }
    }
  }
}
Then insert the data:
PUT /employees/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : { "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : { "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : { "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : { "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : { "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : { "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : { "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : { "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : { "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : { "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : { "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : { "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : { "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : { "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : { "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : { "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : { "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
Scope: the query result set
By default the aggregation is scoped to the query result set, as in the following example:
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 30
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}
The output lists each bucket; only the 10 employees aged 30 or above are counted:
{
....
"hits" : {
"total" : {
"value" : 10,
"relation" : "eq"
},
...
},
"aggregations" : {
"jobs" : {
...
"buckets" : [
{
"key" : "Java Programmer",
"doc_count" : 5
},
{
"key" : "DBA",
"doc_count" : 1
},
....
]
}
}
}
Changing the scope with filter
The following example uses filter to change the scope of an aggregation:
POST employees/_search
{
  "size": 0,
  "aggs": {
    "older_person": {
      "filter": {
        "range": {
          "age": {
            "gte": 35
          }
        }
      },
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword"
          }
        }
      }
    },
    "all_jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}
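The filter restricts the older_person aggregation to documents with age ≥ 35, while all_jobs below it serves as a control that still covers every document. The output of older_person is: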
"older_person" : {
"doc_count" : 2,
"jobs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Dev Manager",
"doc_count" : 1
},
{
"key" : "Java Programmer",
"doc_count" : 1
}
]
}
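Changing the scope with post_filter
The following example uses post_filter. It filters the search hits after the aggregations have been computed, so the aggregation results themselves are not affected: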
POST employees/_search
{
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  },
  "post_filter": {
    "match": {
      "job.keyword": "Dev Manager"
    }
  }
}
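The output is shown below: hits contains only the Dev Manager document, while the jobs aggregation still covers every job type: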
{
....
"hits" : {
...
"hits" : [
{
"_index" : "employees",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "Underwood",
"age" : 41,
"job" : "Dev Manager",
"gender" : "male",
"salary" : 50000
}
}
]
},
"aggregations" : {
"jobs" : {
...
"buckets" : [
{
"key" : "Java Programmer",
"doc_count" : 7
},
{
"key" : "Javascript Programmer",
"doc_count" : 4
},
....
{
"key" : "Dev Manager",
"doc_count" : 1
},
...
]
}
}
}
Global aggregation
The following example uses global, which ignores the conditions imposed by the query:
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 40
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    },
    "all": {
      "global": {},
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    }
  }
}
The output is shown below: jobs lists the job types of employees aged 40 or above, whereas all reports the average salary of all employees:
"aggregations" : {
"all" : {
"doc_count" : 20,
"avg_salary" : {
"value" : 24700.0
}
},
"jobs" : {
...
"buckets" : [
{
"key" : "Dev Manager",
"doc_count" : 1
}
]
}
}
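The four scopes can be compared side by side with a small Python sketch over the same 20 employees. The tuples are (age, job, salary) values copied from the bulk request above; the helper names are assumptions, and the sketch models only which documents each aggregation sees, not how Elasticsearch executes the request.

```python
from collections import Counter

# (age, job, salary) copied from the employees bulk request above.
EMPLOYEES = [
    (32, "Product Manager", 35000), (41, "Dev Manager", 50000),
    (25, "Web Designer", 18000), (26, "Web Designer", 22000),
    (25, "QA", 18000), (31, "QA", 25000), (27, "QA", 20000),
    (27, "Java Programmer", 20000), (32, "Java Programmer", 22000),
    (20, "Java Programmer", 9000), (36, "Java Programmer", 38000),
    (31, "Java Programmer", 32000), (30, "Java Programmer", 30000),
    (32, "Javascript Programmer", 25000), (33, "Java Programmer", 28000),
    (21, "Javascript Programmer", 16000), (25, "Javascript Programmer", 16000),
    (29, "Javascript Programmer", 20000), (30, "DBA", 30000), (29, "DBA", 20000),
]


def job_counts(rows):
    """Equivalent of a terms aggregation on job.keyword."""
    return dict(Counter(job for _, job, _ in rows))


def avg_salary(rows):
    """Equivalent of an avg aggregation on salary."""
    return sum(salary for _, _, salary in rows) / len(rows)


# 1. Default scope: the aggregation sees only what the query matched (age >= 40).
query_hits = [row for row in EMPLOYEES if row[0] >= 40]
jobs_in_query_scope = job_counts(query_hits)

# 2. filter aggregation: narrows the scope of one aggregation only (age >= 35).
older_person = [row for row in EMPLOYEES if row[0] >= 35]
older_person_jobs = job_counts(older_person)

# 3. post_filter: aggregations are computed first, then the hits are filtered.
jobs_before_post_filter = job_counts(EMPLOYEES)
hits_after_post_filter = [row for row in EMPLOYEES if row[1] == "Dev Manager"]

# 4. global: ignores the query and aggregates over every document.
global_avg_salary = avg_salary(EMPLOYEES)

print(jobs_in_query_scope, older_person_jobs, global_avg_salary)
```

The numbers agree with the responses shown in this section: two documents in older_person, a single Dev Manager hit after post_filter, and a global average salary of 24700.0.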
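Sorting
Aggregation buckets are sorted by specifying order. The following example queries documents with age ≥ 20 and sorts the buckets first by document count in ascending order, then by job type in descending order: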
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [
          {"_count": "asc"},
          {"_key": "desc"}
        ]
      }
    }
  }
}
The output is as follows:
"aggregations" : {
"jobs" : {
...
"buckets" : [
{
"key" : "Product Manager",
"doc_count" : 1
},
{
"key" : "Dev Manager",
"doc_count" : 1
},
{
"key" : "Web Designer",
"doc_count" : 2
},
{
"key" : "DBA",
"doc_count" : 2
},
{
"key" : "QA",
"doc_count" : 3
},
{
"key" : "Javascript Programmer",
"doc_count" : 4
},
{
"key" : "Java Programmer",
"doc_count" : 7
}
]
}
}
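The two-level ordering can be reproduced with a short Python sketch. The counts are the doc_count values from the response above, and order_buckets is a hypothetical helper. It relies on the fact that Python's sort is stable, so sorting by the secondary key first and the primary key second yields the combined order.

```python
# doc_count per job, taken from the response above.
job_counts = {
    "Java Programmer": 7,
    "Javascript Programmer": 4,
    "QA": 3,
    "DBA": 2,
    "Web Designer": 2,
    "Dev Manager": 1,
    "Product Manager": 1,
}


def order_buckets(counts):
    """Order buckets by _count ascending, breaking ties by _key descending."""
    buckets = [{"key": key, "doc_count": n} for key, n in counts.items()]
    # Secondary criterion first: key descending.
    buckets.sort(key=lambda b: b["key"], reverse=True)
    # Primary criterion second: count ascending. Stability keeps the key order within ties.
    buckets.sort(key=lambda b: b["doc_count"])
    return buckets


ordered_keys = [b["key"] for b in order_buckets(job_counts)]
print(ordered_keys)
```

Within each group of equal counts the keys appear in reverse alphabetical order, which is why Product Manager precedes Dev Manager and Web Designer precedes DBA.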
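Principle and accuracy
Take the min aggregation as an example. As the accompanying figure shows, each of the three nodes computes the minimum on its own primary shard, and those partial results are then reduced to a single minimum.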
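A minimal sketch of this two-step reduction, assuming three shards with made-up salary values (distributed_min is a hypothetical name):

```python
# Hypothetical values stored on three primary shards.
shards = [
    [35000, 18000, 22000],
    [9000, 38000],
    [25000, 16000, 20000],
]


def distributed_min(shard_values):
    """Compute a local minimum per shard, then reduce the partial results."""
    shard_minimums = [min(values) for values in shard_values if values]
    return min(shard_minimums)


result = distributed_min(shards)
print(result)
```

Because the minimum of the per-shard minimums always equals the global minimum, this kind of metric stays exact no matter how the documents are distributed.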
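The terms aggregation works as shown in the accompanying figure: each node first returns its own top three buckets, and the resulting nine candidates are merged, with the top three of the merged list returned as the final result.
However, the result obtained this way is not necessarily correct, as the following example shows:
In the figure, the left shard returns A, B, and C, and the right shard returns A, B, and D. After merging and taking the top three, the result is A(12), B(6), C(4). But D actually has 6 documents in total and is still excluded from the final result, so the result is wrong. In addition, doc_count_error_upper_bound in the figure is the maximum number of documents that may have been missed for a bucket: the left shard can miss at most 4 and the right shard at most 3, so at most 7 documents are missed in total. sum_other_doc_count is the number of documents not covered by the returned buckets: the figure returns 12 + 6 + 4 = 22 documents, which leaves 7.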
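The effect can be simulated in a few lines of Python. The per-shard counts below are hypothetical values chosen to match the totals quoted above (A = 12, B = 6, D = 6, 29 documents overall); they are not read from the figure. For simplicity the sketch lets each shard return exactly size candidates unless shard_size is given, whereas Elasticsearch's real default for shard_size is larger than size.

```python
from collections import Counter

# Hypothetical per-shard term counts, chosen to match the totals in the text.
SHARDS = [
    {"A": 6, "B": 4, "C": 4, "D": 3},
    {"A": 6, "B": 2, "C": 1, "D": 3},
]


def top_n(counts, n):
    """Return the n largest (term, count) pairs; ties are broken by term."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def distributed_terms(shards, size, shard_size=None):
    """Merge the top `shard_size` terms of every shard, then keep the top `size`."""
    shard_size = size if shard_size is None else shard_size
    merged = Counter()
    for shard in shards:
        for term, count in top_n(shard, shard_size):
            merged[term] += count
    return top_n(merged, size)


approximate = distributed_terms(SHARDS, size=3)
exact = distributed_terms(SHARDS, size=3, shard_size=4)
returned_docs = sum(count for _, count in approximate)
total_docs = sum(sum(shard.values()) for shard in SHARDS)
print(approximate, exact, total_docs - returned_docs)
```

With three candidates per shard, D reaches the merge step with only 3 of its 6 documents and loses to C. Letting each shard return four candidates restores the true top three.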
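How can the inaccuracy of terms be addressed? Either set the number of primary shards to 1, or increase shard_size so that more candidate buckets are fetched from each shard: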