写在前面
本文看下es的聚合相关内容。
1:什么是聚合
即,数据的统计分析。如sum,count,avg,min,max,分组等。
2:支持哪些聚合类型
2.1:bucket aggregation
对满足特定条件的文旦进行分组。如:
支持如下几种方式来对数据进行分桶:
1:terms 按照词项的值分桶
2:range 按照指定的数据范围分桶
3:histogram 按照直方图进行分桶,可以认为是range分桶的一种特殊情况
4:cardinality 按照terms分桶,然后计算桶的个数,这种准确来说是terms分区+metric的一种方式
- range 分桶
- histogram 分桶
- cardinality
POST kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"天气的个数": {
"cardinality": {
"field": "OriginWeather"
}
}
}
}
第一步:按照OriginWeather分成若干个桶,第二步计算桶的个数。相当于count(distinct field)
。
2.1:metric aggregation
数学运算。如:
2.3:pipeline aggregation
类似于linux命令中的管道,即在上一个agg的基础上再进行一次聚合分析。
从类型上分为两类,sibling和parent,如下:
2.4:matrix aggregation
对多个字段操作并提供一个结果矩阵???嘛意思???
3:实战
3.1:bucket aggregation
- 按照飞行目的地分组:
测试数据参考这里 。但如果你是按照专栏顺序看的,我想你本地已经有这些数据了。
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"飞行目的地分组": {
"terms": {
"field": "DestCountry"
}
}
}
}
- range和histogram分组
首先准备测试数据:
DELETE /employees
PUT /employees/
{
"mappings" : {
"properties" : {
"age" : {
"type" : "integer"
},
"gender" : {
"type" : "keyword"
},
"job" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 50
}
}
},
"name" : {
"type" : "keyword"
},
"salary" : {
"type" : "integer"
}
}
}
}
PUT /employees/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : { "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : { "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : { "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : { "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : { "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : { "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : { "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : { "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : { "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : { "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : { "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : { "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : { "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : { "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : { "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : { "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : { "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
range分组例子:
POST employees/_search
{
"size": 0,
"aggs": {
"薪酬区间": {
"range": {
"field": "salary",
"ranges": [
{
"to": 10000
},
{
"from": 10000,
"to": 20000
},
{
"key": "->20000",
"from": 20000
}
]
}
}
}
}
3.2:metric aggregation
按照飞行目的地分组后,对分组的数据取最大最小和平均值。
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"飞行目的地分组": {
"terms": {
"field": "DestCountry"
},
"aggs": {
"票价平均值": {
"avg": {
"field": "AvgTicketPrice"
}
},
"票价最大值": {
"max": {
"field": "AvgTicketPrice"
}
},
"票价最小值": {
"min": {
"field": "AvgTicketPrice"
}
}
}
}
}
}
按照飞行目的地分组后,通过stats取每个分组的统计信息(会列出min,max,avg等常用统计信息)
,并且对每个分组再按照天气做一次分组,并取前5条数据:
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"飞行目的地分组": {
"terms": {
"field": "DestCountry"
},
"aggs": {
"组内统计信息": {
"stats": {
"field": "AvgTicketPrice"
}
},
"weatherrr": {
"terms": {
"field": "DestWeather",
"size": 5
}
}
}
}
}
}
以上的例子是通过agg嵌套的方式先对数据分桶,然后再进行聚合,其实,也可以直接进行metric运行,此时查询到的数据集相当于都在一个桶中:
# 查询最远距离,最近距离DistanceKilometers
POST kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"最远距离": {
"max": {
"field": "DistanceKilometers"
}
},
"最近距离": {
"min": {
"field": "DistanceKilometers"
}
},
"平均距离": {
"avg": {
"field": "DistanceKilometers"
}
}
}
}
3.3:pipeline aggregation
准备数据:
DELETE employees
PUT /employees/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : { "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : { "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : { "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : { "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : { "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : { "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : { "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : { "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : { "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : { "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : { "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : { "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : { "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : { "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : { "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : { "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : { "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
3.3.1:sibling
与要基于其计算的聚合结果是平级的关系。
- 找出平均工资最低的工作类型
分为3步:1:按照工作类型进行分组 2:计算出每个分组的平均工资 3:找出平均工资最低的分组(使用pipeline aggr)
POST employees/_search
{
"size": 0,
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
}
}
},
"min_salary_by_job": {
"min_bucket": {
"buckets_path": "jobs.avg_salary"
}
}
}
}
类似的还有。
- 平均工资最高的工种
POST employees/_search
{
"size": 0,
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
}
}
},
"平均工资最高的工种": {
"max_bucket": {
"buckets_path": "jobs.avg_salary"
}
}
}
}
结果:
- 每个工种平均工资的平均工资
POST employees/_search
{
"size": 0,
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
}
}
},
"每个工种平均工资的平均工资": {
"avg_bucket": {
"buckets_path": "jobs.avg_salary"
}
}
}
}
结果:
- 多值输出
POST employees/_search
{
"size": 0,
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"avg_salary": {
"avg": {
"field": "salary"
}
}
}
},
"多值输出": {
"stats_bucket": {
"buckets_path": "jobs.avg_salary"
}
}
}
}
结果:
3.3.2:parent
包含在要基于其计算的聚合结果中,即上个聚合结果作为其parent。
4:聚合的作用范围
4.1:什么是聚合的作用范围以及有哪些作用范围
基于哪些文档 进行聚合操作,叫做作用范围,在我们前面的例子中,我们没有加任何的作用范围,所以是针对的所有文档。
es提供了如下几种作用范围:
1:query
2:filter
3:post filter
4:global
4.2:各种作用范围实例
准备数据:
DELETE /employees
PUT /employees/
{
"mappings" : {
"properties" : {
"age" : {
"type" : "integer"
},
"gender" : {
"type" : "keyword"
},
"job" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 50
}
}
},
"name" : {
"type" : "keyword"
},
"salary" : {
"type" : "integer"
}
}
}
}
PUT /employees/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : { "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : { "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : { "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : { "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : { "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : { "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : { "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : { "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : { "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : { "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : { "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : { "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : { "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : { "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : { "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : { "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : { "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
4.2.1:query
基于es的query搜索功能先限定agg的作用范围。
POST employees/_search
{
"size": 0,
"query": {
"range": {
"age": {
"gte": 40
}
}
},
"aggs": {
"员工年龄大于40工作们": {
"terms": {
"field": "job.keyword"
}
}
}
}
再比如查询:
POST employees/_search
{
"size": 0,
"query": {
"range": {
"age": {
"gte": 40
}
}
},
"aggs": {
"员工年龄大于40工作们": {
"terms": {
"field": "job.keyword"
}
},
"员工年龄大于40员工最低工资": {
"min": {
"field": "salary"
}
}
}
}
4.2.2:filter
通过filter可以对某个聚合操作设置单独的作用范围,如下查询年龄20到30的最低工资和年龄30到40的最高工资:
POST employees/_search
{
"size": 0,
"aggs": {
"20到30员工": {
"filter": {
"range": {
"age": {
"from": 20,
"to": 30
}
}
},
"aggs": {
"最低工资": {
"min": {
"field": "salary"
}
}
}
},
"30到40员工": {
"filter": {
"range": {
"age": {
"from": 30,
"to": 40
}
}
},
"aggs": {
"最高工资": {
"max": {
"field": "salary"
}
}
}
}
}
}
4.2.3:post filter
用于对最终显式的hits结果做过滤,比如总共有10个文档被应用于聚合,但是并不希望展示这所有的10个文档,而只是在hits中返回满足特定条件
的文档。post filter就是用来表达特定条件的。
也就是说,post filter和agg聚合本身没有毛线关系
,只是用来影响hits结果。我猜因为常规的query过滤会影响aggs聚合,所以增加了post filter吧!
- 按照工作分组并获取工资高于3万的员工信息
POST employees/_search
{
"size": 100,
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword"
}
}
},
"post_filter": {
"range": {
"salary": {
"gt": 30000
}
}
}
}
4.2.4:global
默认的query方式的作用范围会影响所有的agg聚合,如果希望某个聚合不受其影响的话,则可以使用global,如:
POST employees/_search
{
"size": 0,
"query": {
"range": {
"age": {
"gte": 40
}
}
},
"aggs": {
"年龄大于40的jobs": {
"terms": {
"field": "job.keyword"
}
},
"所有的": {
"global": {},
"aggs": {
"所有的jobs": {
"terms": {
"field": "job.keyword"
}
}
}
}
}
}