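Contents
stats: the five values count, max, min, avg, sum
Percentile ranks: the percentage of documents whose value is less than or equal to a given value
Geo Bounds aggregation: the bounding range of the coordinate points in the document set
Geo Centroid aggregation: the coordinates of the center point
filter Aggregation: aggregate over the documents that match a filter query
Date Range Aggregation: group by date ranges
Date Histogram Aggregation: time histogram (bar chart) aggregation
Geo Distance Aggregation: group by geographic distance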
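Introduction to aggregations
What is aggregation analysis in ES?
Aggregation analysis is an important database feature: it performs aggregate computations over the data set returned by a query, such as finding the maximum or minimum of a field (or of a computed expression), or computing a sum or an average. ES, being both a search engine and a database, provides equally powerful aggregation capabilities.
- Aggregating a data set into indicators such as max, min, sum, and average is called a metric aggregation in ES.
- Relational databases, besides aggregate functions, can also group the queried data with group by and then compute metrics on each group. In ES, group by is called bucketing, that is, a bucket aggregation.
ES also provides matrix aggregations and pipeline aggregations, but these are still being refined.
How to write an aggregation query
Define the aggregations in the request body under the aggregations node, using the following syntax: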
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"meta" : { [<meta_data_body>] } ]?
[,"aggregations" : { [<sub_aggregation>]+ } ]?
}
[,"<aggregation_name_2>" : { ... } ]*
}
// aggregations can also be abbreviated as aggs
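The syntax above maps directly onto nested dictionaries. The following is a minimal Python sketch that assembles such a request body. The helper name `build_agg` and its parameters are hypothetical, chosen only for illustration, and nothing here talks to a cluster.

```python
import json

def build_agg(name, agg_type, body, sub_aggs=None, meta=None):
    """Return {name: {agg_type: body, ...}} following the syntax above.

    build_agg is a hypothetical helper, not part of any client library.
    """
    node = {agg_type: body}
    if meta is not None:
        node["meta"] = meta
    if sub_aggs:
        node["aggs"] = sub_aggs  # "aggs" is the short form of "aggregations"
    return {name: node}

# Group by age, then compute the maximum balance inside each group.
sub = build_agg("max_balance", "max", {"field": "balance"})
request_body = {
    "size": 0,
    "aggs": build_agg("age_terms", "terms", {"field": "age"}, sub_aggs=sub),
}
print(json.dumps(request_body, indent=2))
```

The printed JSON could be sent as the body of a `_search` request such as the ones in this article.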
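Value sources for aggregations
The values an aggregation computes over can be taken from a field or from the result of a script.
Metric aggregations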
max min sum avg
POST /bank/_search?
{
"size": 0,
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
// Maximum balance across all customers
POST /bank/_search?
{
"size": 2,
"query": {
"match": {
"age": 24
}
},
"sort": [
{
"balance": {
"order": "desc"
}
}
],
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
// Maximum balance among customers aged 24
POST /bank/_search?size=0
{
"aggs" : {
"avg_age" : {
"avg" : {
"script" : {
"source" : "doc.age.value"
}
}
},
"avg_age10" : {
"avg" : {
"script" : {
"source" : "doc.age.value + 10"
}
}
}
}}
// The values come from a script
// Average age of all customers
POST /bank/_search?size=0
{
"aggs": {
"sum_balance": {
"sum": {
"field": "balance",
"script": {
"source": "_value * 1.03"
}
}
}
}
}
// With field specified, the script reads the field value through _value
POST /bank/_search?size=0
{
"aggs": {
"avg_age": {
"avg": {
"field": "age",
"missing": 18
}
}
}
}
// missing supplies a value for documents that lack the field. If it is not specified, documents lacking the field are ignored.
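The effect of `missing` can be illustrated offline. The sketch below is plain Python over a hypothetical list of documents, not a call to Elasticsearch. It only mirrors the rule stated above: documents lacking the field are skipped unless a default is supplied.

```python
def avg_with_missing(docs, field, missing=None):
    """Average of `field`; documents lacking it are skipped unless `missing` is given."""
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)  # substitute the default for the absent field
    return sum(values) / len(values) if values else None

# Hypothetical sample documents; the third one has no age.
docs = [{"age": 30}, {"age": 40}, {"name": "no age"}]
print(avg_with_missing(docs, "age"))              # third document is ignored
print(avg_with_missing(docs, "age", missing=18))  # third document counts as 18
```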
Document count (count)
POST /bank/_doc/_count
{
"query": {
"match": {
"age" : 24
}
}
}
Value count: counts the documents that have a value for a field
POST /bank/_search?size=0
{
"aggs" : {
"age_count" : { "value_count" : { "field" : "age" } }
}
}
cardinality: count of distinct values
POST /bank/_search?size=0
{
"aggs": {
"age_count": {
"cardinality": {
"field": "age"
}
},
"state_count": {
"cardinality": {
"field": "state.keyword"
}
}
}
}
// For state, use its keyword sub-field
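The three counts above answer different questions. A minimal Python sketch over hypothetical in-memory documents shows the difference. Note that this sketch counts exactly, whereas the cardinality aggregation in Elasticsearch is an approximate count based on HyperLogLog++.

```python
def field_values(doc, field):
    """Return the values of `field` in `doc` as a list (empty if absent)."""
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def doc_count(docs):
    return len(docs)

def value_count(docs, field):
    # every value counts, so a multi-valued field contributes more than once
    return sum(len(field_values(doc, field)) for doc in docs)

def cardinality(docs, field):
    # exact distinct count; the real aggregation is approximate
    return len({v for doc in docs for v in field_values(doc, field)})

# Hypothetical sample: one multi-valued document and one without the field.
docs = [{"age": 24}, {"age": 24}, {"age": [30, 31]}, {}]
print(doc_count(docs), value_count(docs, "age"), cardinality(docs, "age"))
```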
stats: the five values count, max, min, avg, sum
POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
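Extended stats
Extended statistics. Returns four more results than stats: sum of squares, variance, standard deviation, and the interval given by the mean plus/minus two standard deviations.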
POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"extended_stats": {
"field": "age"
}
}
}
}
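To make the extra statistics concrete, here is a small Python sketch that computes them for a non-empty list of numbers. It assumes the population variance (sum of squares divided by count, minus the squared mean). The function name and the output keys are illustrative and only loosely follow the response format.

```python
import math

def extended_stats(values):
    """Compute stats plus the four extended results for a non-empty list."""
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance (assumption of this sketch).
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std = math.sqrt(variance)
    return {
        "count": count, "min": min(values), "max": max(values),
        "avg": avg, "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {"upper": avg + 2 * std, "lower": avg - 2 * std},
    }

result = extended_stats([1, 2, 3, 4])
print(result)
```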
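Percentiles: the values at given percentiles
Sorts the values of the specified field (or script) in ascending order, accumulates the share of documents at each value (as a percentage of all matching documents), and returns the value at each requested percentage. By default the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles are returned. The middle entry of the result below can be read as: 50% of the documents have age <= 31, or conversely, documents with age <= 31 make up 50% of all matching documents.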
POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age"
}
}
}
}
"aggregations": {
"age_percents": {
"values": {
"1.0": 20,
"5.0": 21,
"25.0": 25,
"50.0": 31,
"75.0": 35,
"95.0": 39,
"99.0": 40
}
}
}
POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age",
"percents" : [95, 99, 99.9]
}
}
}
}
// Request specific percentiles
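The reading "p% of the documents have a value <= v" can be sketched with the nearest-rank method in Python. This is only an illustration on exact data. Elasticsearch computes percentiles approximately and may interpolate, so its results can differ from this sketch.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical ages, one document each.
ages = [20, 21, 25, 31, 35, 39, 40]
for p in (1, 50, 99):
    print(p, percentile(ages, p))
```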
Percentile ranks: the percentage of documents whose value is less than or equal to a given value
POST /bank/_search?size=0
{
"aggs": {
"gge_perc_rank": {
"percentile_ranks": {
"field": "age",
"values": [
25,
30
]
}
}
}
}
"aggregations": {
"gge_perc_rank": {
"values": {
"25.0": 26.1,
"30.0": 49.3
}
}
}
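A percentile rank is the inverse question of a percentile. A minimal exact Python sketch follows, using hypothetical data. The real aggregation is approximate, so its numbers may differ slightly.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to `threshold`."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100 * at_or_below / len(values)

# Hypothetical ages, one document each.
ages = [20, 25, 25, 30, 40]
print({str(float(v)): percentile_rank(ages, v) for v in (25, 30)})
```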
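Geo Bounds aggregation: the bounding range of the coordinate points in the document set
Geo Centroid aggregation: the coordinates of the center point
Bucket aggregations
Terms Aggregation: group by the terms (values) of a field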
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age"
}
}
}
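}
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": 0, // upper bound of the error on the document counts
"sum_other_doc_count": 463, // number of documents in the terms that were not returned
"buckets": [ // by default, the top 10 groups ordered by document count, highest first
{
"key": 31,
"doc_count": 61
},
{
"key": 39,
"doc_count": 60
},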
{
"key": 26,
"doc_count": 59
},
….
]
}
}
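The shape of this response can be reproduced on a single in-memory list with a short Python sketch. It ignores sharding entirely, so there is no counting error to report, and breaking ties by ascending key is an assumption of the sketch.

```python
from collections import Counter

def terms_agg(values, size=10):
    """Group equal values, keep the `size` largest groups, count the rest."""
    counts = Counter(values)
    # Highest count first; ties broken by ascending key (assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    buckets = [{"key": k, "doc_count": c} for k, c in ordered[:size]]
    other = sum(c for _, c in ordered[size:])
    return {"sum_other_doc_count": other, "buckets": buckets}

# Hypothetical ages, one per document.
ages = [31, 31, 31, 39, 39, 26]
print(terms_agg(ages, size=2))
```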
- size: how many groups to return
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 20
}
} }}
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 5,
"shard_size":20
}
} }}
// shard_size: how many groups each shard returns
// Default shard_size: equal to size when the index has a single shard; size * 1.5 + 10 when it has multiple shards
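The default rule stated above is simple arithmetic. The Python sketch below just encodes that rule as written. The function name is hypothetical, and truncating the result to an integer is an assumption.

```python
def default_shard_size(size, number_of_shards):
    """Default shard_size according to the rule quoted above."""
    if number_of_shards == 1:
        return size
    return int(size * 1.5 + 10)  # truncation to an integer is an assumption

for size, shards in ((5, 1), (5, 3), (10, 2)):
    print(size, shards, default_shard_size(size, shards))
```

Asking each shard for more groups than the final `size` lowers the chance that a group that is frequent overall is missed because it was not frequent enough on one shard.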
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 5,
"shard_size":20,
"show_term_doc_count_error": true
} } }}
// Show the document count error on each group
- order: how the groups are sorted
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order" : { "_count" : "asc" }
}
}
}
}
// Sort by document count
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order" : { "_key" : "asc" }
}
}
}
}
// Sort by group key
- Computing metric values per group
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"max_balance": "asc"
}
},
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
},
"min_balance": {
"min": {
"field": "balance"
}
} } } }}
- Sorting by a per-group metric value
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"max_balance": "asc"
}
},
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
} }}
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"stats_balance.max": "asc"
}
},
"aggs": {
"stats_balance": {
"stats": {
"field": "balance"
}
}
}
} }}
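Ordering buckets by a sub-aggregation amounts to "group, compute the metric per group, sort by the metric". A minimal Python sketch over hypothetical documents follows. The function name and the `max` key in each bucket are illustrative only.

```python
from collections import defaultdict

def terms_ordered_by_max(docs, group_field, metric_field, size=10):
    """Group docs by `group_field` and order groups by max(`metric_field`) ascending."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = [
        {"key": key, "doc_count": len(vals), "max": max(vals)}
        for key, vals in groups.items()
    ]
    buckets.sort(key=lambda b: b["max"])  # corresponds to "asc"
    return buckets[:size]

# Hypothetical sample documents.
docs = [
    {"age": 20, "balance": 100},
    {"age": 20, "balance": 500},
    {"age": 30, "balance": 300},
]
print(terms_ordered_by_max(docs, "age", "balance"))
```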
- Filtering groups
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"min_doc_count": 60
}
}
}
}
// Filter by document count
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"include": [20,24]
}
}
}
}
// Keep only the listed values
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"include" : ".*sport.*",
"exclude" : "water_.*"
}
}
}
}
// Match values with regular expressions
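How include and exclude combine can be sketched in Python. The sketch assumes that a pattern must match the whole term and that exclude takes precedence when both are given. It uses Python's `re` syntax, which differs from the Lucene regular expression syntax in general, although these two patterns read the same in both.

```python
import re

def filter_terms(terms, include=None, exclude=None):
    """Keep terms that fully match `include` and do not fully match `exclude`."""
    kept = []
    for term in terms:
        if exclude is not None and re.fullmatch(exclude, term):
            continue  # exclude wins over include
        if include is not None and not re.fullmatch(include, term):
            continue
        kept.append(term)
    return kept

# Hypothetical tag values.
tags = ["sport", "water_sports", "motorsport", "chess"]
print(filter_terms(tags, include=".*sport.*", exclude="water_.*"))
```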
GET /_search
{
"aggs" : {
"JapaneseCars" : {
"terms" : {
"field" : "make",
"include" : ["mazda", "honda"]
}
},
"ActiveCarManufacturers" : {
"terms" : {
"field" : "make",
"exclude" : ["rover", "jensen"]
}
}
}
}
// Specify exact value lists
- Grouping by a script-computed value
GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"script" : {
"source": "doc['genre'].value",
"lang": "painless"
}
}
}
}
}
- Handling missing values
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"missing": "N/A"
}
}
}
}
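filter Aggregation: aggregate over the documents that match a filter query
From the documents matched by the query, select those that satisfy the filter condition and aggregate over them.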
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"filter": {"match":{"gender":"F"}},
"aggs": {
"avg_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
Filters Aggregation: aggregate over several filter groups
PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }
GET logs/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"errors" : { "match" : { "body" : "error" }},
"warnings" : { "match" : { "body" : "warning" }}
}
} } }}
GET logs/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"other_bucket_key": "other_messages",
"filters" : {
"errors" : { "match" : { "body" : "error" }},
"warnings" : { "match" : { "body" : "warning" }}
}
}
}
}
}
// Specify a key for the bucket of all other documents
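The bucketing done by the two requests above can be imitated in Python on the three sample log lines. The `tokens` function is a crude stand-in for text analysis (lowercase, split on non-word characters) and is an assumption of this sketch, as are the function names.

```python
import re

def tokens(text):
    """Very rough tokenizer: lowercase words only (simplifying assumption)."""
    return set(re.findall(r"\w+", text.lower()))

def filters_agg(docs, filters, other_bucket_key=None):
    """Count docs per named filter; a doc may fall into several buckets."""
    buckets = {name: {"doc_count": 0} for name in filters}
    other = 0
    for doc in docs:
        words = tokens(doc["body"])
        hits = [name for name, term in filters.items() if term in words]
        for name in hits:
            buckets[name]["doc_count"] += 1
        if not hits:
            other += 1
    if other_bucket_key is not None:
        buckets[other_bucket_key] = {"doc_count": other}
    return buckets

logs = [
    {"body": "warning: page could not be rendered"},
    {"body": "authentication error"},
    {"body": "warning: connection timed out"},
]
named = {"errors": "error", "warnings": "warning"}
print(filters_agg(logs, named, other_bucket_key="other_messages"))
```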
Range Aggregation: group by ranges
POST /bank/_search?size=0
{
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{"to":25},
{"from": 25,"to": 35},
{"from": 35}
]
},
"aggs": {
"bmax": {
"max": {
"field": "balance"
}
}
} } }}
POST /bank/_search?size=0
{
"aggs": {
"age_range": {
"range": {
"field": "age",
"keyed": true,
"ranges": [
{"to":25,"key": "Ld"},
{"from": 25,"to": 35,"key": "Md"},
{"from": 35,"key": "Od"}
]
}
}
}
}
// Specify a key for each group
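The range semantics are easy to get wrong at the boundaries: each range includes its from value and excludes its to value. A small Python sketch of the keyed request above follows, applied to hypothetical ages.

```python
def range_agg(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    result = {}
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        result[r["key"]] = sum(
            1 for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
    return result

ranges = [
    {"to": 25, "key": "Ld"},
    {"from": 25, "to": 35, "key": "Md"},
    {"from": 35, "key": "Od"},
]
# Hypothetical ages chosen to sit on the boundaries.
print(range_agg([24, 25, 34, 35], ranges))
```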
Date Range Aggregation: group by date ranges
POST /sales/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyyy",
"ranges": [
{ "to": "now-10M/M" },
{ "from": "now-10M/M" }
]
}
}
}
}
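Date Histogram Aggregation: time histogram (bar chart) aggregation
Aggregates by day, month, year, and so on. Buckets can use the intervals year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), second (1s), or any other specified time interval.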
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
}
}
}
}
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "90m"
}
}
}
}
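The two requests above differ in how the bucket key is derived: a calendar unit such as month truncates the date to the start of the month, while a fixed interval such as 90m rounds the timestamp down to a multiple of the interval. The Python sketch below assumes timezone-aware UTC timestamps and buckets aligned to the epoch. It lists only non-empty buckets, whereas the real aggregation may also return empty buckets in between.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def fixed_interval_histogram(timestamps, interval):
    """Bucket key = timestamp rounded down to a multiple of `interval` since the epoch."""
    seconds = int(interval.total_seconds())
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        counts[epoch - epoch % seconds] += 1
    return {
        datetime.fromtimestamp(key, tz=timezone.utc): n
        for key, n in sorted(counts.items())
    }

def month_histogram(timestamps):
    """Bucket key = first instant of the timestamp's month."""
    counts = Counter(
        ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        for ts in timestamps
    )
    return dict(sorted(counts.items()))

utc = timezone.utc
# Hypothetical sale dates.
sales = [
    datetime(2015, 1, 1, 0, 30, tzinfo=utc),
    datetime(2015, 1, 1, 1, 29, tzinfo=utc),
    datetime(2015, 1, 1, 1, 30, tzinfo=utc),
    datetime(2015, 2, 3, 12, 0, tzinfo=utc),
]
print(fixed_interval_histogram(sales[:3], timedelta(minutes=90)))
print(month_histogram(sales))
```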
Missing Aggregation: a bucket for missing values
Documents that lack a value for the specified field are collected into one bucket for aggregation.
POST /bank/_search?size=0
{
"aggs" : {
"account_without_a_age" : {
"missing" : { "field" : "age" }
}
}
}
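As an offline illustration, the bucket produced by the request above corresponds to the following Python sketch. It treats a field that is absent or set to None as missing, which is an assumption of the sketch, and the sample documents are hypothetical.

```python
def missing_agg(docs, field):
    """Single bucket holding the documents that have no value for `field`."""
    return {"doc_count": sum(1 for doc in docs if doc.get(field) is None)}

# Hypothetical sample documents.
accounts = [{"age": 24}, {"age": None}, {"name": "no age field"}]
print(missing_agg(accounts, "age"))
```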