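What is aggregation?
Aggregation is the feature Elasticsearch (ES) provides, alongside search, for running statistical analysis over the data stored in ES.
Features:
- Rich functionality: it offers several kinds of analysis, including Bucketing, Metric, Matrix, and Pipeline
- Near real-time: results are computed and returned at query time, whereas Hadoop-style batch analysis is typically T+1, i.e. available the next day
Typical use cases for aggregation:
Count a merchant's orders for each day of a week
Compute the total amount for each day of a month
Put simply: data dashboards on the B2B side and radar-chart statistics on the B2C side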
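Main analysis types in Elasticsearch:
Bucketing: groups documents into buckets, similar to GROUP BY in SQL (see the official documentation)
Metric: computes metrics such as the maximum, minimum, average, and sum (see the official documentation)
Matrix: matrix analysis over several fields, e.g. how far the samples of each field spread from the mean, or the mean of each field (see the official documentation)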
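Hands-on:
Bucket aggregations:
A bucket is a container for documents, and a bucket aggregation is a bucketing strategy; as noted above, it resembles GROUP BY. For example:
documents with age < 20 go into bucket A, 20 <= age < 50 into bucket B, and age >= 50 into bucket C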
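To make the A/B/C strategy concrete, here is a minimal plain-Python sketch. It is not Elasticsearch code, the function names are made up for illustration, and assigning the boundary values 20 and 50 to buckets B and C is an assumption.

```python
def age_bucket(age):
    """Return the label of the bucket an age falls into.

    Assumption: 20 belongs to B and 50 belongs to C.
    """
    if age < 20:
        return "A"
    if age < 50:
        return "B"
    return "C"


def group_by_bucket(ages):
    """Group ages by bucket label, much like GROUP BY on a computed column."""
    buckets = {"A": [], "B": [], "C": []}
    for age in ages:
        buckets[age_bucket(age)].append(age)
    return buckets


if __name__ == "__main__":
    # Hypothetical sample data
    print(group_by_bucket([12, 20, 35, 50, 64]))
```

Every document lands in exactly one bucket, and the aggregation then reports something per bucket, typically the document count.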
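Common bucket aggregations:
Terms, Range, Date Range, Histogram, Date Histogram
Terms
Terms: the simplest bucketing strategy, which creates one bucket per term. On a text field the buckets follow the analyzed tokens (and the field needs fielddata enabled), which is why the request here aggregates on a keyword sub-field
Example: count how often each value of a field occurs in the index
Request: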
GET /my_index1/_search
{
  "size": 0,
  "aggs": {
    "group_by_terms": {
      "terms": {
        "field": "terms.keyword"
      }
    }
  }
}
Response:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "ABcdeFGHIjkhhh",
          "doc_count": 2
        },
        {
          "key": "充电器 ",
          "doc_count": 2
        },
        {
          "key": "ABcdeFGHIjk",
          "doc_count": 1
        }
      ]
    }
  }
}
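What the terms aggregation computes here can be imitated in plain Python. This is only an illustrative sketch: `terms_buckets` is a made-up helper, it assumes one value per document, and the sample values are simply the ones visible in the response above.

```python
from collections import Counter


def terms_buckets(values, size=10):
    """Count identical values and return the top `size` buckets.

    Buckets are sorted by descending count, ties broken by key,
    which mirrors the default ordering of the terms aggregation.
    """
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


if __name__ == "__main__":
    # One keyword value per document, taken from the response above
    docs = ["ABcdeFGHIjkhhh", "充电器 ", "ABcdeFGHIjk", "充电器 ", "ABcdeFGHIjkhhh"]
    for bucket in terms_buckets(docs):
        print(bucket)
```

On a single shard the counts are exact, which is what the zero in doc_count_error_upper_bound indicates.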
Range
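Range: defines buckets by explicit numeric ranges
Example below. Compare the from/to values in the request with the * in the bucket keys of the response: a range without from or to is open-ended on that side. Each range includes its from value and excludes its to value.
Request: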
GET /range/_search
{
  "size": 0,
  "aggs": {
    "range_age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "to": 25
          },
          {
            "from": 25,
            "to": 35
          },
          {
            "from": 35
          }
        ]
      }
    }
  }
}
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 9,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "range_age": {
      "buckets": [
        {
          "key": "*-25.0",
          "to": 25,
          "doc_count": 9
        },
        {
          "key": "25.0-35.0",
          "from": 25,
          "to": 35,
          "doc_count": 0
        },
        {
          "key": "35.0-*",
          "from": 35,
          "doc_count": 0
        }
      ]
    }
  }
}
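The way from/to turn into bucket keys and counts can be sketched in plain Python. The helper names and the sample ages are hypothetical; the rule being illustrated is that from is inclusive, to is exclusive, and a missing bound is shown as *.

```python
def format_bound(value):
    """Render a bound the way the bucket keys above do: '*' when missing."""
    return "*" if value is None else str(float(value))


def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for spec in ranges:
        low, high = spec.get("from"), spec.get("to")
        count = sum(
            1
            for value in values
            if (low is None or value >= low) and (high is None or value < high)
        )
        bucket = {"key": f"{format_bound(low)}-{format_bound(high)}"}
        if low is not None:
            bucket["from"] = low
        if high is not None:
            bucket["to"] = high
        bucket["doc_count"] = count
        buckets.append(bucket)
    return buckets


if __name__ == "__main__":
    ages = [18, 24, 25, 34, 35, 60]  # hypothetical sample data
    ranges = [{"to": 25}, {"from": 25, "to": 35}, {"from": 35}]
    for bucket in range_buckets(ages, ranges):
        print(bucket)
```

Because to is exclusive, an age of exactly 25 is counted in the 25.0-35.0 bucket, not in *-25.0.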
Date Range
Date Range: defines buckets by explicit date ranges
It works as the name suggests, so the example is omitted here; try writing one yourself
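For readers who want a starting point, here is one way the request body might look, built as a Python dict so it can be printed as JSON. The field name `created_at`, the aggregation name `range_date`, and the dates are all hypothetical; `date_range`, `field`, `format`, `ranges`, `from`, and `to` are the aggregation's own parameters.

```python
import json


def date_range_body(field, ranges, date_format="yyyy-MM-dd"):
    """Build a search body holding a single date_range aggregation."""
    return {
        "size": 0,  # only the aggregation result is wanted, no hits
        "aggs": {
            "range_date": {  # hypothetical aggregation name
                "date_range": {
                    "field": field,
                    "format": date_format,
                    "ranges": ranges,
                }
            }
        },
    }


if __name__ == "__main__":
    body = date_range_body(
        "created_at",  # hypothetical date field
        [
            {"to": "2024-01-01"},
            {"from": "2024-01-01", "to": "2024-02-01"},
            {"from": "2024-02-01"},
        ],
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against an index that has such a date field. As with the numeric Range, from is inclusive and to is exclusive.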
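Histogram
Histogram: splits the data into buckets of a fixed interval; each bucket key is a multiple of the interval, and the buckets start at the one containing the smallest document value
extended_bounds is easy to misread: it looks like it restricts the result to min~max, but it does not filter documents. It only forces empty buckets to be generated out to those bounds, which is why the response still contains the buckets for negative ages. To restrict the range, filter the documents with a range query.
Example:
Request: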
GET historgram/_search
{
  "size": 0,
  "aggs": {
    "hist_age": {
      "histogram": {
        "field": "age",
        "interval": 10,
        "extended_bounds": {
          "min": 30,
          "max": 60
        }
      }
    }
  }
}
Response:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "hist_age": {
      "buckets": [
        {
          "key": -20,
          "doc_count": 1
        },
        {
          "key": -10,
          "doc_count": 1
        },
        {
          "key": 0,
          "doc_count": 3
        },
        {
          "key": 10,
          "doc_count": 0
        },
        {
          "key": 20,
          "doc_count": 0
        },
        {
          "key": 30,
          "doc_count": 0
        },
        {
          "key": 40,
          "doc_count": 0
        },
        {
          "key": 50,
          "doc_count": 0
        },
        {
          "key": 60,
          "doc_count": 0
        }
      ]
    }
  }
}
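The bucket keys in this response follow from a simple rule: each value is rounded down to a multiple of the interval, and extended_bounds only widens the span of generated buckets without dropping any document. A plain-Python sketch of that rule follows; the helper is made up, and the five ages are hypothetical values chosen to be consistent with the counts above.

```python
import math


def histogram(values, interval, extended_min=None, extended_max=None):
    """Bucket values by fixed interval and fill empty buckets in between."""

    def key_of(value):
        # Round down to the nearest multiple of the interval
        return math.floor(value / interval) * interval

    counts = {}
    for value in values:
        key = key_of(value)
        counts[key] = counts.get(key, 0) + 1

    # extended bounds add bucket keys; they never remove documents
    edges = list(counts)
    if extended_min is not None:
        edges.append(key_of(extended_min))
    if extended_max is not None:
        edges.append(key_of(extended_max))
    if not edges:
        return []

    buckets = []
    key = min(edges)
    while key <= max(edges):
        buckets.append({"key": key, "doc_count": counts.get(key, 0)})
        key += interval
    return buckets


if __name__ == "__main__":
    ages = [-15, -5, 1, 3, 8]  # hypothetical, consistent with the counts above
    for bucket in histogram(ages, 10, extended_min=30, extended_max=60):
        print(bucket)
```

With these inputs the sketch yields the same keys as the response, from -20 through 60, with documents only in the first three buckets.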
Date Histogram
Date Histogram: a histogram (bar chart) over dates, an aggregation commonly used in time-series analysis
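As a sketch of the "orders per day" use case mentioned earlier, a request body could be built like this. The field name `created_at` and the aggregation name `orders_per_day` are hypothetical. One caveat to check against your version: the interval parameter is called `interval` in older releases and `calendar_interval` from 7.2 on, so the helper lets you pick.

```python
import json


def date_histogram_body(field, interval="day", legacy_interval=False):
    """Build a search body holding a single date_histogram aggregation.

    Assumption: pass legacy_interval=True for clusters older than 7.2,
    which expect "interval" instead of "calendar_interval".
    """
    interval_key = "interval" if legacy_interval else "calendar_interval"
    return {
        "size": 0,  # only the buckets are wanted, no hits
        "aggs": {
            "orders_per_day": {  # hypothetical aggregation name
                "date_histogram": {
                    "field": field,
                    interval_key: interval,
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(date_histogram_body("created_at"), indent=2))
    print(json.dumps(date_histogram_body("created_at", legacy_interval=True), indent=2))
```

Each bucket in the response then covers one calendar day, and a sub-aggregation such as sum can be nested inside it to get the amount per day.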
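Metric aggregations
Metric aggregations fall into two groups, single-value and multi-value:
Single-value analysis, which outputs one result
1. min, max, avg, sum
2. cardinality
Multi-value analysis, which outputs several results
1. stats, extended stats
2. percentile, percentile rank
3. top hits
Two of them are shown with examples further down.
The others are only described briefly here; see the official documentation for details.
min, max, avg, sum: return the minimum / maximum / average / sum of a numeric field
cardinality: the cardinality of a set, i.e. the number of distinct values, similar to COUNT(DISTINCT ...) in SQL; think of it as a deduplicated count (Elasticsearch computes it approximately)
stats, extended stats
stats: returns a set of statistics for a numeric field: min, max, avg, sum, and count
extended stats: extends stats with further statistics such as variance and standard deviation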
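The meaning of these metrics can be shown with a few lines of plain Python. This is an illustration, not how Elasticsearch computes them: the cardinality here is exact while Elasticsearch's is approximate, the variance is assumed to be the population variance, and the sample values are hypothetical.

```python
import math


def cardinality(values):
    """Number of distinct values (an exact count in this sketch)."""
    return len(set(values))


def stats(values):
    """The five numbers the stats aggregation reports."""
    count = len(values)
    total = sum(values)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": total / count,
        "sum": total,
    }


def extended_stats(values):
    """stats plus population variance and standard deviation."""
    result = stats(values)
    variance = sum((v - result["avg"]) ** 2 for v in values) / result["count"]
    result["variance"] = variance
    result["std_deviation"] = math.sqrt(variance)
    return result


if __name__ == "__main__":
    ages = [1, 2, 2, 5]  # hypothetical sample data
    print(cardinality(ages))
    print(stats(ages))
    print(extended_stats(ages))
```

A single-value metric corresponds to one of these numbers, while stats and extended stats return the whole dictionary in one request.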
min
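Returns the minimum value
Request:
GET my_index/_search
{
  "size": 0, // do not return the document list
  "aggs": { // aggregation definitions
    "minCount": { // name of this aggregation, used as the key in the response
      "min": { // aggregation type; max, avg, and sum are used the same way
        "field": "min" // field to aggregate on
      }
    }
  }
}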
Response:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": { // hits
    "total": 3, // number of matching documents
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "minCount": { // keyed by the aggregation name from the request
      "value": 1 // the computed minimum
    }
  }
}
Percentile
Percentile: percentile statistics
Request:
GET test1001/_search
{
  "size": 0,
  "aggs": {
    "per_age": {
      "percentiles": {
        "field": "age",
        "percents": [
          1,
          5,
          25
        ]
      }
    }
  }
}
Response:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "per_age": {
      "values": {
        "1.0": 1.04,
        "5.0": 1.2,
        "25.0": 2
      }
    }
  }
}
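To get a feel for where numbers like 1.04 and 1.2 come from, here is a plain-Python percentile using linear interpolation between sorted values. It is a sketch under assumptions: Elasticsearch uses an approximate algorithm (TDigest by default), and the two ages 1 and 5 are a guess that happens to be consistent with the response above; the actual documents are not shown.

```python
def percentile(values, percent):
    """Percentile by linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (percent / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


if __name__ == "__main__":
    ages = [1, 5]  # assumed values, not taken from the index
    for percent in (1, 5, 25):
        print(percent, round(percentile(ages, percent), 2))
```

With two documents the 25th percentile sits a quarter of the way from the smaller value to the larger one, which is how a result of 2 arises from ages 1 and 5.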